CN115272906A - Video background portrait segmentation model and algorithm based on point rendering - Google Patents

Video background portrait segmentation model and algorithm based on point rendering

Info

Publication number
CN115272906A
Authority
CN
China
Prior art keywords
resolution
low
module
feature map
video background
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210699582.1A
Other languages
Chinese (zh)
Inventor
张笑钦
徐航
林盛
黎敏
周杰
陈熙祥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wenzhou University
Original Assignee
Wenzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wenzhou University filed Critical Wenzhou University
Priority to CN202210699582.1A
Publication of CN115272906A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Human Computer Interaction (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a video background portrait segmentation model and algorithm based on point rendering, relating to the technical field of video image processing. The model comprises a Backbone network module, a multi-scale feature fusion module, a segmentation prediction module and a point-rendering-based neural network module. The Backbone network module performs initial feature extraction on the video background portrait to be processed, the multi-scale feature fusion module performs feature fusion to obtain the detection object, the segmentation prediction module generates a rough mask prediction for the detection object, and the point-rendering-based neural network module performs segmentation prediction to obtain a fine-grained mask.

Description

Video background portrait segmentation model and algorithm based on point rendering
Technical Field
The invention relates to the technical field of video image processing, in particular to a video background portrait segmentation model and an algorithm based on point rendering.
Background
In recent years, portrait segmentation with deep learning models has achieved remarkable success in both performance and accuracy, and has been widely applied in practical scenarios such as mobile phone photography, video surveillance and face recognition. Compared with other semantic segmentation tasks, portrait segmentation places higher demands on both precision and speed.
With the popularization of smartphones and social networking sites (such as Weibo and Facebook), personal photos can be captured and shared more conveniently, and much image editing software now offers a portrait segmentation function that separates the regions corresponding to the head and upper body of a portrait image from the background for subsequent operations such as background replacement, hairstyle change and portrait stylization.
Portrait segmentation is an important research topic in computer vision and can be widely applied in many fields: for example, replacing the background of a video, compositing people into different scenes to create novel applications, or quickly producing creative pictures through detailed image processing.
Therefore, providing a new technical solution that addresses the above problems is an urgent task for those skilled in the art.
Disclosure of Invention
Aiming at the defects in the prior art, the invention provides a video background portrait segmentation model and algorithm based on point rendering that segment effectively and can accurately extract the portrait subject even in complex scenes.
A video background portrait segmentation model based on point rendering, comprising: a Backbone network module, a multi-scale feature fusion module, a segmentation prediction module and a point-rendering-based neural network module.
In the above scheme, the Backbone network module is configured to perform preliminary feature extraction on the video background portrait to be processed, obtaining low-resolution features at multiple scales and high-resolution features at multiple scales.
In the above scheme, the multi-scale feature fusion module is configured to perform feature aggregation on the high-resolution features output by the Backbone network module, perform upsampling feature filling on the low-resolution features, and fuse the upsampled low-resolution features with the aggregated high-resolution features to obtain the detection object.
In the foregoing solution, the segmentation prediction module is configured to generate a rough mask prediction for a detection object output by the multi-scale feature fusion module.
In the above scheme, the neural network module based on point rendering is configured to perform segmentation prediction by using an iterative subdivision algorithm to obtain a fine-grained mask.
In the above scheme, the Backbone network module body comprises a first-stage network, a second-stage network, a third-stage network and a fourth-stage network. The first-stage network comprises a first downsampling unit and a first residual unit; the first downsampling unit is configured to downsample the video background portrait to be processed by a factor of 4, obtaining a low-resolution feature map at 1/4 of the size of the video background portrait to be processed and extracting low-resolution features; the first residual unit is configured to perform multiple convolution operations on the low-resolution feature map output by the first downsampling unit, expanding its number of channels to 18, obtaining a first high-resolution feature map and extracting high-resolution features; the first residual unit employs 4 Bottleneck residual modules. The second-stage network comprises a second downsampling unit and a second residual unit; the second downsampling unit is configured to downsample the video background portrait to be processed by a factor of 8, obtaining a low-resolution feature map at 1/8 of the size of the video background portrait to be processed and extracting low-resolution features; the second residual unit is configured to perform multiple convolution operations on the low-resolution feature map output by the second downsampling unit, expanding its number of channels to 36, obtaining a second high-resolution feature map and extracting high-resolution features; the second residual unit employs 1 BasicBlock residual module. The third-stage network comprises a third downsampling unit and a third residual unit; the third downsampling unit is configured to downsample the video background portrait to be processed by a factor of 16, obtaining a low-resolution feature map at 1/16 of the size of the video background portrait to be processed and extracting low-resolution features; the third residual unit is configured to perform multiple convolution operations on the low-resolution feature map output by the third downsampling unit, expanding its number of channels to 72, obtaining a third high-resolution feature map and extracting high-resolution features; the third residual unit employs 4 BasicBlock residual modules. The fourth-stage network comprises a fourth downsampling unit and a fourth residual unit; the fourth downsampling unit is configured to downsample the video background portrait to be processed by a factor of 32, obtaining a low-resolution feature map at 1/32 of the size of the video background portrait to be processed and extracting low-resolution features; the fourth residual unit is configured to perform multiple convolution operations on the low-resolution feature map output by the fourth downsampling unit, expanding its number of channels to 144, obtaining a fourth high-resolution feature map and extracting high-resolution features; the fourth residual unit employs 3 BasicBlock residual modules.
In the above scheme, the width parameter of the Bottleneck residual module is 64; the Bottleneck residual module comprises a first 1 × 1 convolutional layer, a 3 × 3 convolutional layer and a second 1 × 1 convolutional layer, each of which includes a BN layer and a ReLU function.
In the above scheme, the BasicBlock residual module includes a first 3 × 3 convolutional layer and a second 3 × 3 convolutional layer, and the first 3 × 3 convolutional layer and the second 3 × 3 convolutional layer are followed by a BN layer and a ReLU function.
The invention also provides a video background portrait segmentation algorithm based on point rendering, which is applied to the above video background portrait segmentation model based on point rendering and comprises the following steps:
Step S1: performing preliminary feature extraction on the video background portrait to be processed through the Backbone network module to obtain low-resolution feature maps at multiple scales and high-resolution feature maps at multiple scales;
Step S2: performing feature aggregation on the high-resolution features output by the Backbone network module through the multi-scale feature fusion module, performing upsampling feature filling on the low-resolution features, and fusing the upsampled low-resolution features with the aggregated high-resolution features to obtain the detection object;
Step S3: generating a rough mask prediction for the detection object output by the multi-scale feature fusion module through the segmentation prediction module;
Step S4: performing segmentation prediction with an iterative subdivision algorithm through the point-rendering-based neural network module to obtain a fine-grained mask.
In the foregoing solution, performing feature aggregation on the high-resolution features output by the Backbone network module through the multi-scale feature fusion module, performing upsampling feature filling on the low-resolution features, and fusing the upsampled low-resolution features with the aggregated high-resolution features to obtain the detection object comprises:
Step S21: performing a downsampling operation on each high-resolution feature map output by the Backbone network module using one or more 3 × 3 convolutional layers with a stride of 2, so that each high-resolution feature map has the same size as its corresponding low-resolution feature map, and aggregating the features of the downsampled high-resolution feature maps;
Step S22: performing an upsampling operation on each low-resolution feature map output by the Backbone network module using one or more 3 × 3 convolutional layers, so that each low-resolution feature map has the same size as its corresponding high-resolution feature map, applying an identity mapping to the upsampled low-resolution feature map through a 1 × 1 convolutional layer, and fusing its features with the aggregated high-resolution features along the feature channels;
Step S23: outputting the fusion result.
In the foregoing solution, generating a rough mask prediction for the detection object output by the multi-scale feature fusion module through the segmentation prediction module comprises:
Step S31: selecting a set of points from the detection object output by the multi-scale feature fusion module through a PointRend network;
Step S32: making an independent prediction for each point through a small multilayer perceptron;
Step S33: iterating the independent prediction process through a subdivision mask rendering algorithm, and roughly predicting the mask of the detection object through a fine-grained image recognition algorithm.
In the foregoing solution, obtaining a fine-grained mask by performing segmentation prediction with an iterative subdivision algorithm through the point-rendering-based neural network module comprises:
Step S41: inputting the fusion result output by the multi-scale feature fusion module into the PointRend network;
Step S42: taking the mask output by the segmentation prediction module as the coarse prediction, and performing point-wise segmentation prediction at positions adaptively selected by the iterative subdivision algorithm through the PointRend network to obtain a fine-grained mask.
In conclusion, the beneficial effects of the invention are as follows: a point-rendering-based video background portrait segmentation model comprising a Backbone network module, a multi-scale feature fusion module, a segmentation prediction module and a point-rendering-based neural network module is built with the PyTorch deep learning framework. The model generates masks with fine boundary details for portrait videos and pictures, so that the portrait subject can be segmented rapidly and accurately, the video can be further processed, and an excellent visual effect is achieved.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate an embodiment of the invention and, together with the description, serve to explain the invention and not to limit the invention.
Fig. 1 is a schematic diagram of a network structure of a video background portrait segmentation model based on point rendering in the present invention.
Fig. 2 is a schematic diagram of a network structure of the multi-scale feature fusion module according to the present invention.
FIG. 3 is a schematic diagram of the network structure of the BasicBlock residual module and the Bottleneck residual module in the present invention.
FIG. 4 is a step diagram of the video background portrait segmentation algorithm based on point rendering according to the present invention.
FIG. 5 is a diagram illustrating the steps of feature fusion performed by the multi-scale feature fusion module according to the present invention.
FIG. 6 is a diagram illustrating the steps of coarse mask prediction performed by the segmentation prediction module according to the present invention.
FIG. 7 is a diagram of the steps of segmentation prediction by a neural network module based on point rendering according to the present invention.
FIG. 8 is a diagram illustrating the effect of segmenting the human image of the video image in different scenes according to the present invention.
Detailed Description
In order to make the objects and advantages of the technical solutions of the present invention clearer, the technical solutions implemented by the present invention will be clearly and completely described below with reference to the accompanying drawings of specific embodiments of the present invention. Like reference symbols in the various drawings indicate like elements. It should be noted that the described embodiments are part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the described embodiments of the invention without inventive step, are within the scope of protection of the invention.
As shown in fig. 1 and fig. 2, a video background portrait segmentation model based on point rendering according to the present invention includes: a Backbone network module, a multi-scale feature fusion module, a segmentation prediction module and a point-rendering-based neural network module.
The connection relationship between the above modules of the present invention will be further described in detail with reference to the accompanying drawings.
The Backbone network module is used for performing preliminary feature extraction on the video background portrait to be processed to obtain low-resolution features at multiple scales and high-resolution features at multiple scales; the multi-scale feature fusion module is used for performing feature aggregation on the high-resolution features output by the Backbone network module, performing upsampling feature filling on the low-resolution features, and fusing the upsampled low-resolution features with the aggregated high-resolution features to obtain the detection object; the segmentation prediction module is used for generating a rough mask prediction for the detection object output by the multi-scale feature fusion module; the point-rendering-based neural network module is used for performing segmentation prediction with an iterative subdivision algorithm to obtain a fine-grained mask.
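Before detailing each module, the overall data flow can be made concrete with a minimal PyTorch sketch. This is a schematic of how the four modules connect, not the patent's actual implementation; the class name PointRendPortraitSegmenter and the sub-module interfaces are hypothetical, and the individual sub-modules are sketched later in this section.

import torch.nn as nn

class PointRendPortraitSegmenter(nn.Module):
    """Schematic wiring: Backbone -> fusion -> coarse mask -> point refinement."""
    def __init__(self, backbone, fusion, seg_head, point_refiner):
        super().__init__()
        self.backbone = backbone            # multi-scale feature extraction
        self.fusion = fusion                # multi-scale feature fusion
        self.seg_head = seg_head            # rough mask prediction
        self.point_refiner = point_refiner  # point-rendering refinement

    def forward(self, frame):
        feats = self.backbone(frame)            # low- and high-resolution features
        fused, fine_map = self.fusion(feats)    # detection object + fine features
        coarse_mask = self.seg_head(fused)      # rough mask prediction
        return self.point_refiner(coarse_mask, fine_map)  # fine-grained mask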
In this embodiment, the video background portrait segmentation model based on point rendering is built with the PyTorch deep learning framework.
In this embodiment, the Backbone network module obtains semantic information and detail information that enhance the portrait features, as well as the position information of the person target against the video background.
In this embodiment, the multi-scale feature fusion module combines the semantic information, detail information and position information of the person target, and can detect objects at different scales.
Further, the Backbone network module body comprises a first-stage network, a second-stage network, a third-stage network and a fourth-stage network. The first-stage network comprises a first downsampling unit and a first residual unit; the first downsampling unit is configured to downsample the video background portrait to be processed by a factor of 4, obtaining a low-resolution feature map at 1/4 of the size of the video background portrait to be processed and extracting low-resolution features; the first residual unit is configured to perform multiple convolution operations on the low-resolution feature map output by the first downsampling unit, expanding its number of channels to 18, obtaining a first high-resolution feature map and extracting high-resolution features; the first residual unit employs 4 Bottleneck residual modules. The second-stage network comprises a second downsampling unit and a second residual unit; the second downsampling unit is configured to downsample the video background portrait to be processed by a factor of 8, obtaining a low-resolution feature map at 1/8 of the size of the video background portrait to be processed and extracting low-resolution features; the second residual unit is configured to perform multiple convolution operations on the low-resolution feature map output by the second downsampling unit, expanding its number of channels to 36, obtaining a second high-resolution feature map and extracting high-resolution features; the second residual unit employs 1 BasicBlock residual module. The third-stage network comprises a third downsampling unit and a third residual unit; the third downsampling unit is configured to downsample the video background portrait to be processed by a factor of 16, obtaining a low-resolution feature map at 1/16 of the size of the video background portrait to be processed and extracting low-resolution features; the third residual unit is configured to perform multiple convolution operations on the low-resolution feature map output by the third downsampling unit, expanding its number of channels to 72, obtaining a third high-resolution feature map and extracting high-resolution features; the third residual unit employs 4 BasicBlock residual modules. The fourth-stage network comprises a fourth downsampling unit and a fourth residual unit; the fourth downsampling unit is configured to downsample the video background portrait to be processed by a factor of 32, obtaining a low-resolution feature map at 1/32 of the size of the video background portrait to be processed and extracting low-resolution features; the fourth residual unit is configured to perform multiple convolution operations on the low-resolution feature map output by the fourth downsampling unit, expanding its number of channels to 144, obtaining a fourth high-resolution feature map and extracting high-resolution features; the fourth residual unit employs 3 BasicBlock residual modules.
In this embodiment, the output channels of the first-stage, second-stage, third-stage and fourth-stage networks are C, 2C, 4C and 8C respectively (C = 18, matching the channel counts 18, 36, 72 and 144 above).
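To make the stage structure concrete, the sketch below treats each downsampling unit as a single strided 3 × 3 convolution applied to the previous stage's output. This is a simplifying assumption (the patent only states the overall downsampling factors of 4, 8, 16 and 32), and make_stage is a hypothetical helper that uses the residual modules sketched just below.

import torch.nn as nn

def make_stage(in_ch, out_ch, stride, num_blocks, block):
    """One backbone stage: a strided 3x3 downsampling unit followed by a
    stack of residual modules operating at the expanded channel count."""
    layers = [nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False),
              nn.BatchNorm2d(out_ch),
              nn.ReLU(inplace=True)]
    layers += [block(out_ch) for _ in range(num_blocks)]
    return nn.Sequential(*layers)

# With C = 18, the stage widths are 18, 36, 72 and 144 and the residual
# unit depths 4, 1, 4 and 3, e.g.:
# stage2 = make_stage(18, 36, stride=2, num_blocks=1, block=BasicBlock)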
As shown in fig. 3, the width parameter of the Bottleneck residual module is 64; the Bottleneck residual module includes a first 1 × 1 convolutional layer, a 3 × 3 convolutional layer and a second 1 × 1 convolutional layer, each of which includes a BN layer and a ReLU function.
Further, the BasicBlock residual module includes a first 3 × 3 convolutional layer and a second 3 × 3 convolutional layer, each followed by a BN layer and a ReLU function.
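As a concrete illustration, a minimal PyTorch sketch of the two residual modules under the stated configuration (Bottleneck width 64, BN and ReLU attached to each convolution) follows. The channel handling on the skip path is an assumption, since the patent does not publish source code.

import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Two 3x3 convolutions, each followed by BN and ReLU, with a skip connection."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)  # residual (identity) connection

class Bottleneck(nn.Module):
    """1x1 -> 3x3 -> 1x1 convolutions with BN and ReLU, width parameter 64."""
    def __init__(self, in_channels, width=64, out_channels=None):
        super().__init__()
        out_channels = out_channels or in_channels
        self.reduce = nn.Sequential(
            nn.Conv2d(in_channels, width, 1, bias=False),
            nn.BatchNorm2d(width), nn.ReLU(inplace=True))
        self.conv = nn.Sequential(
            nn.Conv2d(width, width, 3, padding=1, bias=False),
            nn.BatchNorm2d(width), nn.ReLU(inplace=True))
        self.expand = nn.Sequential(
            nn.Conv2d(width, out_channels, 1, bias=False),
            nn.BatchNorm2d(out_channels))
        # 1x1 projection on the skip path when channel counts differ (assumption)
        self.proj = (nn.Identity() if in_channels == out_channels
                     else nn.Conv2d(in_channels, out_channels, 1, bias=False))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.expand(self.conv(self.reduce(x)))
        return self.relu(out + self.proj(x))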
As shown in fig. 4, the present invention further provides a point-rendering-based video background portrait segmentation algorithm, which is applied to the point-rendering-based video background portrait segmentation model described above and includes:
Step S1: performing preliminary feature extraction on the video background portrait to be processed through the Backbone network module to obtain low-resolution feature maps at multiple scales and high-resolution feature maps at multiple scales;
Step S2: performing feature aggregation on the high-resolution features output by the Backbone network module through the multi-scale feature fusion module, performing upsampling feature filling on the low-resolution features, and fusing the upsampled low-resolution features with the aggregated high-resolution features to obtain the detection object;
Step S3: generating a rough mask prediction for the detection object output by the multi-scale feature fusion module through the segmentation prediction module;
Step S4: performing segmentation prediction with an iterative subdivision algorithm through the point-rendering-based neural network module to obtain a fine-grained mask.
As shown in fig. 5, performing feature aggregation on the high-resolution features output by the Backbone network module through the multi-scale feature fusion module, performing upsampling feature filling on the low-resolution features, and fusing the upsampled low-resolution features with the aggregated high-resolution features to obtain the detection object comprises the following steps (a code sketch follows them):
Step S21: performing a downsampling operation on each high-resolution feature map output by the Backbone network module using one or more 3 × 3 convolutional layers with a stride of 2, so that each high-resolution feature map has the same size as its corresponding low-resolution feature map, and aggregating the features of the downsampled high-resolution feature maps;
Step S22: performing an upsampling operation on each low-resolution feature map output by the Backbone network module using one or more 3 × 3 convolutional layers, so that each low-resolution feature map has the same size as its corresponding high-resolution feature map, applying an identity mapping to the upsampled low-resolution feature map through a 1 × 1 convolutional layer, and fusing its features with the aggregated high-resolution features along the feature channels;
Step S23: outputting the fusion result.
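To illustrate steps S21 to S23 for one pair of high-resolution and low-resolution branches, the sketch below assumes bilinear upsampling ahead of the 1 × 1 identity-mapping convolution and channel-wise concatenation for the final fusion; both are plausible readings of the description rather than confirmed details, and the class name is hypothetical.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FuseTwoBranches(nn.Module):
    """Fuse one high-resolution and one low-resolution feature map."""
    def __init__(self, hi_ch, lo_ch):
        super().__init__()
        # S21: stride-2 3x3 conv brings the high-res map down to the low-res size
        self.down = nn.Conv2d(hi_ch, lo_ch, 3, stride=2, padding=1, bias=False)
        # S22: 1x1 conv applies the identity mapping to the upsampled low-res map
        self.up_proj = nn.Conv2d(lo_ch, hi_ch, 1, bias=False)

    def forward(self, hi, lo):
        # S21: aggregate downsampled high-res features into the low-res map
        lo_out = lo + self.down(hi)
        # S22: upsample the low-res map to high-res size, project, and fuse
        up = F.interpolate(lo, size=hi.shape[-2:], mode="bilinear",
                           align_corners=False)
        hi_out = torch.cat([hi, self.up_proj(up)], dim=1)  # fuse along channels
        return hi_out, lo_out  # S23: output the fusion results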
As shown in fig. 6, generating a rough mask prediction for the detection object output by the multi-scale feature fusion module through the segmentation prediction module comprises:
Step S31: selecting a set of points from the detection object output by the multi-scale feature fusion module through a PointRend network;
Step S32: making an independent prediction for each point through a small multilayer perceptron;
Step S33: iterating the independent prediction process through a subdivision mask rendering algorithm, and roughly predicting the mask of the detection object through a fine-grained image recognition algorithm.
In this embodiment, the small multilayer perceptron (MLP) predicts the coarse mask from fine-grained feature maps, taking as input the interpolated features computed at the selected points; the coarse-mask features enable the small MLP to make different predictions at a single point that is contained in two or more boxes, and during this process a subdivision mask rendering algorithm is applied iteratively to refine the uncertain regions of the prediction mask.
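A point head in this spirit can be written as a small MLP shared across points; the hidden width, the layer count and the use of 1 × 1 Conv1d layers over a (batch, channels, points) tensor are illustrative assumptions.

import torch
import torch.nn as nn

class PointHead(nn.Module):
    """Small MLP that predicts a mask label per sampled point from the
    concatenation of fine-grained features and coarse mask logits."""
    def __init__(self, fine_ch, coarse_ch, hidden=256, num_layers=3):
        super().__init__()
        layers, ch = [], fine_ch + coarse_ch
        for _ in range(num_layers):
            # a 1x1 conv over a (B, C, N) point tensor acts as a shared per-point MLP
            layers += [nn.Conv1d(ch, hidden, 1), nn.ReLU(inplace=True)]
            ch = hidden
        layers.append(nn.Conv1d(ch, coarse_ch, 1))  # per-point mask logits
        self.mlp = nn.Sequential(*layers)

    def forward(self, fine_feats, coarse_feats):
        # fine_feats: (B, fine_ch, N); coarse_feats: (B, coarse_ch, N)
        return self.mlp(torch.cat([fine_feats, coarse_feats], dim=1))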
As shown in fig. 7, obtaining the fine-grained mask by performing segmentation prediction with an iterative subdivision algorithm through the point-rendering-based neural network module comprises:
Step S41: inputting the fusion result output by the multi-scale feature fusion module into the PointRend network;
Step S42: taking the mask output by the segmentation prediction module as the coarse prediction, and performing point-wise segmentation prediction at positions adaptively selected by the iterative subdivision algorithm through the PointRend network to obtain a fine-grained mask.
In this embodiment, the point-rendering-based neural network module selects a small number of real-valued points for prediction, avoiding excessive computation over all pixels of the high-resolution output grid, and extracts a point-wise feature representation for each selected point. The feature of a real-valued point is computed by bilinear interpolation over the feature map, using the 4 nearest neighbors of the point on the regular grid of the feature map; as a result, it can predict segmentations at a higher resolution than the feature map by exploiting the sub-pixel information encoded in the channel dimensions of the feature map.
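The bilinear interpolation over the 4 nearest grid neighbors maps directly onto torch.nn.functional.grid_sample; a small helper (hypothetical name point_sample) might look like this.

import torch.nn.functional as F

def point_sample(feature_map, points):
    """Sample per-point features by bilinear interpolation.
    feature_map: (B, C, H, W); points: (B, N, 2) with (x, y) coords in [0, 1].
    Returns (B, C, N) point-wise feature representations."""
    grid = 2.0 * points.unsqueeze(2) - 1.0  # map [0, 1] to grid_sample's [-1, 1]
    out = F.grid_sample(feature_map, grid, mode="bilinear", align_corners=False)
    return out.squeeze(3)                   # (B, C, N, 1) -> (B, C, N)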
In this embodiment, in each iteration of the iterative subdivision algorithm, the PointRend network upsamples the previously predicted segmentation result with bilinear interpolation, then selects the N most uncertain points on this denser grid (for binary prediction, for example, points whose predicted probability is close to 0.5), computes a point-wise feature representation for each of the N points, and predicts the label of each point. This process is repeated until the segmentation is upsampled to the required resolution.
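Putting the pieces together, one resolution-doubling refinement loop consistent with this description might be sketched as follows; point_sample and the point head come from the sketches above, and num_points and steps are illustrative values, not parameters stated in the patent.

import torch
import torch.nn.functional as F

@torch.no_grad()
def iterative_subdivision(coarse_logits, fine_map, point_head,
                          num_points=2048, steps=3):
    """Refine a coarse mask: each step doubles the resolution by bilinear
    upsampling, then re-predicts only the N most uncertain points."""
    logits = coarse_logits                                    # (B, 1, h, w)
    for _ in range(steps):
        logits = F.interpolate(logits, scale_factor=2, mode="bilinear",
                               align_corners=False)
        B, _, H, W = logits.shape
        # uncertainty for binary prediction: probability closest to 0.5
        prob = logits.sigmoid().view(B, -1)
        idx = (-(prob - 0.5).abs()).topk(min(num_points, H * W), dim=1).indices
        # flat indices -> normalized (x, y) coordinates of pixel centers
        xs = ((idx % W).float() + 0.5) / W
        ys = (torch.div(idx, W, rounding_mode="floor").float() + 0.5) / H
        points = torch.stack([xs, ys], dim=2)                 # (B, N, 2)
        # re-predict labels only at the selected points
        new_logits = point_head(point_sample(fine_map, points),
                                point_sample(logits, points)) # (B, 1, N)
        logits = logits.view(B, 1, -1).scatter(
            2, idx.unsqueeze(1), new_logits).view(B, 1, H, W)
    return logits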
As shown in fig. 8, which illustrates the effect of segmenting the portrait of a video image in different scenes, the scheme provided by the present invention accurately segments the portrait subject in a variety of complex scenes and refines its edges.
The above description is only a preferred embodiment of the present invention and is not intended to limit it; those skilled in the art may make various modifications and changes to the embodiment. Any modification, equivalent replacement or improvement made within the spirit and principle of the present invention shall be included in the protection scope of the present invention.

Claims (8)

1. A video background portrait segmentation model based on point rendering is characterized by comprising:
the system comprises a backhaul network module, a multi-scale feature fusion module, a segmentation prediction module and a point rendering-based neural network module;
the Backbone network module is used for performing primary feature extraction on the video background portrait to be processed to obtain low-resolution features of multiple scales and high-resolution features of multiple scales;
the multi-scale feature fusion module is used for performing feature aggregation on the high-resolution features output by the Backbone network module, performing up-sampling feature filling on the low-resolution features, and fusing the low-resolution features subjected to up-sampling processing and the high-resolution features subjected to feature aggregation to obtain a detection object;
the segmentation prediction module is used for generating a rough mask prediction for the detection object output by the multi-scale feature fusion module;
the neural network module based on the point rendering is used for carrying out segmentation prediction by adopting an iterative subdivision algorithm to obtain a mask with fine granularity.
2. The video background portrait segmentation model based on point rendering according to claim 1, wherein the Backbone network module body includes a first-stage network, a second-stage network, a third-stage network and a fourth-stage network; the first-stage network includes a first downsampling unit and a first residual unit, the first downsampling unit is configured to downsample the video background portrait to be processed by a factor of 4, obtaining a low-resolution feature map at 1/4 of the size of the video background portrait to be processed and extracting low-resolution features, the first residual unit is configured to perform multiple convolution operations on the low-resolution feature map output by the first downsampling unit, expanding its number of channels to 18, obtaining a first high-resolution feature map and extracting high-resolution features, and the first residual unit employs 4 Bottleneck residual modules; the second-stage network includes a second downsampling unit and a second residual unit, the second downsampling unit is configured to downsample the video background portrait to be processed by a factor of 8, obtaining a low-resolution feature map at 1/8 of the size of the video background portrait to be processed and extracting low-resolution features, the second residual unit is configured to perform multiple convolution operations on the low-resolution feature map output by the second downsampling unit, expanding its number of channels to 36, obtaining a second high-resolution feature map and extracting high-resolution features, and the second residual unit employs 1 BasicBlock residual module; the third-stage network includes a third downsampling unit and a third residual unit, the third downsampling unit is configured to downsample the video background portrait to be processed by a factor of 16, obtaining a low-resolution feature map at 1/16 of the size of the video background portrait to be processed and extracting low-resolution features, the third residual unit is configured to perform multiple convolution operations on the low-resolution feature map output by the third downsampling unit, expanding its number of channels to 72, obtaining a third high-resolution feature map and extracting high-resolution features, and the third residual unit employs 4 BasicBlock residual modules; the fourth-stage network includes a fourth downsampling unit and a fourth residual unit, the fourth downsampling unit is configured to downsample the video background portrait to be processed by a factor of 32, obtaining a low-resolution feature map at 1/32 of the size of the video background portrait to be processed and extracting low-resolution features, the fourth residual unit is configured to perform multiple convolution operations on the low-resolution feature map output by the fourth downsampling unit, expanding its number of channels to 144, obtaining a fourth high-resolution feature map and extracting high-resolution features, and the fourth residual unit employs 3 BasicBlock residual modules.
3. The point-rendering-based video background portrait segmentation model of claim 1, wherein the width parameter of the Bottleneck residual module is 64, the Bottleneck residual module includes a first 1 × 1 convolutional layer, a 3 × 3 convolutional layer and a second 1 × 1 convolutional layer, and each of these convolutional layers includes a BN layer and a ReLU function.
4. The point-rendering-based video background portrait segmentation model of claim 1, wherein the BasicBlock residual module includes a first 3 × 3 convolutional layer and a second 3 × 3 convolutional layer, each followed by a BN layer and a ReLU function.
5. A point-rendering-based video background portrait segmentation algorithm, applied to the point-rendering-based video background portrait segmentation model of any one of claims 1 to 4, comprising:
Step S1: performing preliminary feature extraction on the video background portrait to be processed through the Backbone network module to obtain low-resolution feature maps at multiple scales and high-resolution feature maps at multiple scales;
Step S2: performing feature aggregation on the high-resolution features output by the Backbone network module through the multi-scale feature fusion module, performing upsampling feature filling on the low-resolution features, and fusing the upsampled low-resolution features with the aggregated high-resolution features to obtain the detection object;
Step S3: generating a rough mask prediction for the detection object output by the multi-scale feature fusion module through the segmentation prediction module;
Step S4: performing segmentation prediction with an iterative subdivision algorithm through the point-rendering-based neural network module to obtain a fine-grained mask.
6. The video background portrait segmentation algorithm based on point rendering according to claim 5, wherein performing feature aggregation on the high-resolution features output by the Backbone network module through the multi-scale feature fusion module, performing upsampling feature filling on the low-resolution features, and fusing the upsampled low-resolution features with the aggregated high-resolution features to obtain the detection object comprises:
Step S21: performing a downsampling operation on each high-resolution feature map output by the Backbone network module using one or more 3 × 3 convolutional layers with a stride of 2, so that each high-resolution feature map has the same size as its corresponding low-resolution feature map, and aggregating the features of the downsampled high-resolution feature maps;
Step S22: performing an upsampling operation on each low-resolution feature map output by the Backbone network module using one or more 3 × 3 convolutional layers, so that each low-resolution feature map has the same size as its corresponding high-resolution feature map, applying an identity mapping to the upsampled low-resolution feature map through a 1 × 1 convolutional layer, and fusing its features with the aggregated high-resolution features along the feature channels;
Step S23: outputting the fusion result.
7. The point-rendering-based video background portrait segmentation algorithm of claim 5, wherein generating a rough mask prediction for the detection object output by the multi-scale feature fusion module through the segmentation prediction module comprises:
Step S31: selecting a set of points from the detection object output by the multi-scale feature fusion module through a PointRend network;
Step S32: making an independent prediction for each point through a small multilayer perceptron;
Step S33: iterating the independent prediction process through a subdivision mask rendering algorithm, and roughly predicting the mask of the detection object through a fine-grained image recognition algorithm.
8. The point-rendering-based video background portrait segmentation algorithm of claim 5, wherein obtaining the fine-grained mask by performing segmentation prediction with an iterative subdivision algorithm through the point-rendering-based neural network module comprises:
Step S41: inputting the fusion result output by the multi-scale feature fusion module into the PointRend network;
Step S42: taking the mask output by the segmentation prediction module as the coarse prediction, and performing point-wise segmentation prediction at positions adaptively selected by the iterative subdivision algorithm through the PointRend network to obtain a fine-grained mask.
CN202210699582.1A 2022-06-20 2022-06-20 Video background portrait segmentation model and algorithm based on point rendering Pending CN115272906A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210699582.1A CN115272906A (en) 2022-06-20 2022-06-20 Video background portrait segmentation model and algorithm based on point rendering

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210699582.1A CN115272906A (en) 2022-06-20 2022-06-20 Video background portrait segmentation model and algorithm based on point rendering

Publications (1)

Publication Number Publication Date
CN115272906A true CN115272906A (en) 2022-11-01

Family

ID=83761306

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210699582.1A Pending CN115272906A (en) 2022-06-20 2022-06-20 Video background portrait segmentation model and algorithm based on point rendering

Country Status (1)

Country Link
CN (1) CN115272906A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116109753A (en) * 2023-04-12 2023-05-12 深圳原世界科技有限公司 Three-dimensional cloud rendering engine platform and data processing method


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination