CN112950640A - Video portrait segmentation method and device, electronic equipment and storage medium - Google Patents

Video portrait segmentation method and device, electronic equipment and storage medium

Info

Publication number
CN112950640A
CN112950640A (application number CN202110203277.4A)
Authority
CN
China
Prior art keywords
video
feature map
frame
portrait
video frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110203277.4A
Other languages
Chinese (zh)
Inventor
刘钰安
杨统
郭彦东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Oppo Mobile Telecommunications Corp Ltd
Original Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Oppo Mobile Telecommunications Corp Ltd filed Critical Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority to CN202110203277.4A priority Critical patent/CN112950640A/en
Publication of CN112950640A publication Critical patent/CN112950640A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/10 Segmentation; Edge detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 3/00 Geometric image transformation in the plane of the image
    • G06T 3/40 Scaling the whole image or part thereof
    • G06T 3/4038 Scaling the whole image or part thereof for image mosaicing, i.e. plane images composed of plane sub-images
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10016 Video; Image sequence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20084 Artificial neural networks [ANN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/30 Subject of image; Context of image processing
    • G06T 2207/30196 Human being; Person

Abstract

The application relates to a video portrait segmentation method and apparatus, an electronic device, and a storage medium, and belongs to the technical field of computer vision. The method comprises the following steps: inputting a video file to be processed into a portrait segmentation model; determining a first feature map of a first video frame in the video file through the portrait segmentation model, wherein the first video frame is a video frame other than the first two frames in the video file to be processed; determining second feature maps of a plurality of reference frames and a weight of each second feature map, wherein the plurality of reference frames are video frames located before the first video frame in the video file; and performing portrait segmentation on the first video frame based on the second feature maps of the plurality of reference frames, the weight of each second feature map, and the first feature map. According to this scheme, the portrait of the first video frame is segmented by combining the feature maps and weights of multiple reference frames, which can improve the video portrait segmentation effect.

Description

Video portrait segmentation method and device, electronic equipment and storage medium
Technical Field
The embodiments of the application relate to the technical field of computer vision, and in particular to a video portrait segmentation method and apparatus, an electronic device, and a storage medium.
Background
Image segmentation is a fundamental topic in the field of computer vision, and video portrait segmentation is an important application of image segmentation. Video portrait segmentation is required in the photo albums of terminals such as mobile phones; for example, for video portrait blurring or video background replacement, portrait segmentation must first be completed through a video portrait segmentation technique before the portrait blurring or background replacement is performed.
Disclosure of Invention
The embodiment of the application provides a video portrait segmentation method and device, electronic equipment and a storage medium, which can improve the portrait segmentation effect. The technical scheme is as follows:
in one aspect, a video portrait segmentation method is provided, and the method includes:
inputting a video file to be processed into a portrait segmentation model;
determining a first feature map of a first video frame in the video file through the portrait segmentation model, wherein the first video frame is a video frame except the first two frames in the video file to be processed;
determining second feature maps of a plurality of reference frames and a weight of each second feature map, wherein the plurality of reference frames are video frames which are positioned before the first video frame in the video file;
and performing portrait segmentation on the first video frame based on the second feature maps of the plurality of reference frames, the weight of each second feature map and the first feature map.
In another aspect, there is provided a video portrait segmentation apparatus, the apparatus comprising:
the input module is used for inputting the video file to be processed into the portrait segmentation model;
the first determining module is used for determining a first feature map of a first video frame in the video file through the portrait segmentation model, wherein the first video frame is a video frame except the first two frames in the video file to be processed;
a second determining module, configured to determine second feature maps of multiple reference frames and a weight of each second feature map, where the multiple reference frames are video frames located before the first video frame in the video file;
and the portrait segmentation module is used for segmenting the portrait of the first video frame based on the second feature maps of the plurality of reference frames, the weight of each second feature map and the first feature map.
In another aspect, an electronic device is provided, the electronic device comprising a processor and a memory; the memory stores at least one program code for execution by the processor to implement the video portrait segmentation method as described in the above aspect.
In another aspect, a computer readable storage medium is provided, the storage medium storing at least one program code for execution by a processor to implement the video portrait segmentation method according to the above aspect.
In another aspect, a computer program product is provided; when program code in the computer program product is executed by a processor of an electronic device, the electronic device is enabled to perform the video portrait segmentation method described in any of the above possible implementations.
In the embodiment of the application, the feature maps and the weights of a plurality of reference frames before a first video frame are obtained, and the weight of any reference frame is used for representing the influence degree of the reference frame on the first video frame, so that different weights are given to the reference frames with different influence degrees, and therefore the human image of the first video frame is segmented by combining the feature maps and the weights of the plurality of reference frames, and the video human image segmentation effect can be improved.
Drawings
FIG. 1 illustrates a schematic diagram of an implementation environment provided by an exemplary embodiment of the present application;
FIG. 2 illustrates a schematic structural diagram of an electronic device provided in an exemplary embodiment of the present application;
FIG. 3 illustrates a schematic diagram of an overall framework provided by an exemplary embodiment of the present application;
FIG. 4 illustrates a flow chart of a video portrait segmentation method shown in an exemplary embodiment of the present application;
FIG. 5 illustrates a schematic diagram of a portrait segmentation model shown in an exemplary embodiment of the present application;
FIG. 6 illustrates a schematic diagram of a channel grouping attention module shown in an exemplary embodiment of the present application;
FIG. 7 illustrates a schematic diagram of a spatial prior attention module shown in an exemplary embodiment of the present application;
FIG. 8 illustrates a schematic diagram of a dense hole feature pyramid module shown in an exemplary embodiment of the present application;
FIG. 9 illustrates a flow chart of a video portrait segmentation method shown in an exemplary embodiment of the present application;
FIG. 10 shows a block diagram of a video portrait segmentation apparatus according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
Reference herein to "a plurality" means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship.
The video portrait segmentation method provided by the embodiment of the application is applied to the electronic device 100. In one possible implementation, the electronic device 100 is a terminal. In another possible implementation, the electronic device 100 is a server. In another possible implementation, the electronic device 100 includes a terminal and a server.
Fig. 1 is a schematic diagram of an implementation environment provided by an embodiment of the present application. Referring to fig. 1, the implementation environment includes a terminal and a server. The terminal and the server are connected through a wireless or wired network. A target application served by the server is installed on the terminal, and through the target application the terminal can realize functions such as data transmission and message interaction. The target application is an application with a video portrait segmentation function; for example, the target application is an album application, an image processing application, an ordinary camera application, a beauty camera application, or the like.
In a possible implementation manner, a terminal sends a video file to be processed to a server, the server performs video portrait segmentation on the video file through a portrait segmentation model, and a result obtained by the segmentation is returned to the terminal and displayed by the terminal.
In a possible implementation mode, the server trains the portrait segmentation model, the portrait segmentation model is issued to the terminal, the terminal performs video portrait segmentation on a video file to be processed through the portrait segmentation model, and a result obtained by the segmentation is displayed.
The video portrait segmentation method provided by the embodiment of the application can be applied to the following scenes:
for example, in a background replacement scenario.
When the background of a video file to be processed is replaced, the portrait mask of the video file is determined by the video portrait segmentation method provided by the embodiment of the application, and then the portrait mask and the target background are synthesized into a new video file.
As another example, in the context of video recording.
When a user records a video, the terminal determines a portrait mask of each frame of the video by adopting the video portrait segmentation method provided by the embodiment of the application, and then performs facial beautification on the portrait mask, so as to obtain a video file with the facial beautification effect.
As another example, the method is applied to a scene in which a video file is beautifully processed.
When a user beautifies a video file, the terminal determines a portrait mask of the video file by adopting the video portrait segmentation method provided by the embodiment of the application, and then beautifies the portrait mask, so that the video file with the beautification effect is obtained.
It should be noted that, in the embodiment of the present application, the three scenes are taken as examples, the video portrait segmentation scene is exemplarily described, and no limitation is imposed on the video portrait segmentation scene.
Referring to fig. 2, a block diagram of an electronic device 100 according to an exemplary embodiment of the present application is shown. The electronic device 100 may be a terminal having an image processing function, such as a smartphone or a tablet computer. The electronic device 100 in the present application may include one or more of the following components: processor 110, memory 120, display 130.
Processor 110 may include one or more processing cores. The processor 110 connects various parts within the overall electronic device 100 using various interfaces and lines, and performs various functions of the electronic device 100 and processes data by executing or executing instructions, programs, code sets, or instruction sets stored in the memory 120 and calling data stored in the memory 120. Alternatively, the processor 110 may be implemented in hardware using at least one of Digital Signal Processing (DSP), Field-Programmable Gate Array (FPGA), and Programmable Logic Array (PLA). The processor 110 may integrate one or more of a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a Neural-Network Processing Unit (NPU), a modem, and the like. Wherein, the CPU mainly processes an operating system, a user interface, an application program and the like; the GPU is responsible for rendering and drawing the content to be displayed on the display screen 130; the NPU is used for realizing an Artificial Intelligence (AI) function; the modem is used to handle wireless communications. It is understood that the modem may not be integrated into the processor 110, but may be implemented by a single chip.
The Memory 120 may include a Random Access Memory (RAM) or a Read-Only Memory (Read-Only Memory). Optionally, the memory 120 includes a non-transitory computer-readable medium. The memory 120 may be used to store instructions, programs, code sets, or instruction sets. The memory 120 may include a stored program area and a stored data area, wherein the stored program area may store instructions for implementing an operating system, instructions for at least one function (such as a touch function, a sound playing function, an image playing function, etc.), instructions for implementing various method embodiments described below, and the like; the storage data area may store data (such as audio data, a phonebook) created according to the use of the electronic apparatus 100, and the like.
The display screen 130 is a display component for displaying a user interface. Optionally, the display screen 130 is a display screen with a touch function, and through the touch function, a user may use any suitable object such as a finger, a touch pen, and the like to perform a touch operation on the display screen 130.
The display 130 is typically disposed on the front panel of the electronic device 100. The display screen 130 may be designed as a full-face screen, a curved screen, a contoured screen, a double-face screen, or a folding screen. The display 130 may also be designed as a combination of a full-screen and a curved-screen, a combination of a special-shaped screen and a curved-screen, etc., which is not limited in this embodiment.
In addition, those skilled in the art will appreciate that the configuration of the electronic device 100 illustrated in the above figures does not constitute a limitation of the electronic device 100, and that the electronic device 100 may include more or fewer components than illustrated, or some components may be combined, or a different arrangement of components. For example, the electronic device 100 further includes a microphone, a speaker, a radio frequency circuit, an input unit, a sensor, an audio circuit, a Wireless Fidelity (Wi-Fi) module, a power supply, a bluetooth module, and other components, which are not described herein again.
Referring to fig. 3, a block diagram of video portrait segmentation according to an exemplary embodiment of the present application is shown. First, a video file to be processed is obtained and preprocessed; the preprocessed video file is input into a portrait segmentation model, the portrait mask of each video frame is segmented by the portrait segmentation model, and the portrait masks of all the frames are then assembled to complete the video portrait segmentation.
It should be noted that, if the present application is applied in a scene of real-time shooting, the step of acquiring a video file to be processed, and then preprocessing the video file may be replaced with: and acquiring a video frame shot in real time, and preprocessing the video frame.
Before video image segmentation is carried out through the image segmentation model, the image segmentation model needs to be trained. The process of training the portrait segmentation model is as follows: and acquiring a sample data set, preprocessing the sample data set, and performing model training based on the preprocessed sample data set.
Referring to fig. 4, a flow chart of a video portrait segmentation method according to an exemplary embodiment of the present application is shown. The execution subject in this embodiment may be an electronic device, or may also be a processor in the electronic device or an operating system in the electronic device, and this embodiment takes the execution subject as the electronic device as an example for description. In the embodiment of the present application, taking processing a video file to be processed through a portrait segmentation model as an example for explanation, the method includes:
step 401: the electronic equipment determines that a video file to be processed is input into the portrait segmentation model.
In one possible implementation, the electronic device directly inputs the video file into the portrait segmentation model. In another possible implementation, the electronic device first preprocesses the video file and then inputs the preprocessed video file into the portrait segmentation model. The preprocessing includes at least one of image normalization, image enhancement, and brightness adjustment. The image enhancement includes at least one of random rotation, random flipping, parameter transformation, and the like.
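As an illustration only, a minimal preprocessing pipeline of the kind described above could be sketched as follows in Python with torchvision; the normalization statistics and augmentation parameters are assumptions and are not specified by this application.

```python
from torchvision import transforms

# A hypothetical preprocessing pipeline: normalization plus light image enhancement.
# The mean/std values and the rotation/flip parameters are illustrative assumptions.
preprocess = transforms.Compose([
    transforms.ToTensor(),                      # HWC uint8 frame -> CHW float in [0, 1]
    transforms.RandomHorizontalFlip(p=0.5),     # random flipping (image enhancement)
    transforms.RandomRotation(degrees=10),      # random rotation (image enhancement)
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),  # image normalization
])
```

Each video frame (as a PIL image) would be passed through `preprocess` before being fed to the model.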
When the video file has already been recorded and portrait segmentation is performed on it, each video frame of the video file to be processed is input into the portrait segmentation model in sequence. When portrait segmentation is performed while the video is being recorded, the electronic device acquires video frames in real time and inputs the acquired video frames into the portrait segmentation model.
Step 402: for a first video frame in the video file, the electronic device determines a first feature map of the first video frame through a human image segmentation model.
The first video frame is a video frame except the first two frames in the video file to be processed. In the process of recording the video, under the condition of segmenting the video by human images, the first video frame is the current video frame in the video file.
Referring to fig. 5, the human image segmentation model includes a query encoder, and the query encoder is used for extracting a feature map of a video frame. Then in this step, the electronic device inputs the first video frame into the query encoder, and outputs the frame identifier and the first feature map of the first video frame. In addition, the electronic device stores the frame identifier of the first video frame and the first feature map in a Key-Value form, that is, the frame identifier of the first video frame is used as a Key, the first feature map is used as a Value, and the corresponding relationship between the frame identifier of the first video frame and the first feature map is stored. The frame identifier may be a unique identifier of a frame, such as a frame number, and in the embodiment of the present application, the frame identifier is taken as the frame number for example.
It should be noted that the query encoder is composed of a ResNeSt101 backbone module and a KV generation module, and has stronger feature extraction capability than a plain ResNet101, so the video portrait segmentation effect can be improved. The ResNeSt101 backbone module is used to determine the feature map of a video frame; the KV generation module, which contains two convolution layers, generates the correspondence between the frame identifier of the video frame and the feature map, that is, it can produce the frame identifier of the first video frame and the first feature map in Key-Value form based on the first feature map.
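A minimal sketch of a query encoder with this structure is shown below, assuming an STM-style reading of the KV generation module (two convolutions producing key and value maps); the placeholder backbone and the channel sizes are assumptions, not values taken from this application.

```python
import torch
import torch.nn as nn

class QueryEncoder(nn.Module):
    """Backbone feature extractor followed by a KV generation head (two conv layers).

    `backbone` is assumed to map an image (B, 3, H, W) to a feature map
    (B, in_channels, H/16, W/16); a ResNeSt101 trunk would play this role.
    """
    def __init__(self, backbone: nn.Module, in_channels: int = 2048,
                 key_channels: int = 128, value_channels: int = 512):
        super().__init__()
        self.backbone = backbone
        self.key_conv = nn.Conv2d(in_channels, key_channels, kernel_size=3, padding=1)
        self.value_conv = nn.Conv2d(in_channels, value_channels, kernel_size=3, padding=1)

    def forward(self, frame: torch.Tensor):
        feat = self.backbone(frame)        # feature map of the video frame
        key = self.key_conv(feat)          # Key map, used to match against memory
        value = self.value_conv(feat)      # Value map, carried along for decoding
        return key, value
```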
Step 403: the electronic device determines a second feature map for the plurality of reference frames.
The multiple reference frames are video frames which are positioned before the first video frame in the video file. The plurality of reference frames may be any video frame before the first video frame, and the intervals between two adjacent reference frames in the plurality of reference frames are equal or unequal; in the embodiment of the present application, the description is given by taking the example that the intervals between two adjacent reference frames are equal. Accordingly, the step of the electronic device determining the second feature maps of the plurality of reference frames may be implemented by the following steps (1) to (2), including:
(1) the electronic device determines frame identifications of a plurality of reference frames based on the frame identification of the first video frame and the sampling interval.
The interval between the frame identifiers of two adjacent reference frames is not greater than the sampling interval. Starting from frame 0, the electronic device takes, counting backwards, the first preset number of video frames from the sequence [n, 2n, 3n, …, kn] as reference frames, where n is the sampling interval; the frame number of the first video frame is greater than kn and less than (k + 1)n.
It should be noted that the electronic device may store the feature maps of all video frames before the first video frame, or may store only the feature maps of the first preset number of video frames before the first video frame; if the first preset number is exceeded, the feature map of the oldest video frame is deleted, thereby saving memory on the electronic device.
(2) The electronic device queries feature maps of a plurality of reference frames from the stored feature maps based on frame identifications of the plurality of reference frames.
The electronic device stores the frame identifier and the feature map of each video frame in Key-Value form; in this step, for each reference frame, the electronic device uses the frame identifier of the reference frame as a Key, obtains the Value corresponding to that Key from the Key-Value store, and takes the Value as the feature map of the reference frame.
For example, with continued reference to fig. 5, the portrait segmentation model includes a memory encoder for storing a feature map of a video frame preceding the first video frame. In this step, the electronic device queries the feature maps of the plurality of reference frames from the memory encoder.
It should be noted that the memory encoder is also composed of a ResNeSt101 backbone module and a KV generation module, which has stronger feature extraction capability than a plain ResNet101, so the video portrait segmentation effect can be improved.
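The reference-frame selection of step 403 and the Key-Value lookup described above can be sketched as follows; the dictionary-based store, the capacity limit, and the function names are illustrative assumptions.

```python
from collections import OrderedDict

def select_reference_ids(current_frame_id: int, sampling_interval: int,
                         max_refs: int) -> list:
    """Pick up to `max_refs` earlier frame numbers at multiples of the sampling
    interval, counting back from the multiple just below the current frame."""
    candidates = list(range(sampling_interval, current_frame_id, sampling_interval))
    return candidates[-max_refs:]

class FeatureMemory:
    """Frame-number -> feature-map store; the oldest entry is evicted beyond a budget."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.store = OrderedDict()

    def add(self, frame_id, feature_map):
        self.store[frame_id] = feature_map      # Key = frame identifier, Value = feature map
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)      # delete the oldest feature map to save memory

    def lookup(self, frame_ids):
        return [self.store[i] for i in frame_ids if i in self.store]
```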
Step 404: the electronics determine a weight for the second feature map for each reference frame.
The portrait segmentation model comprises a channel grouping attention module, wherein the channel grouping attention module is used for conducting attention guidance on the feature map of the reference frame, giving a larger weight to the second feature map of the more important reference frame, and giving a smaller weight to the second feature map of the less important reference frame. Accordingly, in this step, the electronic device inputs the second feature map of each reference frame into the channel grouping attention module, and outputs the weight of the second feature map of each reference frame.
It should be noted that the electronic device may directly input the second feature maps of the plurality of reference frames into the channel grouping attention module in sequence, or may first splice the second feature maps of the plurality of reference frames and input the spliced second feature maps into the channel grouping attention module, as shown with continued reference to fig. 5.
For example, referring to fig. 6, the channel grouping attention module includes a grouped convolution layer, a global pooling layer, a first fully connected layer, a correction layer, a second fully connected layer, and a normalization layer, which are connected in sequence. The grouped convolution layer is used to convolve the second feature map of each reference frame separately. The global pooling layer is used to reduce the dimensionality of the second feature maps processed by the grouped convolution layer, so as to reduce the amount of computation. The first fully connected layer is used to classify the dimension-reduced second feature maps; the correction layer is used to rectify the classification result; the second fully connected layer is used to determine the weight of each second feature map based on the rectified classification result; and the normalization layer is used to normalize the weight of each second feature map.
Continuing with fig. 6, the second feature maps of the multiple reference frames are second feature map 1, second feature map 2, and second feature map 3, respectively; after being processed by the grouped convolution layer, the global pooling layer, the first fully connected layer, the correction layer, the second fully connected layer, and the normalization layer, the weights of second feature map 1, second feature map 2, and second feature map 3 are weight 1, weight 2, and weight 3, respectively.
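A minimal PyTorch sketch of a channel grouping attention module with the layer order just described (grouped convolution, global pooling, fully connected, correction, fully connected, normalization) is given below; the channel sizes, hidden width, and Softmax normalization are assumptions.

```python
import torch
import torch.nn as nn

class ChannelGroupAttention(nn.Module):
    """Assigns one weight per reference-frame feature map.

    The input is assumed to be the spliced second feature maps of the
    reference frames, shaped (B, num_refs * C, H, W).
    """
    def __init__(self, channels_per_ref: int, num_refs: int, hidden: int = 64):
        super().__init__()
        total = channels_per_ref * num_refs
        # Grouped convolution: each reference frame's channels are convolved separately.
        self.group_conv = nn.Conv2d(total, total, kernel_size=3, padding=1, groups=num_refs)
        self.pool = nn.AdaptiveAvgPool2d(1)       # global pooling reduces each map to a vector
        self.fc1 = nn.Linear(total, hidden)       # first fully connected layer
        self.relu = nn.ReLU(inplace=True)         # correction (rectification) layer
        self.fc2 = nn.Linear(hidden, num_refs)    # second fully connected layer: one weight per frame
        self.norm = nn.Softmax(dim=1)             # normalization layer

    def forward(self, stacked_feats: torch.Tensor) -> torch.Tensor:
        x = self.group_conv(stacked_feats)
        x = self.pool(x).flatten(1)               # (B, num_refs * C)
        x = self.relu(self.fc1(x))
        weights = self.norm(self.fc2(x))          # (B, num_refs), sums to 1
        return weights
```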
Step 405: and the electronic equipment weights the second feature map of each reference frame based on the weight of the second feature map to obtain a third feature map.
And the electronic equipment multiplies the second feature map by the weight of the second feature map to obtain a third feature map. For example, with continued reference to fig. 6, the electronic device multiplies the second feature map 1, the second feature map 2, and the second feature map 3 by the weight 1, the weight 2, and the weight 3, respectively, to obtain a third feature map 1, a third feature map 2, and a third feature map 3.
Step 406: and the electronic equipment splices the first feature map and the third feature map of each reference frame to obtain a fourth feature map.
For example, with continued reference to fig. 5, the portrait segmentation model includes a space-time memory reading module. The electronic device inputs the first feature map and the third feature map of each reference frame into the space-time memory reading module, which outputs a new feature map (the fourth feature map). The space-time memory reading module adopts the space-time memory reading module of the STM (Space-Time Memory network) model.
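The space-time memory read can be sketched, in simplified single-sample form, as the attention-style lookup below; the channel layout and tensor shapes are assumptions, and the actual STM module is more elaborate.

```python
import torch
import torch.nn.functional as F

def memory_read(query_key, query_value, mem_keys, mem_values):
    """Simplified STM-style read.

    query_key:   (Ck, H, W)     key map of the current frame
    query_value: (Cv, H, W)     value map of the current frame
    mem_keys:    (T, Ck, H, W)  key maps of the reference frames
    mem_values:  (T, Cv, H, W)  value maps of the reference frames
    Returns a (2*Cv, H, W) map: retrieved memory spliced with the query value.
    """
    ck, h, w = query_key.shape
    t = mem_keys.shape[0]
    qk = query_key.view(ck, h * w)                             # (Ck, HW)
    mk = mem_keys.permute(1, 0, 2, 3).reshape(ck, t * h * w)   # (Ck, THW)
    mv = mem_values.permute(1, 0, 2, 3).reshape(-1, t * h * w) # (Cv, THW)

    affinity = torch.matmul(mk.t(), qk)        # (THW, HW) similarity scores
    affinity = F.softmax(affinity, dim=0)      # attention over space-time memory locations
    read = torch.matmul(mv, affinity)          # (Cv, HW) retrieved memory feature
    read = read.view(-1, h, w)
    return torch.cat([read, query_value], dim=0)   # fourth feature map (spliced)
```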
Step 407: and the electronic equipment decodes the fourth feature map to obtain a first portrait mask of the first video frame.
The portrait segmentation model includes a decoder. The electronic device inputs the fourth feature map into the decoder, and the decoder decodes the fourth feature map to obtain the first portrait mask. The decoder may adopt the decoder of the STM model to improve precision, or may adopt the decoder of the DeepLabV3+ model to improve efficiency.
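Only as a stripped-down illustration (the application itself points to the STM decoder or the DeepLabV3+ decoder), a minimal mask decoder could look like this; channel counts and the upsampling factor are assumptions.

```python
import torch
import torch.nn as nn

class MaskDecoder(nn.Module):
    """Upsamples the fused feature map back to input resolution and predicts a portrait mask."""
    def __init__(self, in_channels: int = 1024, mid_channels: int = 256, scale: int = 16):
        super().__init__()
        self.refine = nn.Sequential(
            nn.Conv2d(in_channels, mid_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, 1, kernel_size=1),   # single-channel portrait logit
        )
        self.scale = scale

    def forward(self, fused_feat: torch.Tensor) -> torch.Tensor:
        logit = self.refine(fused_feat)
        logit = nn.functional.interpolate(logit, scale_factor=self.scale,
                                          mode='bilinear', align_corners=False)
        return torch.sigmoid(logit)                      # portrait mask in [0, 1]
```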
In one possible implementation, the electronic device may also predict the portrait mask of the first video frame in combination with the portrait masks of the first two frames; accordingly, this step can be realized by the following steps (1) to (3), including:
(1) the electronic device obtains a second portrait mask for the second video frame and a third portrait mask for the third video frame.
The second video frame and the third video frame are the first two frames of the first video frame in the video file respectively. The electronic equipment predicts a portrait mask for video frames in a video file according to the sequence of the video frames; thus, the electronic device has predicted and stored a portrait mask for a video frame preceding the first video frame; in this step, the stored second portrait mask of the second video frame and the third portrait mask of the third video frame are directly obtained.
(2) The electronic device weights the fourth feature map based on the second portrait mask and the third portrait mask.
The portrait segmentation model includes a spatial prior attention module, which uses the relatively relevant portrait masks of the first two frames (the second portrait mask and the third portrait mask) to provide spatial attention prior information for the prediction of the current frame (the first video frame), that is, to weight the fourth feature map. In this step, the electronic device inputs the fourth feature map, the second portrait mask, and the third portrait mask into the spatial prior attention module, and the spatial prior attention module weights the fourth feature map.
(3) And the electronic equipment decodes the weighted fourth feature map to obtain a first portrait mask.
For example, referring to fig. 7, the electronic device splices the second portrait mask and the third portrait mask to obtain a fourth portrait mask, then splices the fourth feature map and the fourth portrait mask, then determines a weight based on the feature map obtained by splicing, weights the fourth feature map based on the weight, and decodes the weighted fourth feature map to obtain the first portrait mask.
The spatial prior attention module includes a convolutional layer and a normalization layer: the convolutional layer is used to determine the weight from the spliced feature map, and the normalization layer is used to normalize the weight. The normalization layer is a Sigmoid activation function.
In the embodiment of the application, the portrait masks of the first two related frames are utilized to provide spatial attention prior information for the prediction of the portrait mask of the first video frame, so that the portrait masks of the first two frames are fully utilized, prior knowledge is given to the first video frame, and the portrait segmentation accuracy is improved.
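A minimal sketch of a spatial prior attention module of this form is shown below; the kernel size and the single-channel weight map are assumptions.

```python
import torch
import torch.nn as nn

class SpatialPriorAttention(nn.Module):
    """Uses the portrait masks of the two preceding frames as a spatial prior.

    `feat` is the fused (fourth) feature map; `mask_prev1`/`mask_prev2` are the
    single-channel portrait masks of the two previous frames, resized to the
    same spatial size as `feat`.
    """
    def __init__(self, feat_channels: int):
        super().__init__()
        # Convolution over the spliced feature map plus the two prior masks.
        self.conv = nn.Conv2d(feat_channels + 2, 1, kernel_size=3, padding=1)
        self.norm = nn.Sigmoid()     # normalization layer (Sigmoid activation)

    def forward(self, feat, mask_prev1, mask_prev2):
        prior = torch.cat([mask_prev1, mask_prev2], dim=1)   # spliced prior masks
        x = torch.cat([feat, prior], dim=1)                  # splice feature map with priors
        weight = self.norm(self.conv(x))                     # spatial attention weight in [0, 1]
        return feat * weight                                 # weighted fourth feature map
```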
In another possible implementation, the electronic device further supports adding a skip connection between the query encoder and the decoder. Accordingly, this step can be realized by the following steps (A) to (C), including:
(A) the electronic device obtains a shallow feature map of the first video frame.
(B) And the electronic equipment splices the fourth characteristic diagram and the shallow characteristic diagram to obtain a fifth characteristic diagram.
(C) And the electronic equipment decodes the fifth feature map to obtain a first portrait mask.
In the embodiment of the application, a skip connection is added between the query encoder and the decoder to obtain a shallow feature map, so as to improve the detail segmentation effect.
In another possible implementation, the electronic device may also aggregate multi-scale feature maps; correspondingly, the method comprises the following steps: the electronic equipment respectively extracts features of the fourth feature graph based on the void ratios to obtain a plurality of feature graphs with different scales, splices the feature graphs with different scales to obtain a sixth feature graph, and decodes the sixth feature graph to obtain the first portrait mask.
The portrait segmentation model includes a dense hole (atrous) feature pyramid module. The electronic device performs feature extraction on the fourth feature map through the dense hole feature pyramid module to obtain a plurality of feature maps of different scales, splices the feature maps of different scales to obtain a sixth feature map, and then the decoder decodes the sixth feature map to obtain the first portrait mask. The dense hole feature pyramid module is mainly composed of hole convolutions with different dilation rates.
For example, referring to fig. 8, the original fourth feature map is a fourth feature map 1, the void rates are 3, 6, 12, 18, and 24, the dense void feature pyramid module performs feature extraction on the fourth feature map 1 according to the void rates of 3, 6, 12, 18, and 24, so as to obtain a plurality of feature maps with different scales (fourth feature map 2, fourth feature map 3, fourth feature map 4, fourth feature map 5, and fourth feature map 6), and concatenates the plurality of feature maps with different scales (fourth feature map 1 to fourth feature map 6), so as to obtain a sixth feature map.
In the embodiment of the application, an open-source dense hole feature pyramid module is added in front of a decoder, and hole convolutions with different hole rates are utilized to fuse and extract multi-scale feature maps, so that the feature extraction effect can be improved, and the portrait segmentation accuracy is improved.
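A simplified, non-dense (parallel-branch) sketch of such a multi-rate hole-convolution pyramid is given below, using the dilation rates 3, 6, 12, 18, 24 from the example above; the per-branch channel count is an assumption, and a true dense variant would additionally feed each branch the outputs of the previous branches.

```python
import torch
import torch.nn as nn

class DenseAtrousPyramid(nn.Module):
    """Extracts multi-scale features with dilated (hole) convolutions and splices them."""
    def __init__(self, in_channels: int, branch_channels: int = 256,
                 rates=(3, 6, 12, 18, 24)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(in_channels, branch_channels, kernel_size=3,
                      padding=r, dilation=r)        # hole convolution with rate r
            for r in rates
        ])

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # Keep the original feature map and splice it with all dilated branches.
        outs = [feat] + [branch(feat) for branch in self.branches]
        return torch.cat(outs, dim=1)               # sixth feature map
```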
It should be noted that the electronic device may decode the fourth feature map to obtain the first portrait mask by combining at least one of the foregoing implementations. If the fourth feature map is decoded by combining the multiple implementations to obtain the first portrait mask, then, with continued reference to fig. 5, the portrait segmentation model includes a channel grouping attention module, a space-time memory reading module, a dense hole feature pyramid module, and a spatial prior attention module, which are connected in sequence.
Another point to be noted is that, for each video frame except the first two frames in the video file, the human image mask of each frame of video can be determined according to the method provided by the embodiment of the present application; for the video frames of the first two frames, the feature maps of the video frames of the first two frames can be directly determined, and the feature maps of the video frames of the first two frames are decoded to obtain the portrait mask of the video frames of the first two frames. And if the image mask is obtained by segmenting the image of each frame of video in the video file, splicing the image mask of each frame of video to finish the segmentation of the video image.
In the embodiment of the application, the feature maps and the weights of a plurality of reference frames before a first video frame are obtained, and the weight of any reference frame is used for representing the influence degree of the reference frame on the first video frame, so that different weights are given to the reference frames with different influence degrees, and therefore the human image of the first video frame is segmented by combining the feature maps and the weights of the plurality of reference frames, and the video human image segmentation effect can be improved.
Referring to fig. 9, a flowchart of a video portrait segmentation method provided by an embodiment of the present application is shown. The execution subject in this embodiment may be an electronic device, or may also be a processor in the electronic device or an operating system in the electronic device, and this embodiment takes the execution subject as the electronic device as an example for description. In the embodiment of the present application, a training portrait segmentation model is taken as an example for explanation, and the method includes:
step 901: the electronic device obtains a sample video, which is annotated with a portrait mask for each video frame.
The number of sample videos may be plural; in this step, the sample video may be divided into a test set and a training set according to a preset ratio. The test set and the training set respectively comprise at least one sample video. The preset proportion can be set and changed according to the requirement; and the preset proportion needs to meet the requirement that the number of sample videos included in the training set is more than that of the sample videos included in the test set. For example, the preset ratio may be 2:8, and if the number of sample videos is 10, 2 sample videos are combined into a test set, and 8 sample videos are combined into a training set.
It should be noted that, after the electronic device acquires the sample video, step 902 may be directly executed, or the sample video may be preprocessed first, and then step 902 is executed based on the preprocessed sample video.
Step 902: the electronic equipment selects a target sample frame from the sample video, wherein the target sample frame is a video frame except the first two frames in the sample video.
The electronic equipment randomly selects one frame of video from the video frames except the first two frames in the sample video as a target sample frame.
It should be noted that, in one training epoch, each sample video in the training set is traversed, and a second preset number of iterations are performed on each sample video; that is, for each sample video, a target sample frame is first selected from the sample video, steps 903-905 are then performed, and the selection and training are repeated until the second preset number of iterations has been performed.
Step 903: the electronic device determines a seventh feature map of the target sample frame.
The initial portrait segmentation model includes a query encoder, and the electronic device inputs the target sample frame into the query encoder to obtain a seventh feature map of the target sample frame.
Step 904: the electronic device determines eighth feature maps of a plurality of sample reference frames and a weight of each eighth feature map, wherein the plurality of sample reference frames are video frames in the sample video before the target sample frame.
The initial portrait segmentation model includes a memory encoder, from which the electronic device queries the eighth feature maps of the plurality of sample reference frames; the initial portrait segmentation model also includes a channel grouping attention module, through which the electronic device determines the weight of each eighth feature map. It should be noted that the initial portrait segmentation model is a model based on a spatio-temporal convolutional neural network.
Step 905: and the electronic equipment performs model training based on the eighth feature maps of the multiple sample reference frames, the weight of each eighth feature map, the seventh feature map and the portrait mask of the target sample frame to obtain a portrait segmentation model.
For the eighth feature map of each sample reference frame, the electronic device weights the eighth feature map based on its weight to obtain a weighted eighth feature map; splices the seventh feature map with the weighted eighth feature maps of the sample reference frames to obtain a ninth feature map; decodes the ninth feature map to obtain a predicted portrait mask of the target sample frame; and updates the model parameters of the initial portrait segmentation model based on the portrait mask labeled for the target sample frame and the predicted portrait mask, so as to obtain the final portrait segmentation model.
In a possible implementation manner, the step of decoding, by the electronic device, the ninth feature map to obtain a portrait mask of the predicted target sample frame includes: the electronic equipment acquires a portrait mask marked by a fourth video frame and a portrait mask marked by a fifth video frame, wherein the fourth video frame and the fifth video frame are the first two frames of a target sample frame in the sample video respectively; and weighting the ninth feature map based on the portrait mask labeled by the fourth video frame and the portrait mask labeled by the fifth video frame, and decoding the weighted ninth feature map to obtain the predicted portrait mask of the target sample frame.
In another possible implementation manner, the step of decoding, by the electronic device, the ninth feature map to obtain a portrait mask of the predicted target sample frame includes: the electronic equipment acquires a shallow layer characteristic diagram of a target sample frame; splicing the ninth characteristic diagram and the shallow characteristic diagram to obtain a tenth characteristic diagram; and decoding the tenth characteristic diagram to obtain a human image mask of the predicted target sample frame.
In another possible implementation manner, the step of decoding, by the electronic device, the ninth feature map to obtain a portrait mask of the predicted target sample frame includes: the electronic equipment respectively extracts the features of the ninth feature map based on the void ratios to obtain a plurality of feature maps with different scales; splicing the feature maps with different scales to obtain an eleventh feature map; and decoding the eleventh feature map to obtain a human image mask of the predicted target sample frame.
It should be noted that the above implementation process is similar to the process of performing the portrait segmentation on the first video frame through the portrait segmentation model, and is not repeated herein.
The step in which the electronic device updates the model parameters of the initial portrait segmentation model based on the portrait mask labeled for the target sample frame and the predicted portrait mask to obtain the final portrait segmentation model includes the following:
The electronic device calculates a cross entropy loss value between the portrait mask labeled for the target sample frame and the predicted portrait mask, and executes a back propagation algorithm on the initial portrait segmentation model based on the cross entropy loss value to update the model parameters of the initial portrait segmentation model until the loss function converges, so as to obtain the final portrait segmentation model.
The electronic device can calculate the cross entropy loss value between the portrait mask labeled for the target sample frame and the predicted portrait mask through the following formula I:
Formula I:
L = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log p_i + (1 - y_i)\log(1 - p_i)\right]
where L represents the cross entropy loss value, i represents any pixel point of the target sample frame in the sample video, y_i represents the labeled value of pixel point i, p_i represents the predicted value of pixel point i, and N represents the total number of pixel points included in the sample video.
Note that, if a plurality of sample videos are included, the cross entropy (log) loss over all the sample videos is the average of the loss of each sample video. Ideally, the loss should be 0.
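A minimal training step implementing formula I with pixel-wise binary cross entropy and back propagation might look as follows; the model signature, optimizer, and argument names are hypothetical rather than taken from this application.

```python
import torch.nn as nn

def train_step(model, optimizer, target_frame, reference_feats, ref_weights,
               labeled_mask):
    """One illustrative parameter update for the portrait segmentation model.

    `model` is assumed to return a predicted mask in [0, 1] with the same
    shape as `labeled_mask`; the argument list mirrors the feature maps and
    weights described in the text but is otherwise a placeholder.
    """
    criterion = nn.BCELoss()                    # pixel-wise cross entropy (formula I)
    predicted_mask = model(target_frame, reference_feats, ref_weights)
    loss = criterion(predicted_mask, labeled_mask)

    optimizer.zero_grad()
    loss.backward()                             # back propagation algorithm
    optimizer.step()                            # update model parameters
    return loss.item()
```

In practice this step would be repeated for each selected target sample frame until the loss converges, as described above.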
Another point to be noted is that the present application may also evaluate a sample video, where the process is as follows:
the electronic equipment determines the evaluation value of the sample video through the following formula II;
the formula II is as follows:
Figure BDA0002948767640000141
here, IoU denotes the evaluation value, x denotes a human image mask of sample video prediction, and Y denotes a human image mask of sample video annotation.
It should be noted that, after the electronic device obtains the portrait segmentation model through training, the electronic device may further perform fine adjustment on the portrait segmentation model based on the test set.
It should be noted that the portrait segmentation model is not dependent on a specific device, and can be deployed on various terminals or servers.
In the embodiment of the application, the feature maps and the weights of a plurality of reference frames before a first video frame are obtained, and the weight of any reference frame is used for representing the influence degree of the reference frame on the first video frame, so that different weights are given to the reference frames with different influence degrees, and therefore the human image of the first video frame is segmented by combining the feature maps and the weights of the plurality of reference frames, and the video human image segmentation effect can be improved.
Referring to fig. 10, a block diagram of a video portrait segmentation apparatus according to an embodiment of the present application is shown. The video portrait segmentation apparatus may be implemented as all or part of the processor 110 by software, hardware, or a combination of both. The apparatus includes:
an input module 1001, configured to input a video file to be processed into a portrait segmentation model;
a first determining module 1002, configured to determine, through a portrait segmentation model, a first feature map of a first video frame in a video file, where the first video frame is a video frame, except for the first two frames, in the video file to be processed;
a second determining module 1003, configured to determine second feature maps of multiple reference frames and a weight of each second feature map, where the multiple reference frames are video frames located before the first video frame in the video file;
and a portrait segmentation module 1004, configured to perform portrait segmentation on the first video frame based on the second feature maps of the multiple reference frames, the weight of each second feature map, and the first feature map.
In one possible implementation, the face segmentation module 1004 includes:
the weighting unit is used for weighting the second feature map of each reference frame based on the weight of the second feature map to obtain a third feature map;
the splicing unit is used for splicing the first feature map and the third feature map of each reference frame to obtain a fourth feature map;
and the decoding unit is used for decoding the fourth feature map to obtain a first portrait mask of the first video frame.
In another possible implementation manner, the decoding unit is configured to obtain a second portrait mask of a second video frame and a third portrait mask of a third video frame, where the second video frame and the third video frame are the first two frames of the first video frame in the video file, respectively; weighting the fourth feature map based on the second portrait mask and the third portrait mask; and decoding the weighted fourth feature map to obtain a first portrait mask.
In another possible implementation manner, the decoding unit is configured to obtain a shallow feature map of the first video frame; splicing the fourth characteristic diagram and the shallow characteristic diagram to obtain a fifth characteristic diagram; and decoding the fifth feature map to obtain a first portrait mask.
In another possible implementation manner, the decoding unit is configured to perform feature extraction on the fourth feature map based on a plurality of void ratios, respectively, to obtain a plurality of feature maps of different scales; splicing a plurality of feature maps with different scales to obtain a sixth feature map; and decoding the sixth feature map to obtain a first portrait mask.
In another possible implementation manner, the second determining module 1003 is configured to determine frame identifiers of multiple reference frames based on a frame identifier of the first video frame and a sampling interval, where an interval between frame identifiers of two adjacent reference frames is not greater than the sampling interval; and querying feature maps of the multiple reference frames from the stored feature maps based on the frame identifications of the multiple reference frames.
In another possible implementation manner, the apparatus further includes:
the acquisition module is used for acquiring sample videos, and the sample videos are marked with the portrait masks of each frame of video;
the selection module is used for selecting a target sample frame from the sample video, wherein the target sample frame is a video frame except the first two frames in the sample video;
the third determining module is used for determining a seventh feature map of the target sample frame;
the fourth determining module is used for determining eighth feature maps of a plurality of sample reference frames and the weight of each eighth feature map, wherein the plurality of sample reference frames are video frames positioned before the target sample frame in the sample video;
and the model training module is used for carrying out model training based on the eighth feature maps of the multiple sample reference frames, the weight of each eighth feature map, the seventh feature map and the portrait mask of the target sample frame to obtain a portrait segmentation model.
In the embodiment of the application, the feature maps and the weights of a plurality of reference frames before a first video frame are obtained, and the weight of any reference frame is used for representing the influence degree of the reference frame on the first video frame, so that different weights are given to the reference frames with different influence degrees, and therefore the human image of the first video frame is segmented by combining the feature maps and the weights of the plurality of reference frames, and the video human image segmentation effect can be improved.
The embodiment of the present application further provides a computer-readable medium, which stores at least one program code, and the at least one program code is loaded and executed by the processor to implement the video portrait segmentation method as shown in the above embodiments.
An embodiment of the present application further provides a computer program product, wherein when a processor of an electronic device executes program codes in the computer program product, the electronic device is enabled to execute the video portrait segmentation method in any one of the above possible implementation manners.
Those skilled in the art will recognize that, in one or more of the examples described above, the functions described in the embodiments of the present application may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, the functions may be stored on, or transmitted over, a computer-readable medium as one or more pieces of program code. Computer-readable media include both computer storage media and communication media, the latter including any medium that facilitates transfer of a computer program from one place to another. A storage medium may be any available medium that can be accessed by a general purpose or special purpose computer.
The above description is only exemplary of the present application and should not be taken as limiting, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (10)

1. A method for video portrait segmentation, the method comprising:
inputting a video file to be processed into a portrait segmentation model;
determining a first feature map of a first video frame in the video file through the portrait segmentation model, wherein the first video frame is a video frame except the first two frames in the video file to be processed;
determining second feature maps of a plurality of reference frames and a weight of each second feature map, wherein the plurality of reference frames are video frames which are positioned before the first video frame in the video file;
and performing portrait segmentation on the first video frame based on the second feature maps of the plurality of reference frames, the weight of each second feature map and the first feature map.
2. The method of claim 1, wherein the segmenting the first video frame based on the second feature maps of the plurality of reference frames, the weight of each second feature map, and the first feature map comprises:
weighting the second feature map of each reference frame based on the weight of the second feature map to obtain a third feature map;
splicing the first feature map and the third feature map of each reference frame to obtain a fourth feature map;
and decoding the fourth feature map to obtain a first portrait mask of the first video frame.
3. The method of claim 2, wherein said decoding the fourth feature map to obtain the first human image mask of the first video frame comprises:
acquiring a second portrait mask of a second video frame and a third portrait mask of a third video frame, wherein the second video frame and the third video frame are respectively the first two frames of the first video frame in the video file;
weighting the fourth feature map based on the second portrait mask and the third portrait mask;
and decoding the weighted fourth feature map to obtain the first portrait mask.
4. The method of claim 2, wherein said decoding the fourth feature map to obtain the first human image mask of the first video frame comprises:
acquiring a shallow feature map of the first video frame;
splicing the fourth characteristic diagram and the shallow characteristic diagram to obtain a fifth characteristic diagram;
and decoding the fifth feature map to obtain the first portrait mask.
5. The method of claim 2, wherein said decoding the fourth feature map to obtain the first human image mask of the first video frame comprises:
respectively extracting the features of the fourth feature map based on a plurality of void ratios to obtain a plurality of feature maps with different scales;
splicing the feature maps with different scales to obtain a sixth feature map;
and decoding the sixth feature map to obtain the first portrait mask.
6. The method of claim 1, wherein the determining second feature maps of the plurality of reference frames comprises:
determining frame identifiers of the plurality of reference frames based on a frame identifier of the first video frame and a sampling interval, wherein an interval between the frame identifiers of any two adjacent reference frames is not larger than the sampling interval;
and querying the feature maps of the plurality of reference frames from stored feature maps based on the frame identifiers of the plurality of reference frames.
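A plain-Python sketch of this selection and lookup, assuming frame identifiers are integer indices and previously computed feature maps are kept in a dictionary keyed by frame identifier; the cache layout and the backward-stepping strategy are assumptions.

```python
def select_reference_ids(current_id, sampling_interval, num_refs):
    """Step backwards from the current frame so that adjacent reference identifiers
    are at most `sampling_interval` apart, clamping at frame 0 (illustrative)."""
    ids, frame_id = [], current_id
    for _ in range(num_refs):
        frame_id = max(frame_id - sampling_interval, 0)
        ids.append(frame_id)
    return sorted(set(ids))

def lookup_reference_features(ref_ids, feature_cache):
    # feature_cache: dict mapping frame identifier -> stored second feature map
    return [feature_cache[i] for i in ref_ids if i in feature_cache]
```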
7. The method of claim 1, wherein a training process of the portrait segmentation model comprises:
acquiring a sample video, wherein a portrait mask of each video frame in the sample video is annotated;
selecting a target sample frame from the sample video, wherein the target sample frame is a video frame other than the first two frames of the sample video;
determining a seventh feature map of the target sample frame;
determining eighth feature maps of a plurality of sample reference frames and a weight of each eighth feature map, wherein the plurality of sample reference frames are video frames that precede the target sample frame in the sample video;
and performing model training based on the eighth feature maps of the plurality of sample reference frames, the weight of each eighth feature map, the seventh feature map, and the portrait mask of the target sample frame, to obtain the portrait segmentation model.
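A single training step could look like the sketch below, reusing the same hypothetical model helpers as above, with an external optimizer and a binary cross-entropy loss between the predicted mask and the annotated portrait mask; the loss choice is an assumption, since the claims do not specify it.

```python
import torch.nn.functional as F

def training_step(model, optimizer, target_frame, ref_feats, ref_weights, gt_mask):
    seventh_feat = model.extract_feature(target_frame)               # seventh feature map
    pred_mask = model.segment_with_references(seventh_feat, ref_feats, ref_weights)
    loss = F.binary_cross_entropy(pred_mask, gt_mask)                # supervise with the annotated mask
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```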
8. A video portrait segmentation apparatus, characterized in that the apparatus comprises:
the input module is used for inputting the video file to be processed into the portrait segmentation model;
the first determining module is used for determining a first feature map of a first video frame in the video file through the portrait segmentation model, wherein the first video frame is a video frame other than the first two frames of the video file to be processed;
a second determining module, configured to determine second feature maps of multiple reference frames and a weight of each second feature map, where the multiple reference frames are video frames located before the first video frame in the video file;
and the portrait segmentation module is used for segmenting the portrait of the first video frame based on the second feature maps of the plurality of reference frames, the weight of each second feature map and the first feature map.
9. An electronic device, comprising a processor and a memory; the memory stores at least one program code for execution by the processor to implement the video portrait segmentation method of any of claims 1 to 7.
10. A computer-readable storage medium, characterized in that the storage medium stores at least one program code for execution by a processor to implement the video portrait segmentation method according to any one of claims 1 to 7.
CN202110203277.4A 2021-02-23 2021-02-23 Video portrait segmentation method and device, electronic equipment and storage medium Pending CN112950640A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110203277.4A CN112950640A (en) 2021-02-23 2021-02-23 Video portrait segmentation method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN112950640A true CN112950640A (en) 2021-06-11

Family

ID=76245745

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110203277.4A Pending CN112950640A (en) 2021-02-23 2021-02-23 Video portrait segmentation method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112950640A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108053410A (en) * 2017-12-11 2018-05-18 厦门美图之家科技有限公司 Moving Object Segmentation method and device
CN110378484A (en) * 2019-04-28 2019-10-25 清华大学 Atrous convolution spatial pyramid pooling context learning method based on attention mechanism
CN110263732A (en) * 2019-06-24 2019-09-20 京东方科技集团股份有限公司 Multiscale target detection method and device
CN110717886A (en) * 2019-09-03 2020-01-21 南京理工大学 Pavement pool detection method based on machine vision in complex environment
CN111967373A (en) * 2020-08-14 2020-11-20 东南大学 Self-adaptive enhanced fusion real-time instance segmentation method based on camera and laser radar

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113570606A (en) * 2021-06-30 2021-10-29 北京百度网讯科技有限公司 Target segmentation method and device and electronic equipment
CN113570607A (en) * 2021-06-30 2021-10-29 北京百度网讯科技有限公司 Target segmentation method and device and electronic equipment
WO2023273173A1 (en) * 2021-06-30 2023-01-05 北京百度网讯科技有限公司 Target segmentation method and apparatus, and electronic device
CN113570606B (en) * 2021-06-30 2023-09-05 北京百度网讯科技有限公司 Target segmentation method and device and electronic equipment
JP7372487B2 (en) 2021-06-30 2023-10-31 ベイジン バイドゥ ネットコム サイエンス テクノロジー カンパニー リミテッド Object segmentation method, object segmentation device and electronic equipment
CN113570607B (en) * 2021-06-30 2024-02-06 北京百度网讯科技有限公司 Target segmentation method and device and electronic equipment
CN114125462A (en) * 2021-11-30 2022-03-01 北京达佳互联信息技术有限公司 Video processing method and device
CN114125462B (en) * 2021-11-30 2024-03-12 北京达佳互联信息技术有限公司 Video processing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination