CN110796662A - Real-time semantic video segmentation method - Google Patents

Real-time semantic video segmentation method

Info

Publication number
CN110796662A
CN110796662A
Authority
CN
China
Prior art keywords
frame
segmentation
video
current
semantic
Prior art date
Legal status
Granted
Application number
CN201910859421.2A
Other languages
Chinese (zh)
Other versions
CN110796662B (en)
Inventor
冯君逸
李颂元
李玺
Current Assignee
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date
Filing date
Publication date
Application filed by Zhejiang University ZJU
Priority to CN201910859421.2A
Publication of CN110796662A
Application granted
Publication of CN110796662B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/10 Segmentation; Edge detection
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10016 Video; Image sequence
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G06T 2207/20084 Artificial neural networks [ANN]

Abstract

The invention discloses a real-time semantic video segmentation method that greatly accelerates semantic segmentation of video. The method comprises the following steps: 1) acquiring a plurality of groups of data sets for training semantic segmentation and defining the algorithm target; 2) training a lightweight image semantic segmentation CNN model; 3) decoding the original video to obtain residual maps, motion vectors and RGB images; 4) if the current frame is an I frame, feeding it to the segmentation model obtained in step 2) to obtain a complete segmentation result; 5) if the current frame is a P frame, propagating the segmentation result of the previous frame to the current frame using the motion vectors, and selecting a sub-block of the current frame for correction using the residual map; 6) repeating steps 4) and 5) until all video frames have been segmented. The method makes full use of the correlation between adjacent frames in a video; its accelerated processing based on compressed-domain information completes the complex segmentation task rapidly while maintaining high accuracy, improving efficiency by tens of times over conventional frame-by-frame segmentation.

Description

Real-time semantic video segmentation method
Technical Field
The invention belongs to the field of computer vision, and particularly relates to a real-time semantic video segmentation method.
Background
Semantic video segmentation is a computer vision task that assigns a semantic category to each pixel of every frame of a video. Real-time semantic video segmentation additionally imposes a speed requirement, generally more than 24 frames per second. Current state-of-the-art semantic video segmentation methods are machine learning methods based on convolutional neural networks (CNNs), and can be broadly divided into two categories: those based on sequences of image frames and those operating on the video directly. Methods of the first category treat the video as a sequence of independent image frames and trade a small amount of segmentation accuracy for real-time performance by reducing the input resolution or pruning the network; they do not exploit the inter-frame coherence implied by the video. Methods of the second category extract inter-frame coherent features from the video through techniques such as optical flow, 3D CNNs and RNNs, but these techniques are time-consuming and can become the bottleneck of semantic video segmentation.
In fact, compressed video itself already contains inter-frame coherence information, namely motion vectors (Mv) and residuals (Res). This information is very fast to obtain, and with it the speed of semantic video segmentation can be greatly increased. However, the inter-frame coherence information provided by compressed video is noisier than that produced by optical flow and similar techniques, so how to exploit the compressed-domain information while still guaranteeing accurate segmentation is the key problem addressed by this method.
Disclosure of Invention
In order to solve the above problems, the present invention provides a real-time semantic video segmentation method. The method builds on a deep image semantic segmentation network and further exploits the strong correlation between adjacent frames in a video, together with the multi-modal motion information in the video compression domain, to perform fast inference, thereby achieving real-time semantic video segmentation.
In order to achieve the purpose, the technical scheme of the invention is as follows:
a real-time semantic video segmentation method comprises the following steps:
S1, acquiring a plurality of groups of videos for training semantic segmentation, and defining the algorithm target;
S2, training a lightweight image semantic segmentation CNN model;
S3, decoding the video to obtain residual maps, motion vectors and RGB images;
S4, for the current frame in the video, if it is an I frame, feeding its RGB image into the image semantic segmentation CNN model trained in S2 to obtain a complete segmentation result;
S5, for the current frame in the video, if it is a P frame, propagating the segmentation result of the previous frame to the current frame using the motion vectors, and selecting a sub-block of the current frame for correction using the residual map;
S6, repeating steps S3 to S5 for all frames in the video until all video frames have been segmented.
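For illustration only, the per-frame dispatch of steps S3 to S6 can be sketched as follows. The decoder interface `decode_compressed_video` and the per-frame record (fields `type`, `rgb`, `mv`, `res`) are hypothetical stand-ins for an MPEG-4 decoder that exposes compressed-domain data; they are assumptions of this sketch, not part of the patent's disclosure:

```python
def segment_video(frames, segment_fn, p_frame_fn):
    """Steps S3-S6: full CNN inference on I frames (S4), compressed-domain
    propagation and correction on P frames (S5), repeated for every frame (S6).

    frames: iterable of decoded records from step S3, each carrying the frame
            type ('I' or 'P'), the RGB image and, for P frames, Mv(t) and Res(t).
    segment_fn: the trained model from step S2, mapping an RGB image to a label map.
    p_frame_fn: handler implementing step S5 from (F(t-1), I(t), Mv(t), Res(t)).
    """
    seg = None
    for frame in frames:
        if frame.type == 'I':
            seg = segment_fn(frame.rgb)                            # S4
        else:  # 'P'
            seg = p_frame_fn(seg, frame.rgb, frame.mv, frame.res)  # S5
        yield seg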
Further, in step S1, for each video V used for video semantic segmentation, the algorithm target is defined as: predicting the class of every pixel in each frame of the video V.
Further, in step S2, the training of the lightweight image semantic segmentation CNN model specifically includes:
S21, classifying each pixel of the image with a convolutional neural network φ operating on a single picture, the classification prediction for an image I processed by the network φ being φ(I);
S22, computing the cross-entropy loss between the prediction and the given class labels to optimize the parameters of the network φ.
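A minimal sketch of steps S21-S22, assuming PyTorch and assuming that φ maps a (B, 3, H, W) image batch to per-pixel class logits of shape (B, C, H, W); the patent does not prescribe a particular framework or network:

```python
import torch.nn.functional as F_nn

def train_step(phi, optimizer, image, label):
    """One optimization step for the pixel classifier phi (steps S21-S22).

    image: (B, 3, H, W) float tensor; label: (B, H, W) long tensor of class ids.
    """
    optimizer.zero_grad()
    logits = phi(image)                       # classification prediction phi(I)
    loss = F_nn.cross_entropy(logits, label)  # pixel-wise cross-entropy (S22)
    loss.backward()                           # back-propagate to optimize phi
    optimizer.step()
    return loss.item()
```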
Further, in step S3, the video is encoded and decoded using the MPEG-4 video coding standard, with the group-of-pictures (GOP) parameter g and the B-frame ratio β set in advance. Denoting the current frame time by t, the decoding process is as follows:
S31, if the current t-th frame is an I frame, directly decoding it to obtain the RGB image I(t) of the current t-th frame;
S32, if the current t-th frame is a P frame, first partially decoding it to obtain the motion vectors Mv(t) and the residual map Res(t), then completing the decoding to obtain the RGB image I(t) by pixel-domain translation and compensation transformations.
Further, in step S4, if the current t-th frame is an I-frame, the current t-th frame is semantically segmented according to the following algorithm:
S41, feeding the current RGB image I(t) into the image semantic segmentation CNN model trained in S2 for prediction, obtaining the semantic segmentation result F(t) = φ(I(t)).
Further, in step S5, if the current t-th frame is a P frame, the current t-th frame is semantically segmented according to the following algorithm:
S51, performing a pixel-domain translation of the previous frame's segmentation result F(t-1) using the motion vectors Mv(t) of the current frame, to obtain the segmentation result of the current frame:
F(t)[p] = F(t-1)[p - Mv(t)[p]]
wherein F(t)[p] denotes the value at pixel position p in the segmentation result F(t) of the current t-th frame obtained after translation; p is a pixel coordinate; and Mv(t)[p] denotes the value at pixel position p in the motion-vector map Mv(t) of the current t-th frame;
S52, using the residual map Res(t) of the current frame, selecting from all candidate sub-regions R_i of the current frame the sub-region containing the most pixels whose residual values exceed the threshold, as the sub-region R(t) to be re-segmented:
R(t) = argmax_{R_i} Σ_{p∈R_i} Indicator(|Res(t)[p]| > THR)
wherein R_i denotes the i-th candidate sub-region; Res(t)[p] denotes the residual value at pixel position p in the residual map Res(t); THR is a manually set threshold; and Indicator denotes an indicator function whose value is 1 if |Res(t)[p]| > THR holds and 0 otherwise;
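A sketch of this selection rule, assuming (as in step S52 of the detailed description below) that the frame is split into an equal grid of sub-blocks; the grid size and THR values here are illustrative, not prescribed by the patent:

```python
import numpy as np

def select_subregion(res, grid=4, thr=20.0):
    """Step S52: R(t) = argmax over R_i of the count of pixels p in R_i
    with |Res(t)[p]| > THR.

    res: (H, W) residual magnitude map; returns (row_slice, col_slice) for R(t).
    """
    h, w = res.shape
    mask = np.abs(res) > thr                      # Indicator(|Res(t)[p]| > THR)
    best, best_count = None, -1
    for i in range(grid):
        for j in range(grid):
            rs = slice(i * h // grid, (i + 1) * h // grid)
            cs = slice(j * w // grid, (j + 1) * w // grid)
            count = int(mask[rs, cs].sum())       # pixels exceeding the threshold
            if count > best_count:
                best, best_count = (rs, cs), count
    return best
```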
S53, feeding the sub-region R(t) obtained in S52 into the image semantic segmentation CNN model trained in S2 for re-segmentation, obtaining a new semantic segmentation result F_R(t) for the sub-region:
F_R(t) = φ(I(t)[R(t)])
wherein I(t)[R(t)] denotes the RGB image of the R(t) sub-region;
S54, updating the segmentation result of the R(t) sub-region in the current frame with the sub-region result obtained in step S53:
F(t)[R(t)] = F_R(t)
wherein F(t)[R(t)] denotes the segmentation result of the R(t) sub-region in the current t-th frame.
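Putting S51-S54 together yields a compact P-frame handler; warp_segmentation and select_subregion are the sketches given above, and segment_fn stands for the model φ applied to the sub-region:

```python
def handle_p_frame(prev_seg, rgb, mv, res, segment_fn, grid=4, thr=20.0):
    """Step S5 in full: propagate F(t-1) with Mv(t), then re-segment R(t)."""
    seg = warp_segmentation(prev_seg, mv)        # S51: motion-compensated F(t)
    rs, cs = select_subregion(res, grid, thr)    # S52: region R(t) to correct
    seg[rs, cs] = segment_fn(rgb[rs, cs])        # S53-S54: F(t)[R(t)] = F_R(t)
    return seg
```

In practice segment_fn, grid and thr would be bound (for example with functools.partial) before passing the handler to the dispatch loop sketched earlier.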
The non-key-frame segmentation algorithm of step S5 is far more efficient than running the CNN on every frame: by avoiding redundant feature extraction on highly similar images, the method processes P frames tens of times faster than frame-by-frame segmentation.
The method makes full use of the correlation between adjacent frames in a video; its accelerated processing based on compressed-domain information completes the complex segmentation task rapidly while maintaining high accuracy, improving efficiency by tens of times over conventional frame-by-frame segmentation.
Drawings
FIG. 1 is a schematic flow chart of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
On the contrary, the invention is intended to cover alternatives, modifications and equivalents which may be included within the spirit and scope of the invention as defined by the appended claims. Furthermore, in the following detailed description of the present invention, certain specific details are set forth in order to provide a better understanding of the present invention. It will be apparent to one skilled in the art that the present invention may be practiced without these specific details.
As shown in fig. 1, a real-time semantic video segmentation method includes the following steps:
S1, acquiring a plurality of groups of videos for training semantic segmentation, and defining the algorithm target. In this step, for each video V used for video semantic segmentation, the algorithm target is defined as: predicting the class of every pixel in each frame of the video V.
S2, training a lightweight image semantic segmentation CNN model. In this step, training the lightweight image semantic segmentation CNN model specifically includes:
S21, classifying each pixel of the image with a convolutional neural network φ operating on a single picture, the classification prediction for an image I processed by the network φ being φ(I);
S22, computing the cross-entropy loss between the prediction and the given class labels to optimize the parameters of the network φ.
S3, decoding the video to obtain residual maps, motion vectors and RGB images. In this step, the video is encoded and decoded using the MPEG-4 video coding standard, with the group-of-pictures (GOP) parameter g and the B-frame ratio β set in advance. Denoting the current frame time by t, the decoding process distinguishes whether the current frame is an I frame or a P frame, as follows:
S31, if the current t-th frame is an I frame, directly decoding it to obtain the RGB image I(t) of the current t-th frame;
S32, if the current t-th frame is a P frame, first partially decoding it to obtain the motion vectors Mv(t) and the residual map Res(t), then completing the decoding to obtain the RGB image I(t) by pixel-domain translation and compensation transformations.
S4, for the current frame in the video, if it is an I frame, feeding its RGB image into the image semantic segmentation CNN model trained in S2 to obtain a complete segmentation result.
In this step, if the current t-th frame is an I-frame, semantic segmentation is performed on the t-th frame according to the following algorithm:
S41, feeding the current RGB image I(t) into the image semantic segmentation CNN model trained in S2 for prediction, obtaining the semantic segmentation result F(t) = φ(I(t)).
S5, for the current frame in the video, if it is a P frame, propagating the segmentation result of the previous frame to the current frame using the motion vectors, and selecting a sub-block of the current frame for correction using the residual map.
In this step, if the current t-th frame is a P frame, semantic segmentation is performed on the current t-th frame according to the following algorithm:
S51, performing a pixel-domain translation of the previous frame's segmentation result F(t-1) using the motion vectors Mv(t) of the current frame, to obtain the segmentation result of the current frame:
F(t)[p] = F(t-1)[p - Mv(t)[p]]
wherein F(t)[p] denotes the value at pixel position p in the segmentation result F(t) of the current t-th frame obtained after translation; p is a pixel coordinate; and Mv(t)[p] denotes the value at pixel position p in the motion-vector map Mv(t) of the current t-th frame;
S52, the current frame image is first partitioned into a grid: its height and width are divided equally, forming a number of sub-blocks, i.e. sub-regions. Using the residual map Res(t) of the current frame, the sub-region containing the most pixels whose residual values exceed the threshold is selected from all candidate sub-regions R_i as the sub-region R(t) to be re-segmented:
R(t) = argmax_{R_i} Σ_{p∈R_i} Indicator(|Res(t)[p]| > THR)
wherein R_i denotes the i-th candidate sub-region; Res(t)[p] denotes the residual value at pixel position p in the residual map Res(t); THR is a manually set threshold; and Indicator denotes an indicator function whose value is 1 if |Res(t)[p]| > THR holds and 0 otherwise;
S53, the sub-region R(t) obtained in S52 is considered to have changed substantially relative to the previous frame in a way that the motion vectors cannot describe, so it is re-segmented: its RGB image is fed into the image semantic segmentation CNN model trained in S2, yielding a new semantic segmentation result F_R(t) for the sub-region:
F_R(t) = φ(I(t)[R(t)])
wherein I(t)[R(t)] denotes the RGB image of the R(t) sub-region;
S54, the segmentation result of the R(t) sub-region in the current frame is updated with the sub-region result obtained in step S53:
F(t)[R(t)] = F_R(t)
wherein F(t)[R(t)] denotes the segmentation result of the R(t) sub-region in the current t-th frame. The segmentation results of all sub-regions other than R(t) remain unchanged.
The non-key-frame segmentation algorithm of the above steps is far more efficient than running the CNN on every frame: by avoiding redundant feature extraction on highly similar images, the method processes P frames tens of times faster than frame-by-frame processing.
S6, repeating steps S3 to S5 for all frames in the video until the video stream ends and all video frames have been semantically segmented.
In this embodiment, the semantic video segmentation method first trains a convolutional neural network model for semantic segmentation of static images. On this basis, it exploits the strong correlation between consecutive video frames and fully explores the motion information of the video compression domain, converting the feature extraction and classification problem into a problem of pixel motion between adjacent frames, and re-segments the sub-regions likely to incur larger errors according to the principles of the compression model, thereby maintaining high accuracy while running at high speed.
The method has very strong generalization capability: the framework can be applied to other pixel-level video recognition tasks, including video object detection, video instance segmentation, video panoptic segmentation and the like. The speed-up does not depend on a specific CNN network structure; both high-accuracy and lightweight models are accelerated by several to tens of times.
Examples
The following simulation experiment is based on the above method. The implementation is as described above, so the specific steps are not repeated here; only the experimental results are shown below.
In this embodiment, ICNet is used as the lightweight image semantic segmentation CNN model. Multiple experiments were carried out on the public semantic segmentation dataset Cityscapes, which comprises 5000 short video clips; they demonstrate that the method significantly improves the efficiency of semantic video segmentation while preserving accuracy. In the algorithm, the group-of-pictures (GOP) parameter g is set to 12 and the B-frame ratio β to 0.
Compared with the traditional approach of segmenting frame by frame with a CNN, the method of the invention differs mainly in whether the compressed-domain operations of S3-S5 are performed. The effects of the two methods are shown in Table 1.
Table 1. Effect of the invention on the Cityscapes dataset
[Table 1 is reproduced as an image in the original publication.]
Through the above technical scheme, the embodiment of the invention realizes a real-time semantic video segmentation method based on deep learning. The invention makes full use of the motion information in the video compression domain to model the correlation between adjacent frames, and uses this correlation to reduce redundant computation, thereby greatly accelerating video semantic segmentation.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims (6)

1. A real-time semantic video segmentation method is characterized by comprising the following steps:
S1, acquiring a plurality of groups of videos for training semantic segmentation, and defining the algorithm target;
S2, training a lightweight image semantic segmentation CNN model;
S3, decoding the video to obtain residual maps, motion vectors and RGB images;
S4, for the current frame in the video, if it is an I frame, feeding its RGB image into the image semantic segmentation CNN model trained in S2 to obtain a complete segmentation result;
S5, for the current frame in the video, if it is a P frame, propagating the segmentation result of the previous frame to the current frame using the motion vectors, and selecting a sub-block of the current frame for correction using the residual map;
S6, repeating steps S3 to S5 for all frames in the video until all video frames have been segmented.
2. The real-time semantic video segmentation method according to claim 1, wherein in step S1, for each video V used for video semantic segmentation, the algorithm target is defined as: predicting the class of every pixel in each frame of the video V.
3. The real-time semantic video segmentation method according to claim 2, wherein in step S2, training the lightweight image semantic segmentation CNN model specifically includes:
S21, classifying each pixel of the image with a convolutional neural network φ operating on a single picture, the classification prediction for an image I processed by the network φ being φ(I);
S22, computing the cross-entropy loss between the prediction and the given class labels to optimize the parameters of the network φ.
4. The real-time semantic video segmentation method according to claim 3, wherein in step S3 the video is encoded and decoded using the MPEG-4 video coding standard with the group-of-pictures (GOP) parameter g and the B-frame ratio β set in advance, and, denoting the current frame time by t, the decoding process is as follows:
S31, if the current t-th frame is an I frame, directly decoding it to obtain the RGB image I(t) of the current t-th frame;
S32, if the current t-th frame is a P frame, first partially decoding it to obtain the motion vectors Mv(t) and the residual map Res(t), then completing the decoding to obtain the RGB image I(t) by pixel-domain translation and compensation transformations.
5. The real-time semantic video segmentation method according to claim 4, wherein in step S4, if the current t-th frame is an I-frame, the current t-th frame is semantically segmented according to the following algorithm:
S41, feeding the current RGB image I(t) into the image semantic segmentation CNN model trained in S2 for prediction, obtaining the semantic segmentation result F(t) = φ(I(t)).
6. The real-time semantic video segmentation method according to claim 5, wherein in step S5, if the current t-th frame is a P-frame, the current t-th frame is semantically segmented according to the following algorithm:
S51, performing a pixel-domain translation of the previous frame's segmentation result F(t-1) using the motion vectors Mv(t) of the current frame, to obtain the segmentation result of the current frame:
F(t)[p] = F(t-1)[p - Mv(t)[p]]
wherein F(t)[p] denotes the value at pixel position p in the segmentation result F(t) of the current t-th frame obtained after translation; p is a pixel coordinate; and Mv(t)[p] denotes the value at pixel position p in the motion-vector map Mv(t) of the current t-th frame;
S52, using the residual map Res(t) of the current frame, selecting from all candidate sub-regions R_i of the current frame the sub-region containing the most pixels whose residual values exceed the threshold, as the sub-region R(t) to be re-segmented:
R(t) = argmax_{R_i} Σ_{p∈R_i} Indicator(|Res(t)[p]| > THR)
wherein R_i denotes the i-th candidate sub-region; Res(t)[p] denotes the residual value at pixel position p in the residual map Res(t); THR is a manually set threshold; and Indicator denotes an indicator function whose value is 1 if |Res(t)[p]| > THR holds and 0 otherwise;
S53, feeding the sub-region R(t) obtained in S52 into the image semantic segmentation CNN model trained in S2 for re-segmentation, obtaining a new semantic segmentation result F_R(t) for the sub-region:
F_R(t) = φ(I(t)[R(t)])
wherein I(t)[R(t)] denotes the RGB image of the R(t) sub-region;
S54, updating the segmentation result of the R(t) sub-region in the current frame with the sub-region result obtained in step S53:
F(t)[R(t)] = F_R(t)
wherein F(t)[R(t)] denotes the segmentation result of the R(t) sub-region in the current t-th frame.
CN201910859421.2A 2019-09-11 2019-09-11 Real-time semantic video segmentation method Active CN110796662B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910859421.2A CN110796662B (en) 2019-09-11 2019-09-11 Real-time semantic video segmentation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910859421.2A CN110796662B (en) 2019-09-11 2019-09-11 Real-time semantic video segmentation method

Publications (2)

Publication Number Publication Date
CN110796662A (en) 2020-02-14
CN110796662B CN110796662B (en) 2022-04-19

Family

ID=69427102

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910859421.2A Active CN110796662B (en) 2019-09-11 2019-09-11 Real-time semantic video segmentation method

Country Status (1)

Country Link
CN (1) CN110796662B (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111985456A (en) * 2020-09-10 2020-11-24 上海交通大学 Video real-time identification, segmentation and detection architecture
CN112084949A (en) * 2020-09-10 2020-12-15 上海交通大学 Video real-time identification segmentation and detection method and device
CN112364822A (en) * 2020-11-30 2021-02-12 重庆电子工程职业学院 Automatic driving video semantic segmentation system and method
CN112990273A (en) * 2021-02-18 2021-06-18 中国科学院自动化研究所 Compressed domain-oriented video sensitive character recognition method, system and equipment
CN113486697A (en) * 2021-04-16 2021-10-08 成都思晗科技股份有限公司 Forest smoke and fire monitoring method based on space-based multi-modal image fusion
CN115294489A (en) * 2022-06-22 2022-11-04 太原理工大学 Semantic segmentation method and system for disaster video data
CN115713625A (en) * 2022-11-18 2023-02-24 盐城众拓视觉创意有限公司 Method for rapidly combining teaching real-recorded video and courseware background into film
WO2023154007A3 (en) * 2022-02-11 2023-10-26 脸萌有限公司 Feature extraction method and apparatus for video, slicing method and apparatus for video, and electronic device and storage medium

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120294530A1 (en) * 2010-01-22 2012-11-22 Malavika Bhaskaranand Method and apparatus for video object segmentation
US20130155228A1 (en) * 2011-12-19 2013-06-20 Industrial Technology Research Institute Moving object detection method and apparatus based on compressed domain
US20150256850A1 (en) * 2014-03-10 2015-09-10 Euclid Discoveries, Llc Continuous Block Tracking For Temporal Prediction In Video Encoding
CN108256511A (en) * 2018-03-15 2018-07-06 太原理工大学 Body movement detection method based on Video coding code stream

Non-Patent Citations (11)

* Cited by examiner, † Cited by third party
Title
FEDERICO PERAZZI ET AL: "Learning Video Object Segmentation from Static Images", 2017 IEEE Conference on Computer Vision and Pattern Recognition *
JAIN S ET AL: "Fast Semantic Segmentation on Video Using Block Motion-Based Feature Interpolation", 15th European Conference on Computer Vision (ECCV) *
XIZHOU ZHU ET AL: "Towards High Performance Video Object Detection", 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition *
ZHENGTAO TAN ET AL: "Real Time Compressed Video Object Segmentation", 2019 IEEE International Conference on Multimedia and Expo (ICME) *
ZOUWU NING ET AL: "Visual Attention Based Video Object Segmentation in MPEG Compressed Domain", 2007 IET Conference on Wireless, Mobile and Sensor Networks (CCWMSN07) *
冯杰 (FENG JIE): "Research on Video Segmentation and Feature Extraction Methods Based on the H.264 Compressed Domain", China Doctoral Dissertations Full-text Database, Information Science and Technology *
孔祥鹏 (KONG XIANGPENG): "Research on Moving Object Segmentation and Extraction Methods Based on the H.264 Compressed Domain", China Master's Theses Full-text Database, Information Science and Technology *
孙涛 (SUN TAO): "Research on Moving Object Segmentation Techniques Based on the Compressed Domain", China Master's Theses Full-text Database, Information Science and Technology *
杨高波 等 (YANG GAOBO ET AL): "Video Object Segmentation under the MPEG-4 Framework and Analysis of Its Key Technologies", Journal on Communications *
陆宇 (LU YU): "Video Object Segmentation Based on the H.264 Compressed Domain", China Doctoral Dissertations Full-text Database, Information Science and Technology *
陈薇薇 (CHEN WEIWEI): "Densification of Motion Vectors in the MPEG-2 Compressed Domain and Research on Moving Object Segmentation Algorithms", China Master's Theses Full-text Database, Information Science and Technology *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111985456A (en) * 2020-09-10 2020-11-24 上海交通大学 Video real-time identification, segmentation and detection architecture
CN112084949A (en) * 2020-09-10 2020-12-15 上海交通大学 Video real-time identification segmentation and detection method and device
CN112084949B (en) * 2020-09-10 2022-07-19 上海交通大学 Video real-time identification segmentation and detection method and device
CN112364822A (en) * 2020-11-30 2021-02-12 重庆电子工程职业学院 Automatic driving video semantic segmentation system and method
CN112990273A (en) * 2021-02-18 2021-06-18 中国科学院自动化研究所 Compressed domain-oriented video sensitive character recognition method, system and equipment
CN113486697A (en) * 2021-04-16 2021-10-08 成都思晗科技股份有限公司 Forest smoke and fire monitoring method based on space-based multi-modal image fusion
CN113486697B (en) * 2021-04-16 2024-02-13 成都思晗科技股份有限公司 Forest smoke and fire monitoring method based on space-based multimode image fusion
WO2023154007A3 (en) * 2022-02-11 2023-10-26 脸萌有限公司 Feature extraction method and apparatus for video, slicing method and apparatus for video, and electronic device and storage medium
CN115294489A (en) * 2022-06-22 2022-11-04 太原理工大学 Semantic segmentation method and system for disaster video data
CN115713625A (en) * 2022-11-18 2023-02-24 盐城众拓视觉创意有限公司 Method for rapidly combining teaching real-recorded video and courseware background into film

Also Published As

Publication number Publication date
CN110796662B (en) 2022-04-19

Similar Documents

Publication Publication Date Title
CN110796662B (en) Real-time semantic video segmentation method
Tu et al. Action-stage emphasized spatiotemporal VLAD for video action recognition
US11398037B2 (en) Method and apparatus for performing segmentation of an image
US8983178B2 (en) Apparatus and method for performing segment-based disparity decomposition
CN106331723B (en) Video frame rate up-conversion method and system based on motion region segmentation
CN108615241B (en) Rapid human body posture estimation method based on optical flow
CN111310594B (en) Video semantic segmentation method based on residual error correction
JP2018507477A (en) Method and apparatus for generating initial superpixel label map for image
CN108200432A (en) A kind of target following technology based on video compress domain
US20040062440A1 (en) Sprite recognition in animated sequences
CN108764177B (en) Moving target detection method based on low-rank decomposition and representation joint learning
Zhao et al. Transformer-based self-supervised monocular depth and visual odometry
CN104202606B (en) One kind determines method based on HEVC estimation starting points
CN111292357B (en) Video inter-frame rapid motion estimation method based on correlation filtering
Jing et al. Video prediction: a step-by-step improvement of a video synthesis network
Sheng et al. VNVC: A Versatile Neural Video Coding Framework for Efficient Human-Machine Vision
CN110853040B (en) Image collaborative segmentation method based on super-resolution reconstruction
CN113920170A (en) Pedestrian trajectory prediction method and system combining scene context and pedestrian social relationship and storage medium
CN114419729A (en) Behavior identification method based on light-weight double-flow network
Nemcev et al. Modified EM-algorithm for motion field refinement in motion compensated frame interpoliation
Luo et al. Super-High-Fidelity Image Compression via Hierarchical-ROI and Adaptive Quantization
Chu et al. A basis-background subtraction method using non-negative matrix factorization
Gao et al. Object-Centric Voxelization of Dynamic Scenes via Inverse Neural Rendering
US11967083B1 (en) Method and apparatus for performing segmentation of an image
Xiang et al. A CNNs-based method for optical flow estimation with prior constraints and stacked U-Nets

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant