CN109753913B - Multi-mode video semantic segmentation method with high calculation efficiency - Google Patents


Info

Publication number
CN109753913B
CN109753913B
Authority
CN
China
Prior art keywords
frame
optical flow
semantic segmentation
module
frames
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811622581.7A
Other languages
Chinese (zh)
Other versions
CN109753913A (en)
Inventor
杨绿溪
顾恒瑞
朱紫辉
徐琴珍
李春国
黄永明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University filed Critical Southeast University
Priority to CN201811622581.7A priority Critical patent/CN109753913B/en
Publication of CN109753913A publication Critical patent/CN109753913A/en
Application granted granted Critical
Publication of CN109753913B publication Critical patent/CN109753913B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a computationally efficient multi-mode video semantic segmentation method. The method provides three different processing modes for each input video frame, realized by three modules: a semantic segmentation module, an optical flow module and a mixing module. A mode discrimination module automatically decides which processing mode each input video frame undergoes. The method exploits the positional information within video frames and the optical flow information between frames, combining semantic segmentation with optical flow in space and time. The fine result of the semantic segmentation module is retained, while the incorporation of optical flow greatly increases the running speed. Whereas the widely used DeepLab runs at about 2 fps, the method achieves fast semantic segmentation at 12 fps on the Cityscapes dataset, obtaining a better compromise between precision and processing speed than existing methods.

Description

Multi-mode video semantic segmentation method with high calculation efficiency
Technical Field
The invention relates to the technical field of computer vision, in particular to a multi-mode video semantic segmentation method with high calculation efficiency.
Background
Semantic segmentation is one of the key problems in the current computer vision field. It aims to classify targets at the pixel level and is a core task in scene understanding. Scene understanding is increasingly important as a core technology of computer vision, because more and more real-world applications need to infer knowledge or semantics from images and therefore require precise and efficient segmentation. Such applications include autonomous driving, human-machine interaction, computational photography, image search engines, augmented reality, and the like.
Semantic segmentation is not an isolated field, but a natural step in the progression of computer inference from coarse to fine, a result of the continuous development of object detection, classification and related techniques. Semantic segmentation enables fine-grained inference: dense predictions are made for an image so that every pixel is assigned a label.
Traditional image segmentation is an unsupervised learning problem that merely groups similar pixels together without requiring class-labelled training samples. Traditional computer vision and machine learning techniques can partially address scene understanding, but they cannot segment targets accurately. The image semantic segmentation studied in recent years is a supervised learning problem in which target recognition is performed using class-labelled training samples. Image semantic segmentation combines the two techniques of segmentation and object recognition and can divide an image into regions with high-level semantic content. For example, through semantic segmentation one image can be divided into regions with the four different semantics of "vehicle", "pedestrian", "tree" and "road".
At present, the most successful and most advanced deep learning technique for semantic segmentation derives from the fully convolutional neural network, which learns a hierarchy of features by using existing convolutional neural networks as powerful visual models. The fully convolutional network achieves full convolution by replacing the fully connected layers with convolutional layers, so that it outputs spatial feature maps instead of classification scores. These feature maps are upsampled (also known as deconvolved) to produce a dense pixel-level label output. This work is a milestone: the fully convolutional network showed how to train end-to-end for the semantic segmentation problem and how to learn dense prediction on inputs of arbitrary size, and it is the cornerstone of deep-learning methods for semantic segmentation.
The fully convolutional network solves two problems in realizing semantic segmentation. The first is the fully connected layer, whose input must be a fixed-size image patch. The existing solution replaces the final fully connected layers of the recognition network with convolutional layers, which allows images of arbitrary size as input and yields a prediction for every pixel of the image. However, the result of this method is not fine enough: it is insensitive to details in the image, ignores the relationships between pixels, and lacks spatial consistency.
The other problem addressed by the fully convolutional network is the pooling layer. At the present stage there are two different structures that deal with this problem.
One is the encoder-decoder architecture, in which the encoder uses pooling layers to gradually reduce the spatial dimension of the input data, and the decoder gradually restores the detail of the target and the corresponding spatial dimension through network layers such as deconvolution. The convolutional network extracts pixel features through a series of convolution and pooling layers, enlarging the receptive field, and then enlarges the feature map again through several deconvolution layers so that every pixel of the feature map can be classified. However, shrinking the picture and then enlarging it loses some of its spatial information.
The other structure is the hole convolution structure. Hole convolution is also called dilated (atrous) convolution: blank positions are inserted between the weights of an ordinary convolution kernel. Performing convolution with such a punctured kernel increases the receptive field without reducing the size of the picture, so less spatial information is lost and a finer classification result is achieved. However, this method is computationally heavy, resulting in low computational efficiency.
Existing semantic segmentation methods have reached a fairly high level of segmentation precision. However, for applications requiring strong real-time performance, such as automatic driving, the running speed of deep convolutional networks is insufficient.
Although semantic segmentation techniques for individual images have advanced greatly, many systems, when processing an image sequence, simply apply the same algorithm to every frame. This approach, while effective, is computationally expensive and completely ignores the temporal continuity and correlation between frames, which could aid segmentation. Using the correlation within video sequences to reduce computation while maintaining accuracy, i.e., treating video data differently from individual images, is therefore an important and leading-edge direction of current semantic segmentation research.
Disclosure of Invention
In order to solve the above problems, the invention provides a computationally efficient multi-mode video semantic segmentation method, which is not limited to the semantic segmentation of single images but, by exploiting the correlation between video frames, mainly addresses the running-speed problem of semantic segmentation. In many application scenarios accuracy is important, but a processing speed that reaches or approaches the usual video frame rate is also critical, especially in automatic driving assistance systems, where the real-time requirement is high. To this end, the invention processes the input video frames in three different modes, implemented by a semantic segmentation module, an optical flow module and a mixing module; a mode discrimination step selects a module for each video frame according to its characteristics, and combining semantic segmentation with optical flow information in time and space yields a compromise between precision and processing speed. The specific steps are as follows:
step 1: constructing a semantic segmentation network training sample set, a verification sample set and a test sample set;
step 2: constructing a full convolution network of an output pixel level by adopting a residual network structure and hole convolution: a semantic segmentation module;
step 3: training, testing and verifying the semantic segmentation module to obtain a verified semantic segmentation module;
step 4: constructing a deep learning data set of optical flow estimation;
step 5: constructing a deep learning optical flow module based on feature extraction, wherein the module comprises a convolution extracted feature network structure and an amplified part network structure;
step 6: training the optical flow module by using the data set to obtain a trained optical flow module;
Step 7: and (3) constructing a mixing module: dividing an input video frame into regions, dividing the regions with strong dynamic changes into semantic divisions, and predicting the rest regions by using optical flow information and key frames;
step 8: constructing a mode discrimination module: the first frame of every twenty video frames is set as a key frame and the tenth frame as a half key frame; if the input video frame is a key frame, the semantic segmentation module is selected for processing; if the input video frame is a half key frame, the mixing module is selected; the remaining frames are sent as current frames to the optical flow module;
step 9: using the optical flow information obtained by the optical flow module, the difference between the current frame and the key frame or half key frame is judged; if the difference is larger than a given threshold, the next frame is forced to be a key frame (a sketch of this per-frame dispatch is given after this step list).
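The cooperation of steps 7 to 9 can be summarized by the following minimal Python sketch. The callables segment, hybrid_segment, estimate_flow, warp and flow_difference are hypothetical stand-ins for the three modules and the feedback index described above; the dispatch shown is an illustration of the described logic rather than the exact implementation of the invention.

```python
def process_video(frames, segment, hybrid_segment, estimate_flow, warp,
                  flow_difference, flow_threshold):
    """Per-frame mode dispatch (steps 7 to 9).  The callables stand in for the
    semantic segmentation module, the mixing module, the optical flow module
    and the feedback index; their exact form is not specified here."""
    results = []
    cycle_start = 0                    # frame index where the current 20-frame round began
    key_frame = key_result = None
    force_key = False
    for i, frame in enumerate(frames):
        pos = i - cycle_start
        if pos % 20 == 0 or force_key:
            # Key frame: full semantic segmentation, and a new 20-frame round starts.
            cycle_start, force_key = i, False
            key_frame, key_result = frame, segment(frame)
            result = key_result
        elif pos == 10:
            # Half key frame: mixing module refines the dynamic region.
            flow = estimate_flow(key_frame, frame)
            result = hybrid_segment(frame, key_result, flow)
            key_result = result        # updated reference for the following frames
        else:
            # Remaining frames: warp the reference result with the optical flow.
            flow = estimate_flow(key_frame, frame)
            result = warp(key_result, flow)
            if flow_difference(flow) > flow_threshold:
                force_key = True       # step 9: force the next frame to be a key frame
        results.append(result)
    return results
```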
As a further improvement of the invention, step 2 adopts a residual structure and hole convolution on top of a fully convolutional neural network, where the fully convolutional network is obtained by removing the fully connected layers of the network and replacing them with convolutional layers, and the residual structure is used to solve the performance-degradation problem caused by an overly deep network.
As a further improvement of the present invention, the flow of the optical flow module of step 5 is roughly as follows: the key frame and the current frame are stacked together and features are extracted through convolutional layers; the result is then deconvolved by an enlarging network to obtain an optical flow prediction map with the same resolution as the input, and this optical flow prediction map is fused with the feature map of the key frame to obtain the feature map of the current frame.
As a further improvement of the present invention, in the mixing module of step 7, semantic segmentation is applied to the middle region of the input frame, and the remaining regions are predicted by combining the optical flow information with the key frame.
As a further improvement of the present invention, the specific method of the mode discrimination module of step 8 is as follows: the input video frames are divided into groups of twenty; the first frame of each group is set as a key frame and the tenth frame as a half key frame. A video frame judged to be a key frame is finely processed by the semantic segmentation module and the semantic segmentation result is recorded; a video frame judged to be a half key frame has its key regions finely processed by the mixing module and the semantic segmentation result is recorded; the remaining frames are processed quickly by the optical flow module, combining the optical flow information with the semantic information of the key frame and half key frame.
As a further improvement of the invention, the specific method of step 9 is as follows: the weighted average of the absolute displacement values of the optical flow information obtained by the optical flow module for the current video frame is taken as a feedback parameter, and a threshold is set for this parameter; if the feedback parameter exceeds the threshold, the next frame is forced to be a key frame, and the frames are re-divided into groups of twenty with that key frame as the first frame of a new round.
The invention applies different processing modes to the input video frames; compared with existing semantic segmentation methods, it trades a small sacrifice in precision for a several-fold improvement in speed. The processing speed of current mainstream semantic segmentation methods is about 2 fps, while the processing speed of the invention reaches 12 fps.
Drawings
FIG. 1 is a block flow diagram of an implementation of the present invention;
FIG. 2 is an algorithmic network framework of the present invention;
FIG. 3 is a network block diagram of a semantic segmentation module of the present invention;
FIG. 4 is a diagram of the residual network architecture of the present invention;
FIG. 5 is a schematic view of hole (dilated) convolution in the present invention;
FIG. 6 is a schematic diagram of an optical flow module network of the present invention;
FIG. 7 is a schematic diagram of video frame segmentation in accordance with the present invention;
FIG. 8 is an original drawing of semantic segmentation of the present invention;
FIG. 9 is a result diagram of an implementation of semantic segmentation of the present invention.
Detailed Description
The invention is described in further detail below with reference to the attached drawings and detailed description:
the invention provides a computationally efficient multi-mode video semantic segmentation method, which is not limited to the semantic segmentation of single images but, by exploiting the correlation between video frames, mainly addresses the running-speed problem of semantic segmentation. In many application scenarios accuracy is important, but a processing speed that reaches or approaches the usual camera frame rate is also critical, especially in automatic driving assistance systems, where the real-time requirement is high.
FIG. 1 shows a block diagram of an implementation of the present invention, including the training and implementation of the various sub-modules. FIG. 2 shows the algorithmic network framework of the invention: a video frame first passes through the mode discrimination module, which selects a mode; the frame is then sent to the corresponding module for processing to obtain the semantic segmentation result.
Referring to fig. 1, the construction of the semantic segmentation module of fig. 3 includes the following steps 1 to 2:
step 1, constructing a semantic segmentation network training sample set, a verification sample set and a test sample set:
the standard annotated image set leftImg8bit_trainvaltest (8-bit LDR format) and the fine annotation labels gtFine_trainvaltest from the CityScapes dataset are used, containing 19 categories and 5000 pictures, of which 2975 pictures form the training sample set, 1525 the test sample set and 500 the validation sample set. Prior to training, the dataset is converted to TFRecord files; a minimal conversion sketch is given below. Only the training sample set may be used during training, only the test sample set during testing, and only the validation sample set during validation.
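The conversion to TFRecord files can be sketched roughly as follows. This is a hedged example assuming the standard tf.train.Example protocol-buffer API; the pairing of image and label file paths and the function name are illustrative, not the exact conversion script of the invention.

```python
import tensorflow as tf

def write_split_to_tfrecord(pairs, out_path):
    """pairs: list of (image_path, label_path) tuples for one CityScapes split
    (training, validation or test).  Each pair is serialized into one record."""
    with tf.io.TFRecordWriter(out_path) as writer:
        for image_path, label_path in pairs:
            with open(image_path, "rb") as f:
                image_bytes = f.read()
            with open(label_path, "rb") as f:
                label_bytes = f.read()
            example = tf.train.Example(features=tf.train.Features(feature={
                "image": tf.train.Feature(bytes_list=tf.train.BytesList(value=[image_bytes])),
                "label": tf.train.Feature(bytes_list=tf.train.BytesList(value=[label_bytes])),
            }))
            writer.write(example.SerializeToString())
```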
Step 2, adopting a residual network structure and hole convolution to build a full convolution network for outputting pixel-level labels: a semantic segmentation module;
A semantic segmentation module is constructed. FIG. 3 is the network block diagram of the semantic segmentation module. The input is a video frame and the output is a semantic segmentation result with the same resolution as the input. The whole network is based on a fully convolutional neural network, contains no fully connected layer, and mainly consists of 7 blocks: the first block contains only one convolutional layer and one pooling layer and performs preliminary processing of the input video frame; the remaining 6 blocks adopt the residual network structure.
A schematic diagram of the residual network structure is shown in FIG. 4. Specifically, assuming the input is X, the output after 3 convolution operations and 2 nonlinear operations (ReLU activations) is F(X); F(X) is added to an identity mapping of the input X to obtain H(X). During training, H(X) is no longer fitted directly; instead the residual function H(X) - X, i.e. F(X), is fitted. Deep convolutional neural networks suffer from the degradation problem in which accuracy drops as the number of layers increases; the residual network structure alleviates this problem, making the network easier to optimize, so that performance can be improved simply by adding layers.
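The computation H(X) = F(X) + X of FIG. 4 can be written compactly, for example, with the Keras functional API. The sketch below shows one bottleneck residual block with 3 convolutions, 2 internal ReLUs and a ReLU after the addition; the channel widths, the optional dilation rate and the input shape are assumptions for illustration, not the exact configuration of the invention.

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters, dilation_rate=1):
    """One bottleneck residual block: H(X) = relu(F(X) + X), where F(X) is
    three convolutions with two ReLUs in between (cf. FIG. 4)."""
    shortcut = x
    y = layers.Conv2D(filters, 1, padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, 3, padding="same",
                      dilation_rate=dilation_rate, activation="relu")(y)
    y = layers.Conv2D(x.shape[-1], 1, padding="same")(y)  # project back to the input width
    y = layers.Add()([shortcut, y])                        # F(X) + X
    return layers.Activation("relu")(y)

# Usage: the feature map keeps its spatial size through the block.
inputs = tf.keras.Input(shape=(64, 128, 256))
outputs = residual_block(inputs, filters=64, dilation_rate=2)
model = tf.keras.Model(inputs, outputs)   # output shape: (64, 128, 256)
```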
If the structure shown in FIG. 4, containing 3 convolution operations, 3 nonlinear operations and 1 addition, is taken as one residual block, then block2 contains 3 residual blocks, block3 contains 4 residual blocks, and block4, block5, block6 and block7 each contain 3 residual blocks. block2 and block3 use standard convolution, so their output size is smaller than their input; block4, block5, block6 and block7 use hole convolution, whose output size is the same as the input.
The schematic diagram of hole convolution is shown in FIG. 5: the convolution kernel size is 5×5, but only the light-coloured positions carry weights and the remaining positions are zero, so its computation is equivalent to that of a 3×3 standard convolution kernel. Although the amount of computation is the same, the receptive field of the 5×5 hole convolution kernel is much larger than that of the 3×3 standard kernel, almost 3 times as large.
Convolutional neural networks usually contain pooling layers, whose purposes are to enlarge the receptive field and to reduce the feature map size; in semantic segmentation, however, the size reduction is undesirable, because it means a larger upsampling factor when restoring the feature map to the input size. Adopting hole convolution with a stride of 1 and suitable padding keeps the input and output sizes consistent while still enlarging the receptive field, so it replaces the pooling layer well and reduces the upsampling factor needed later. Therefore, after standard convolution is replaced by hole convolution in block4, block5, block6 and block7, the pooling layers can be removed, the feature maps obtained by convolution become denser, and the final result is more accurate. Finally, the output of block7 is upsampled to obtain an output map of the same size as the input.
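As a concrete illustration of this size-preserving property, the following hedged Keras sketch contrasts a standard 3×3 convolution with a 3×3 kernel whose dilation rate is 2; with stride 1 and "same" padding both keep the output the same size as the input, but the dilated kernel samples a 5×5 neighbourhood. The channel counts and input shape are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

inputs = tf.keras.Input(shape=(64, 128, 256))

# Standard 3x3 convolution: each layer sees a 3x3 neighbourhood.
standard = layers.Conv2D(256, 3, strides=1, padding="same")(inputs)

# Hole (dilated) convolution: the same 3x3 kernel with dilation_rate=2 has the
# same number of weights but samples a 5x5 neighbourhood, so the receptive
# field grows without pooling and the spatial size (64, 128) is preserved.
dilated = layers.Conv2D(256, 3, strides=1, padding="same", dilation_rate=2)(inputs)

model = tf.keras.Model(inputs, [standard, dilated])
model.summary()   # both outputs keep the (64, 128, 256) shape
```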
And step 3, training, testing and verifying the semantic segmentation module, and obtaining a model with ideal segmentation effect through adjusting parameters and an optimizer.
Step 4: constructing a deep learning data set of optical flow estimation;
step 5: an optical flow module is constructed. Optical flow uses the temporal changes of pixels in an image sequence and the correlation between adjacent frames to find the correspondence between the current frame and the next frame, thereby computing the motion information of objects between adjacent frames. The input of the optical flow module is the current frame and the key frame concatenated together, and the output is the motion vector of each pixel of the current frame relative to the key frame, i.e. a displacement in the x and y directions for every pixel position. The segmentation map of the current frame can therefore be fully inferred by combining the semantic segmentation result of the key frame with the pixel displacement information between the current frame and the key frame.
FIG. 6 is a schematic diagram of the optical flow module network. The current frame and the key frame are concatenated in the channel dimension, i.e. two 3-channel RGB pictures are joined to give a 6-channel input. The optical flow module adopts an encoder-decoder structure and consists of 6 convolutional layers and 4 deconvolutional layers. The first 6 convolutional layers extract feature information, and the output size of each convolutional layer is reduced relative to its input. Starting from the 2nd convolutional layer, the output of each convolutional layer is saved as one of the inputs of the subsequent deconvolution stage. In each of the last 4 deconvolutional layers, the feature map is deconvolved and the deconvolved output is concatenated with the corresponding earlier feature map and the optical flow prediction of the previous layer. Each layer doubles the resolution; the resolution of the finally predicted optical flow is 1/4 of the input image and is upsampled to the input size. Using the segmentation result of the key frame obtained by the semantic segmentation module and the optical flow between the current frame and the key frame obtained by the optical flow module, the predicted semantic segmentation result of the current frame is obtained by fusion.
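The fusion at the end of this step, i.e. propagating the key frame's segmentation to the current frame with the predicted flow, can be illustrated with a simple backward warp. The NumPy sketch below assumes that flow[y, x] holds the (dx, dy) displacement of the current-frame pixel (x, y) relative to the key frame, and uses nearest-neighbour lookup; the actual module may use a different flow convention and interpolation.

```python
import numpy as np

def warp_key_segmentation(key_seg, flow):
    """Predict the current frame's label map from the key frame's labels.

    key_seg : (H, W) integer label map produced by the semantic segmentation module
    flow    : (H, W, 2) per-pixel displacement of the current frame w.r.t. the key frame
    """
    h, w = key_seg.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Each current-frame pixel looks up the key-frame pixel it moved from.
    src_x = np.clip(np.rint(xs - flow[..., 0]), 0, w - 1).astype(int)
    src_y = np.clip(np.rint(ys - flow[..., 1]), 0, h - 1).astype(int)
    return key_seg[src_y, src_x]
```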
Step 6: training and testing the optical flow module by utilizing the data set to obtain a trained optical flow module;
step 7: a hybrid module is constructed. The mixing module refers to that some parts in the video frame are processed by the semantic segmentation module, and the rest parts are processed by the optical flow module.
Because a large source of video data is the on-board camera of an automatic driving assistance system, and automatic driving is an important application scenario of semantic segmentation, further processing can exploit the characteristics of video shot by an on-board camera. Here the video frame is divided into three parts, as shown in FIG. 7: the dark areas generally change little and have little influence on the machine's semantic understanding, while the white area is the region where pedestrians and vehicles are densest; its motion varies significantly and its semantics may change strongly between consecutive frames. The mixing module therefore processes the same video frame by region: semantic segmentation is applied to the white region and the result updates the corresponding part of the key frame's segmentation; the dark regions are handled by the optical flow module and are predicted from the optical flow of the corresponding key-frame regions together with the key frame's semantic segmentation result for those regions, instead of being re-segmented.
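The zoning described above amounts to compositing two partial results with a region mask. The following hedged NumPy sketch reuses warp_key_segmentation from the optical flow sketch above; fine_seg stands for the output of the semantic segmentation module on the white region, and the boolean mask is an assumption about how FIG. 7 splits the frame.

```python
import numpy as np

def hybrid_segment(key_seg, flow, fine_seg, dynamic_mask):
    """Mixing-module sketch: compose one label map for a half key frame.

    key_seg      : (H, W) label map of the key frame
    flow         : (H, W, 2) optical flow of the current frame w.r.t. the key frame
    fine_seg     : (H, W) label map from semantic segmentation of the dynamic region
    dynamic_mask : (H, W) boolean mask marking the white region of FIG. 7
    """
    # Dark regions: predict from the key frame's result via the optical flow.
    warped = warp_key_segmentation(key_seg, flow)
    # White region: overwrite with the finely segmented labels.
    return np.where(dynamic_mask, fine_seg, warped)
```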
Step 8: and constructing a mode judging module.
The first frame of every twenty frames of the video sequence is set as a key frame and the tenth frame as a half key frame. If the input video frame is a key frame, the semantic segmentation module is selected for processing; if it is a half key frame, the mixing module is selected; the remaining frames are sent as current frames to the optical flow module. The optical flow information obtained by the optical flow module is used to judge the difference between the current frame and the key frame or half key frame; if the difference is larger than a given threshold, the next video frame is forced to be a key frame. The processing time of the optical flow module is far shorter than that of the semantic segmentation module, but its accuracy is much lower and it depends heavily on the semantic segmentation result of the key frame. By continuously updating key frames and non-key frames, the mode discrimination module therefore greatly increases the running speed while ensuring that the accuracy does not drop much.
Step 9, dynamic feedback mechanism. To improve the adaptivity of the network, a dynamic feedback index is added. Because the optical flow module is fast and its output reflects the trend of change between the current frame and the key frame, the feature map obtained inside the optical flow module is used as the basis of an index measuring the difference between the current frame and the key frame: a weighted average, over the pixel points, of the Euclidean norm of the displacement vectors. A threshold is then given, which can be chosen as desired. If the difference index is smaller than the threshold, the network is not disturbed; if it is larger than the threshold, the next frame is forced to be a key frame. The index is not computed for every frame handled by the optical flow module; to reduce computation, one frame in every ten is selected at random for feedback.
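A hedged sketch of this feedback index follows: a weighted average of the per-pixel Euclidean norm of the displacement vectors, compared against a threshold, with roughly one optical-flow frame in ten sampled at random. The uniform default weights, the threshold value and the sampling rate are illustrative assumptions.

```python
import numpy as np

def flow_difference_index(flow, weights=None):
    """Weighted average over all pixels of the Euclidean norm of the flow vectors."""
    magnitude = np.sqrt(flow[..., 0] ** 2 + flow[..., 1] ** 2)
    if weights is None:
        weights = np.ones_like(magnitude)   # uniform weighting as a default assumption
    return float((weights * magnitude).sum() / weights.sum())

def force_key_frame(flow, threshold=5.0, sample_prob=0.1, rng=np.random):
    """Evaluate the index for roughly one optical-flow frame in ten and force
    the next frame to be a key frame when the index exceeds the threshold."""
    if rng.random() > sample_prob:           # skip roughly 9 out of 10 frames
        return False
    return flow_difference_index(flow) > threshold
```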
FIG. 9 shows the result of semantically segmenting FIG. 8. Objects of different classes are segmented in different colours, and the segmentation result is fine.
The invention provides a multi-mode semantic segmentation method for video sequences that applies different processing modes to the input video frames; compared with existing semantic segmentation methods, it trades a small sacrifice in precision for a several-fold improvement in speed. The processing speed of current mainstream semantic segmentation methods is about 2 fps, while the processing speed of the invention reaches 12 fps.
Table 1 compares the processing speed of the present invention with other more classical semantic segmentation methods. The processing speed of the invention is an order of magnitude faster than other methods.
Network model      Processing speed (fps)
FCN                1.2
PSPNet             1.6
DeepLab            2
The invention      12
The above description is only of the preferred embodiment of the present invention, and is not intended to limit the present invention in any other way, but is intended to cover any modifications or equivalent variations according to the technical spirit of the present invention, which fall within the scope of the present invention as defined by the appended claims.

Claims (1)

1. A computationally efficient multi-mode video semantic segmentation method that processes an input video frame in three different modes, implemented by a semantic segmentation module, an optical flow module and a mixing module, wherein mode discrimination selects a module for each video frame according to its characteristics, and a compromise between precision and processing speed is obtained by combining semantic segmentation with optical flow information in time or space, the method comprising the following specific steps:
step 1: constructing a semantic segmentation network training sample set, a verification sample set and a test sample set;
step 2: constructing a full convolution network of an output pixel level by adopting a residual network structure and hole convolution: a semantic segmentation module;
based on a fully convolutional neural network, a residual structure and hole convolution are adopted, wherein the fully convolutional neural network is obtained by removing the fully connected layers of the network and replacing them with convolutional layers, and the residual structure is used to solve the performance-degradation problem caused by an overly deep network;
step 3: training, testing and verifying the semantic segmentation module to obtain a verified semantic segmentation module;
step 4: constructing a deep learning data set of optical flow estimation;
the flow of the optical flow module is as follows: the key frame and the current frame are stacked together and features are extracted through convolutional layers; the result is then deconvolved by an enlarging network to obtain an optical flow prediction map with the same resolution as the input, and this optical flow prediction map is fused with the feature map of the key frame to obtain the feature map of the corresponding frame;
step 5: constructing a deep learning optical flow module based on feature extraction, wherein the module comprises a convolution extracted feature network structure and an amplified part network structure;
step 6: training the optical flow module by utilizing the data set to obtain a trained optical flow module;
step 7: and (3) constructing a mixing module:
semantic segmentation processing is applied to the middle region of the input frame, i.e. the region with strong dynamic changes is segmented semantically, and the other regions are predicted using the optical flow information and the key frame;
step 8: constructing a mode discrimination module: the first frame of every twenty video frames is set as a key frame and the tenth frame as a half key frame; if the input video frame is a key frame, the semantic segmentation module is selected for processing; if the input video frame is a half key frame, the mixing module is selected; the remaining frames are sent as current frames to the optical flow module;
the specific method comprises the following steps: setting a weighted average of displacement absolute values of optical flow information obtained by an optical flow module of a current video frame as a feedback parameter, setting a threshold for the parameter, forcedly setting a next frame as a key frame if the feedback parameter exceeds the set threshold, and re-dividing a first frame taking the key frame as a new round according to twenty frames;
selecting different modes of an input video frame, dividing the input video frame according to every twenty frames, setting a first frame of the twenty frames as a key frame, setting a tenth frame of the twenty frames as a half key frame, selecting a semantic segmentation module for carrying out fine processing on the video frame judged as the key frame, and recording a semantic segmentation result; selecting a mixing module for the video frames which are judged to be half key frames to carry out fine processing on the key areas, and recording semantic segmentation results; selecting an optical flow module from the rest frames, and combining optical flow information with semantic information of key frames and half key frames to perform quick processing;
step 9: the optical flow information obtained by the optical flow module is used to judge the difference from the key frame or half key frame, and if the difference is larger than a given threshold, the next frame is forced to be a key frame.
CN201811622581.7A 2018-12-28 2018-12-28 Multi-mode video semantic segmentation method with high calculation efficiency Active CN109753913B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811622581.7A CN109753913B (en) 2018-12-28 2018-12-28 Multi-mode video semantic segmentation method with high calculation efficiency

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811622581.7A CN109753913B (en) 2018-12-28 2018-12-28 Multi-mode video semantic segmentation method with high calculation efficiency

Publications (2)

Publication Number Publication Date
CN109753913A CN109753913A (en) 2019-05-14
CN109753913B true CN109753913B (en) 2023-05-23

Family

ID=66403195

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811622581.7A Active CN109753913B (en) 2018-12-28 2018-12-28 Multi-mode video semantic segmentation method with high calculation efficiency

Country Status (1)

Country Link
CN (1) CN109753913B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109949313A (en) * 2019-05-17 2019-06-28 中科院—南京宽带无线移动通信研发中心 A kind of real-time semantic segmentation method of image
CN110147763B (en) * 2019-05-20 2023-02-24 哈尔滨工业大学 Video semantic segmentation method based on convolutional neural network
CN110246142A (en) * 2019-06-14 2019-09-17 深圳前海达闼云端智能科技有限公司 A kind of method, terminal and readable storage medium storing program for executing detecting barrier
CN110378348B (en) * 2019-07-11 2021-07-09 北京悉见科技有限公司 Video instance segmentation method, apparatus and computer-readable storage medium
CN110827395B (en) * 2019-09-09 2023-01-20 广东工业大学 Instant positioning and map construction method suitable for dynamic environment
FR3103302B1 (en) * 2019-11-14 2021-11-26 Thales Sa IMAGE SEGMENTATION BY OPTICAL FLOW
CN111340852B (en) * 2020-03-10 2022-09-27 南昌航空大学 Image sequence optical flow calculation method based on optimized semantic segmentation
CN111523442B (en) * 2020-04-21 2023-05-23 东南大学 Self-adaptive key frame selection method in video semantic segmentation
CN112288755A (en) * 2020-11-26 2021-01-29 深源恒际科技有限公司 Video-based vehicle appearance component deep learning segmentation method and system
CN112613516A (en) * 2020-12-11 2021-04-06 北京影谱科技股份有限公司 Semantic segmentation method for aerial video data
CN112862839B (en) * 2021-02-24 2022-12-23 清华大学 Method and system for enhancing robustness of semantic segmentation of map elements
CN113838014B (en) * 2021-09-15 2023-06-23 南京工业大学 Aero-engine damage video detection method based on double spatial distortion
CN114143541B (en) * 2021-11-09 2023-02-14 华中科技大学 Cloud edge collaborative video compression uploading method and device for semantic segmentation

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103945227A (en) * 2014-04-16 2014-07-23 上海交通大学 Video semantic block partition method based on light stream clustering
CN108198202A (en) * 2018-01-23 2018-06-22 北京易智能科技有限公司 A kind of video content detection method based on light stream and neural network
US20180211393A1 (en) * 2017-01-24 2018-07-26 Beihang University Image guided video semantic object segmentation method and apparatus
CN108985269A (en) * 2018-08-16 2018-12-11 东南大学 Converged network driving environment sensor model based on convolution sum cavity convolutional coding structure


Also Published As

Publication number Publication date
CN109753913A (en) 2019-05-14

Similar Documents

Publication Publication Date Title
CN109753913B (en) Multi-mode video semantic segmentation method with high calculation efficiency
CN112132156B (en) Image saliency target detection method and system based on multi-depth feature fusion
US20200272835A1 (en) Intelligent driving control method, electronic device, and medium
CN107274445B (en) Image depth estimation method and system
CN109583340B (en) Video target detection method based on deep learning
CN111696110B (en) Scene segmentation method and system
CN109583345B (en) Road recognition method, device, computer device and computer readable storage medium
CN111882620B (en) Road drivable area segmentation method based on multi-scale information
CN110163188B (en) Video processing and method, device and equipment for embedding target object in video
CN110781744A (en) Small-scale pedestrian detection method based on multi-level feature fusion
CN110705412A (en) Video target detection method based on motion history image
CN107808140B (en) Monocular vision road recognition algorithm based on image fusion
CN111008608B (en) Night vehicle detection method based on deep learning
CN113269133A (en) Unmanned aerial vehicle visual angle video semantic segmentation method based on deep learning
Muthalagu et al. Vehicle lane markings segmentation and keypoint determination using deep convolutional neural networks
CN111160282B (en) Traffic light detection method based on binary Yolov3 network
Luo et al. Memory-guided collaborative attention for nighttime thermal infrared image colorization
Cho et al. Modified perceptual cycle generative adversarial network-based image enhancement for improving accuracy of low light image segmentation
CN116758449A (en) Video salient target detection method and system based on deep learning
CN110826564A (en) Small target semantic segmentation method and system in complex scene image
CN112487927B (en) Method and system for realizing indoor scene recognition based on object associated attention
CN113780189A (en) Lane line detection method based on U-Net improvement
CN117789144B (en) Cross network lane line detection method and device based on weight fusion
CN111914852B (en) Multi-mode saliency object detection method based on coding and decoding structure
NL2033551B1 (en) Automated lane detection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant