CN112465826A - Video semantic segmentation method and device

Video semantic segmentation method and device

Info

Publication number
CN112465826A
Authority
CN
China
Prior art keywords
image
key frame
sample
frame image
semantic segmentation
Prior art date
Legal status
Granted
Application number
CN201910840038.2A
Other languages
Chinese (zh)
Other versions
CN112465826B (en)
Inventor
吴长虹
张明
邝宏武
Current Assignee
Shanghai Goldway Intelligent Transportation System Co Ltd
Original Assignee
Shanghai Goldway Intelligent Transportation System Co Ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Goldway Intelligent Transportation System Co Ltd filed Critical Shanghai Goldway Intelligent Transportation System Co Ltd
Priority to CN201910840038.2A priority Critical patent/CN112465826B/en
Publication of CN112465826A publication Critical patent/CN112465826A/en
Application granted granted Critical
Publication of CN112465826B publication Critical patent/CN112465826B/en

Classifications

    • G06T7/10 Segmentation; Edge detection (G Physics > G06 Computing; calculating or counting > G06T Image data processing or generation, in general > G06T7/00 Image analysis)
    • G06T2207/10016 Video; Image sequence (G06T2207/00 Indexing scheme for image analysis or image enhancement > G06T2207/10 Image acquisition modality)
    • G06T2207/20081 Training; Learning (G06T2207/20 Special algorithmic details)
    • G06T2207/20084 Artificial neural networks [ANN] (G06T2207/20 Special algorithmic details)

Abstract

The embodiment of the invention provides a video semantic segmentation method and a video semantic segmentation device, wherein the method comprises the following steps: acquiring an image sequence according to a video image, wherein the image sequence comprises a key frame image and a non-key frame image; inputting the key frame image into a trained first deep neural network to obtain a first semantic segmentation result, and inputting the non-key frame image into a trained second deep neural network to obtain a second semantic segmentation result, wherein the trained first deep neural network comprises a first full convolution network, the trained second deep neural network comprises a second full convolution network, and the number of channels of the first full convolution network is greater than the number of channels of the second full convolution network; and obtaining a semantic segmentation result of the video image according to the first semantic segmentation result and the second semantic segmentation result. The video semantic segmentation method and device provided by the embodiment of the invention can solve the problems of high time consumption and heavy computational load in the prior art.

Description

Video semantic segmentation method and device
Technical Field
The embodiment of the invention relates to the technical field of image processing, in particular to a video semantic segmentation method and device.
Background
Semantic segmentation groups image pixels according to the different semantic meanings they express. In the field of automatic driving, after relevant images of roads are detected by a vehicle-mounted camera or a laser radar, the images can be segmented and classified through semantic segmentation so as to avoid obstacles such as pedestrians and vehicles. Semantic segmentation of road scenes is therefore of great significance for automatic driving.
The existing road scene semantic segmentation method splits the detected video into individual frame images, obtains the global and local context information of each frame image, and performs semantic segmentation on each frame image separately. In practice, the collected road scene is usually presented as a video, and one video may include many frames of images, each of which requires semantic segmentation. Therefore, when semantic segmentation is performed on many frame images, the large number of images means that, on the basis of ensuring segmentation performance, the method is usually time-consuming and the computational load of the model is heavy.
Therefore, a video semantic segmentation method is needed to solve the problems of high time consumption and heavy computational load in the prior art.
Disclosure of Invention
The embodiment of the invention provides a video semantic segmentation method and device, aiming to solve the problems of high time consumption and heavy computational load in the prior art.
In a first aspect, an embodiment of the present invention provides a video semantic segmentation method, including:
acquiring an image sequence according to a video image, wherein the image sequence comprises a key frame image and a non-key frame image;
inputting the key frame image into a trained first deep neural network to obtain a first semantic segmentation result, and inputting the non-key frame image into a trained second deep neural network to obtain a second semantic segmentation result, wherein the trained first deep neural network comprises a first full convolution network, the trained second deep neural network comprises a second full convolution network, and the number of channels of the first full convolution network is greater than the number of channels of the second full convolution network;
and obtaining a semantic segmentation result of the video image according to the first semantic segmentation result and the second semantic segmentation result.
In a possible implementation manner, the trained first deep neural network further includes a first timing memory unit, and the trained second deep neural network further includes a second timing memory unit; the inputting the key frame image into a trained first deep neural network to obtain a first semantic segmentation result, and inputting the non-key frame image into a trained second deep neural network to obtain a second semantic segmentation result, including:
inputting a first memory unit state of a first image and the key frame image into the trained first deep neural network to obtain a first semantic segmentation result of the key frame image, wherein the first image is a previous frame image of the key frame image, and the first memory unit state is used for indicating a difference characteristic between the first image and the previous frame image of the first image;
and inputting a second memory unit state of a second image and the non-key frame image into the trained second deep neural network to obtain a second semantic segmentation result of the non-key frame image, wherein the second image is a previous frame image of the non-key frame image, and the second memory unit state is used for indicating a difference characteristic between the second image and the previous frame image of the second image.
In a possible implementation manner, the first timing memory unit is specifically a first convolution long-short term memory network, and the inputting the first memory unit state of the first image and the key frame image into the trained first deep neural network to obtain the first semantic segmentation result of the key frame image includes:
inputting the key frame image into the first full convolution network to obtain a corresponding first semantic segmentation feature;
inputting the first semantic segmentation feature and the first memory unit state of the first image into the first convolution long-short term memory network to obtain a first semantic segmentation result of the key frame image;
the second time-series memory unit is specifically a second convolution long-short term memory network, and the inputting of a second memory unit state of a second image and the non-key frame image into the trained second deep neural network to obtain a second semantic segmentation result of the non-key frame image includes:
inputting the non-key frame image into the second full convolution network to obtain a corresponding second semantic segmentation feature;
and inputting the second semantic segmentation characteristic and the second memory unit state of the second image into the second convolution long-short term memory network to obtain a second semantic segmentation result of the non-key frame image.
In a possible implementation manner, the obtaining a semantic segmentation result of the video image according to the first semantic segmentation result and the second semantic segmentation result includes:
performing upsampling on the first semantic segmentation result to obtain a first segmentation image, wherein the first segmentation image is consistent with the key frame image in size;
performing upsampling on the second semantic segmentation result to obtain a second segmentation image, wherein the size of the second segmentation image is consistent with that of the non-key frame image;
and obtaining the semantic segmentation result of the video image according to the first segmentation image and the second segmentation image.
In one possible implementation, the trained first deep neural network and the trained second deep neural network are obtained by:
acquiring a sample image sequence and a sample annotation result, wherein the sample image sequence comprises a sample key frame image and a sample non-key frame image, and the sample annotation result comprises annotation information of semantic segmentation of the sample key frame image and the sample non-key frame image;
obtaining the trained first deep neural network according to a first deep neural network, the sample key frame and labeling information of semantic segmentation of the sample key frame, wherein the first deep neural network is constructed by a first full convolution network and a first time sequence memory unit;
processing the first deep neural network to obtain a second deep neural network;
and obtaining a trained second deep neural network according to the second deep neural network, the sample non-key frame image and the labeling information of the semantic segmentation of the sample non-key frame image.
In a possible implementation manner, the obtaining a trained first deep neural network according to the sample key frame and the labeling information of the semantic segmentation of the sample key frame includes:
inputting a sample memory cell state of a first sample image and the sample key frame image into the first deep neural network to obtain a first sample memory cell state and a first sample segmentation feature of the sample key frame image, wherein the first sample image is a previous frame image of the sample key frame image, and the sample memory cell state of the first sample image is used for indicating a difference feature of the first sample image and the previous frame image of the first sample image;
obtaining a first loss function according to the labeling information of semantic segmentation of the sample key frame image and the first sample segmentation feature;
and adjusting the weight parameters of the first deep neural network according to the first loss function to obtain the trained first deep neural network.
In a possible implementation manner, the obtaining a trained second deep neural network according to the second deep neural network, the sample non-key frame image, and the labeling information of semantic segmentation of the sample non-key frame image includes:
inputting a sample memory cell state of a second sample image and the sample non-key frame image into the second deep neural network to obtain a second sample memory cell state and a second sample segmentation feature of the sample non-key frame image, wherein the second sample image is a previous frame image of the sample non-key frame image, and the sample memory cell state of the second sample image is used for indicating a difference feature between the second sample image and the previous frame image of the second sample image;
obtaining a second loss function according to the labeling information of the semantic segmentation of the sample non-key frame image and the second sample segmentation characteristic;
and adjusting the weight parameters of the second deep neural network according to the second loss function to obtain the trained second deep neural network.
In a possible implementation manner, the processing the first deep neural network to obtain a second deep neural network includes:
cutting the number of channels of the first full convolution network and/or the number of convolution layers of the first full convolution network to obtain a second full convolution network;
and obtaining the second deep neural network based on the second full convolution network and the second time sequence memory unit.
In a second aspect, an embodiment of the present invention provides a video semantic segmentation apparatus, including:
the device comprises an acquisition module, a processing module and a display module, wherein the acquisition module is used for acquiring an image sequence according to a video image, and the image sequence comprises a key frame image and a non-key frame image;
the processing module is used for inputting the key frame images into a trained first deep neural network to obtain a first semantic segmentation result, and inputting the non-key frame images into a trained second deep neural network to obtain a second semantic segmentation result, wherein the trained first deep neural network comprises a first full convolution network, the trained second deep neural network comprises a second full convolution network, and the number of channels of the first full convolution network is greater than the number of channels of the second full convolution network;
and the segmentation module is used for obtaining the semantic segmentation result of the video image according to the first semantic segmentation result and the second semantic segmentation result.
In a possible implementation manner, the trained first deep neural network further includes a first timing memory unit, and the trained second deep neural network further includes a second timing memory unit; the processing module is specifically configured to:
inputting a first memory unit state of a first image and the key frame image into the trained first deep neural network to obtain a first semantic segmentation result of the key frame image, wherein the first image is a previous frame image of the key frame image, and the first memory unit state is used for indicating a difference characteristic between the first image and the previous frame image of the first image;
and inputting a second memory unit state of a second image and the non-key frame image into the trained second deep neural network to obtain a second semantic segmentation result of the non-key frame image, wherein the second image is a previous frame image of the non-key frame image, and the second memory unit state is used for indicating a difference characteristic between the second image and the previous frame image of the second image.
In a possible implementation manner, the first timing memory unit is specifically a first convolutional long-short term memory network, and the processing module is specifically further configured to:
inputting the key frame image into the first full convolution network to obtain a corresponding first semantic segmentation feature;
inputting the first semantic segmentation feature and the first memory unit state of the first image into the first convolution long-short term memory network to obtain a first semantic segmentation result of the key frame image;
the second time-series memory unit is specifically a second convolutional long-short term memory network, and the processing module is further specifically configured to:
inputting the non-key frame image into the second full convolution network to obtain a corresponding second semantic segmentation feature;
and inputting the second semantic segmentation characteristic and the second memory unit state of the second image into the second convolution long-short term memory network to obtain a second semantic segmentation result of the non-key frame image.
In a possible implementation manner, the segmentation module is specifically configured to:
performing upsampling on the first semantic segmentation result to obtain a first segmentation image, wherein the first segmentation image is consistent with the key frame image in size;
performing upsampling on the second semantic segmentation result to obtain a second segmentation image, wherein the size of the second segmentation image is consistent with that of the non-key frame image;
and obtaining the semantic segmentation result of the video image according to the first segmentation image and the second segmentation image.
In one possible implementation manner, the apparatus further includes a training module, and the training module is configured to:
acquiring a sample image sequence and a sample annotation result, wherein the sample image sequence comprises a sample key frame image and a sample non-key frame image, and the sample annotation result comprises annotation information of semantic segmentation of the sample key frame image and the sample non-key frame image;
obtaining the trained first deep neural network according to a first deep neural network, the sample key frame and labeling information of semantic segmentation of the sample key frame, wherein the first deep neural network is constructed by a first full convolution network and a first time sequence memory unit;
processing the first deep neural network to obtain a second deep neural network;
and obtaining a trained second deep neural network according to the second deep neural network, the sample non-key frame image and the labeling information of the semantic segmentation of the sample non-key frame image.
In a possible implementation manner, the training module is further specifically configured to:
inputting a sample memory cell state of a first sample image and the sample key frame image into the first deep neural network to obtain a first sample memory cell state and a first sample segmentation feature of the sample key frame image, wherein the first sample image is a previous frame image of the sample key frame image, and the sample memory cell state of the first sample image is used for indicating a difference feature of the first sample image and the previous frame image of the first sample image;
obtaining a first loss function according to the labeling information of semantic segmentation of the sample key frame image and the first sample segmentation feature;
and adjusting the weight parameters of the first deep neural network according to the first loss function to obtain the trained first deep neural network.
In a possible implementation manner, the training module is further specifically configured to:
inputting a sample memory cell state of a second sample image and the sample non-key frame image into the second deep neural network to obtain a second sample memory cell state and a second sample segmentation feature of the sample non-key frame image, wherein the second sample image is a previous frame image of the sample non-key frame image, and the sample memory cell state of the second sample image is used for indicating a difference feature between the second sample image and the previous frame image of the second sample image;
obtaining a second loss function according to the labeling information of the semantic segmentation of the sample non-key frame image and the second sample segmentation characteristic;
and adjusting the weight parameters of the second deep neural network according to the second loss function to obtain the trained second deep neural network.
In a possible implementation manner, the training module is further specifically configured to:
cutting the number of channels of the first full convolution network and/or the number of convolution layers of the first full convolution network to obtain a second full convolution network;
and obtaining the second deep neural network based on the second full convolution network and the second time sequence memory unit.
In a third aspect, an embodiment of the present invention provides a video semantic segmentation apparatus, including: at least one processor and memory;
the memory stores computer-executable instructions;
the at least one processor executes the computer-executable instructions stored in the memory, causing the at least one processor to perform the video semantic segmentation method according to any one of the first aspect.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium, where computer-executable instructions are stored, and when a processor executes the computer-executable instructions, the video semantic segmentation method according to any one of the first aspect is implemented.
The video semantic segmentation method and device provided by the embodiment of the invention firstly acquire an image sequence according to a video image, divide the image sequence into a key frame image and a non-key frame image, then input the key frame image into a trained first deep neural network to obtain a first semantic segmentation result, input the non-key frame image into a trained second deep neural network to obtain a second semantic segmentation result, and finally obtain a semantic segmentation result of the video image according to the first semantic segmentation result and the second semantic segmentation result. According to the embodiment of the invention, the non-key frame images in the image sequence are input into the trained second deep neural network, and the number of channels of the second full convolution network in the trained second deep neural network is less than that of the first full convolution network in the trained first deep neural network; therefore, compared with the prior art, the video semantic segmentation method provided by the embodiment of the invention consumes less time and the computational load of the model is smaller.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a diagram illustrating semantic segmentation of an image according to an embodiment of the present invention;
fig. 2 is a schematic view of an application scenario of video semantic segmentation according to an embodiment of the present invention;
fig. 3 is a schematic flowchart of a video semantic segmentation method according to an embodiment of the present invention;
fig. 4 is a schematic flowchart of a training process of a first deep neural network and a second deep neural network provided in an embodiment of the present invention;
FIG. 5 is a schematic diagram of the internal structure of ConvLSTM according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of a first deep neural network according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of a second deep neural network according to an embodiment of the present invention;
FIG. 8 is a flowchart illustrating a video semantic segmentation method according to another embodiment of the present invention;
FIG. 9 is a video semantic segmentation framework diagram provided by an embodiment of the present invention;
FIG. 10 is a diagram illustrating a segmentation result of a first trained deep neural network on a keyframe image according to an embodiment of the present invention;
FIG. 11 is a diagram illustrating a segmentation result of a non-key frame image by a trained second deep neural network according to an embodiment of the present invention;
fig. 12 is a schematic structural diagram of a video semantic segmentation apparatus according to an embodiment of the present invention;
fig. 13 is a schematic hardware structure diagram of a video semantic segmentation apparatus according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
To facilitate understanding, terms referred to in the present application are first explained.
Image semantic segmentation: image pixels are grouped according to differences in the image expressing semantic meaning.
The goal of image semantic segmentation is to label the category to which each pixel of an image belongs. In the actual segmentation, instances of the same category are not separated; only the category of each pixel in the image is of concern, and if two objects of the same category exist in the image, image semantic segmentation does not distinguish them as separate objects. Fig. 1 is a schematic diagram of semantic segmentation of an image provided by an embodiment of the present invention. As shown in fig. 1, the left side is an image to be segmented, whose content is a person riding a motorcycle. The image has three categories, namely a person 10, a motorcycle 20 and a background 30, and the purpose of image semantic segmentation is to distinguish the person 10, the motorcycle 20 and the background 30. The actual segmentation is performed for each pixel of the image; as shown in fig. 1, the right side is the result of semantic segmentation of the image, including the segmented person 100, the segmented motorcycle 200, and the segmented background 300.
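As a toy illustration (an addition for this text, not part of the patent embodiments), the output of image semantic segmentation can be viewed as a per-pixel class map; the class ids and array shape below are arbitrary:

```python
import numpy as np

# One class id per pixel; instances of the same class share one id.
PERSON, MOTORCYCLE, BACKGROUND = 0, 1, 2

label_map = np.full((4, 6), BACKGROUND)  # a tiny 4x6 "image", all background
label_map[1:3, 1:3] = PERSON             # pixels labeled as the person
label_map[2:4, 3:5] = MOTORCYCLE         # pixels labeled as the motorcycle
```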
Video semantic segmentation: and performing semantic segmentation on the image sequence.
A video is composed of a series of images; video semantic segmentation converts the video into a corresponding image sequence and performs image semantic segmentation on each image.
Fig. 2 is a schematic view of an application scenario of video semantic segmentation according to an embodiment of the present invention. As shown in fig. 2, the application scenario includes a monitoring device 21 and a server 22, connected by a wired or wireless connection. There are one or more monitoring devices 21, which are mainly used for acquiring a video image and sending it to the server 22. The server 22 obtains an image sequence from the video image, where the image sequence includes key frame images and non-key frame images. The server 22 inputs the key frame image into the trained first deep neural network to obtain a first semantic segmentation result, inputs the non-key frame image into the trained second deep neural network to obtain a second semantic segmentation result, and then obtains the semantic segmentation result of the video image according to the first semantic segmentation result and the second semantic segmentation result.
The system provided by the embodiment of the invention can be applied to various scenarios, such as advanced driver assistance or automatic driving. In the field of automatic driving, the monitoring device 21 may specifically be a vehicle-mounted camera, a sensor, or the like. The monitoring device 21 acquires video images of a road scene and sends them to the server 22. The server 22 performs semantic segmentation on the video images, which can be applied to the recognition of obstacles on a road to help vehicles avoid various obstacles, drive safely, and so on.
The technical solution of the present invention and how to solve the above technical problems will be described in detail with specific examples. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present invention will be described below with reference to the accompanying drawings.
Fig. 3 is a schematic flow chart of a video semantic segmentation method according to an embodiment of the present invention, as shown in fig. 3, including:
s31, an image sequence is acquired from the video images, the image sequence including key frame images and non-key frame images.
A video is composed of successive frame images. A video image to be segmented is first converted into an image sequence, which is obtained by arranging the images in the video in order. Key frame images and non-key frame images can be set according to actual needs. For example, one possible setting is to designate a key frame every fixed number of frames, with non-key frames between two key frames; another possible setting is to manually select some images in the image sequence as key frame images, with the remaining images as non-key frame images, in which case the number of images between key frame images can be chosen arbitrarily. The specific setting method is not particularly limited here. A minimal sketch of the fixed-interval setting is shown below.
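In this sketch the interval k and the function name are illustrative assumptions, not mandated by the patent:

```python
from typing import List, Tuple

def split_key_frames(num_frames: int, k: int) -> Tuple[List[int], List[int]]:
    """Mark every k-th frame as a key frame; the rest are non-key frames."""
    key = [i for i in range(num_frames) if i % k == 0]
    non_key = [i for i in range(num_frames) if i % k != 0]
    return key, non_key

key_idx, non_key_idx = split_key_frames(num_frames=12, k=4)
# key_idx     -> [0, 4, 8]
# non_key_idx -> [1, 2, 3, 5, 6, 7, 9, 10, 11]
```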
And S32, inputting the key frame image into a trained first deep neural network to obtain a first semantic segmentation result, and inputting the non-key frame image into a trained second deep neural network to obtain a second semantic segmentation result, wherein the trained first deep neural network comprises a first full convolution network, the trained second deep neural network comprises a second full convolution network, and the number of channels of the first full convolution network is greater than that of the second full convolution network.
A full convolution network can extract high-level features of an image with stronger representation capability. In the embodiment of the invention, the key frame image is input into the trained first deep neural network, which performs semantic segmentation on the key frame image to obtain a first semantic segmentation result; the trained first deep neural network is obtained by training on sample video images. Similarly, the non-key frame image is input into the trained second deep neural network, which performs semantic segmentation on the non-key frame image to obtain a second semantic segmentation result. The first deep neural network comprises a first full convolution network, the second deep neural network comprises a second full convolution network, and the number of channels of the first full convolution network is greater than that of the second full convolution network. As a result, the trained second deep neural network extracts fewer features from the non-key frame images than the trained first deep neural network extracts from the key frame images, and the reduced number of channels lowers the time consumption and computational load of semantic segmentation for the non-key frame images.
S33, obtaining the semantic segmentation result of the video image according to the first semantic segmentation result and the second semantic segmentation result.
And obtaining a semantic segmentation result of the video image after obtaining a first semantic segmentation result corresponding to the key frame image and a second semantic segmentation result corresponding to the non-key frame image.
The video semantic segmentation method provided by the embodiment of the invention comprises the steps of firstly obtaining an image sequence according to a video image, dividing the image sequence into a key frame image and a non-key frame image, then inputting the key frame image into a trained first deep neural network to obtain a first semantic segmentation result, inputting the non-key frame image into a trained second deep neural network to obtain a second semantic segmentation result, and finally obtaining the semantic segmentation result of the video image according to the first semantic segmentation result and the second semantic segmentation result. According to the embodiment of the invention, the non-key frame images in the image sequence are input into the trained second deep neural network, and the number of channels of the second full convolution network in the trained second deep neural network is less than that of the first full convolution network in the trained first deep neural network; therefore, compared with the prior art, the video semantic segmentation method provided by the embodiment of the invention consumes less time and the computational load of the model is smaller.
The following describes the training process of the trained first deep neural network and the trained second deep neural network in detail with reference to fig. 4 and using a specific embodiment. Fig. 4 is a schematic flowchart of a training process of a first deep neural network and a second deep neural network provided in an embodiment of the present invention, as shown in fig. 4, including:
s41, obtaining a sample image sequence and a sample labeling result, wherein the sample image sequence comprises a sample key frame image and a sample non-key frame image, and the sample labeling result comprises labeling information of semantic segmentation of the sample key frame image and the sample non-key frame image.
First, a sample image sequence and a sample annotation result are obtained, wherein the sample image sequence can be divided into a sample key frame image and a sample non-key frame image. The sample key frame image and the sample non-key frame image are distinguished, a part of the images in the sample image sequence can be set as the sample key frame image, and the rest of the images can be used as the sample non-key frame images. The number of image frames between every two adjacent sample key frame images may be the same or different.
After a sample image sequence is divided into a sample key frame image and a sample non-key frame image, a sample annotation result needs to be obtained, wherein the sample annotation result comprises annotation information of semantic segmentation of each frame of the sample key frame image and annotation information of semantic segmentation of each frame of the sample non-key frame image. That is, the sample labeling result includes a result of semantic segmentation on each frame of sample image, where the sample labeling result may be obtained by manually labeling the sample image or by labeling with a labeling tool.
S42, obtaining the trained first deep neural network according to the first deep neural network, the sample key frame and the labeled information of semantic segmentation of the sample key frame, wherein the first deep neural network is obtained by constructing a first full convolution network and a first time sequence memory unit.
The existing segmentation methods such as K-means clustering and Grab-Cut mainly depend on low-level image pixel features, and their feature expression capability cannot meet the application requirements of complex road scenes. Compared with these methods, the embodiment of the invention adopts a full convolution network, such as FCN, DeepLab or SegNet, to extract high-level features with stronger characterization capability. The time sequence memory unit is used to link the semantic segmentation context information between preceding and succeeding frames of a video image sequence. In the embodiment of the invention, the time sequence memory unit adopts a Convolutional Long Short-Term Memory network (ConvLSTM). ConvLSTM is derived from the Fully Connected Long Short-Term Memory network (FC-LSTM) by replacing the input-to-state and state-to-state parts of the feedforward computation with convolutions. The feedforward computation of FC-LSTM flattens features into one-dimensional vectors and therefore loses spatial information, whereas ConvLSTM not only has the temporal modeling capability of LSTM but can also extract local features like a CNN. Fig. 5 is a schematic diagram of the internal structure of ConvLSTM provided in the embodiment of the present invention. As shown in fig. 5, the working principle of ConvLSTM is as follows:
$$
\begin{aligned}
i_t &= \sigma(W_{xi} * X_t + W_{hi} * H_{t-1} + W_{ci} \circ C_{t-1} + b_i)\\
f_t &= \sigma(W_{xf} * X_t + W_{hf} * H_{t-1} + W_{cf} \circ C_{t-1} + b_f)\\
C_t &= f_t \circ C_{t-1} + i_t \circ \tanh(W_{xc} * X_t + W_{hc} * H_{t-1} + b_c)\\
o_t &= \sigma(W_{xo} * X_t + W_{ho} * H_{t-1} + W_{co} \circ C_t + b_o)\\
H_t &= o_t \circ \tanh(C_t)
\end{aligned}
$$
where ". X" denotes convolution operation, "o" denotes Hadamard matrix multiplication, i.e. multiplication of corresponding elements, itFor inputting thresholds, controlling the input of features, ftIs a feature threshold indicating features that do not need to be passed to the next frame image, itAnd ftCo-determination of Ct,CtFor continuously remembering parts of image features, i.e. information of different feature parts in preceding and succeeding frame images, otTo output the threshold, σ is the activation function.
The first deep neural network is constructed based on the first full convolution network and the first time sequence memory unit; the construction process is described below. For example, the first deep neural network can be designed based on the residual network ResNet-34, which includes 5 convolution modules, 1 average pooling layer and 1 fully connected layer. Fig. 6 is a schematic structural diagram of the first deep neural network according to the embodiment of the present invention; as shown in fig. 6, the final average pooling layer and the fully connected layer are removed in the first deep neural network designed in the embodiment of the present invention. The first convolution module comprises a convolution layer and a ReLU layer, with a 7x7 convolution kernel, 64 channels, and a pooling stride of 2; the second convolution module comprises three convolution layers and a ReLU layer, with a 3x3 convolution kernel, 128 channels, and a pooling stride of 2; the third convolution module comprises 4 residual modules, with a 3x3 convolution kernel, 128 channels, and a pooling stride of 2; the fourth convolution module comprises 6 residual modules, with a 3x3 convolution kernel, 192 channels, and a pooling stride of 2; the fifth convolution module comprises 3 residual modules, with a 3x3 convolution kernel, 192 channels, and a pooling stride of 2.
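A sketch rendering the module layout just described as a configuration table; the kernel sizes, channel counts, strides and residual-module counts are taken from the text, while the builder itself (with plain convolutions standing in for full residual modules) is a simplifying assumption:

```python
import torch.nn as nn

# (kernel size, output channels, stride, number of residual modules)
FIRST_FCN_CFG = [
    (7, 64, 2, 0),   # module 1: conv 7x7 + ReLU
    (3, 128, 2, 0),  # module 2: three 3x3 convs + ReLU (one shown below)
    (3, 128, 2, 4),  # module 3: 4 residual modules
    (3, 192, 2, 6),  # module 4: 6 residual modules
    (3, 192, 2, 3),  # module 5: 3 residual modules
]

def build_backbone(cfg, in_ch: int = 3) -> nn.Sequential:
    layers = []
    for k, out_ch, stride, _n_res in cfg:  # residual internals elided
        layers += [nn.Conv2d(in_ch, out_ch, k, stride=stride, padding=k // 2),
                   nn.ReLU(inplace=True)]
        in_ch = out_ch
    return nn.Sequential(*layers)

first_fcn = build_backbone(FIRST_FCN_CFG)  # no avg-pool / FC head, per Fig. 6
```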
Firstly, inputting a sample memory cell state of a first sample image and the sample key frame image into the first deep neural network to obtain a first sample memory cell state and a first sample segmentation feature of the sample key frame image, wherein the first sample image is a previous frame image of the sample key frame image, and the sample memory cell state of the first sample image is used for indicating a difference feature between the first sample image and the previous frame image of the first sample image.
And obtaining a first loss function according to the marking information of the semantic segmentation of the sample key frame image and the first sample segmentation characteristic.
And adjusting the weight parameters of the first deep neural network according to the first loss function to obtain the trained first deep neural network.
In the training process, the sample key frame image t first undergoes one forward propagation through the first deep neural network to obtain a semantic segmentation feature x_t. The sample memory cell state (H_{t-1}, C_{t-1}) of the first sample image at the previous time and the semantic segmentation feature x_t are input into ConvLSTM, which outputs the first sample memory cell state (H_t, C_t) and the first sample segmentation feature o_t obtained through the memory cell. o_t is upsampled to obtain a first sample segmentation map. The labeling information of the semantic segmentation of the sample key frame is compared with the first sample segmentation map, and with the sum of the Lovász loss and the softmax loss as the objective function, gradients are calculated to update the network parameters of the first deep neural network.
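A sketch of one such training step, assuming the network returns the segmentation feature together with the new memory cell state; a generic lovasz_loss callable stands in for the Lovász term, and the optimizer and interpolation settings are illustrative:

```python
import torch.nn.functional as F

def train_step(net, lovasz_loss, optimizer, frame, prev_state, label):
    o_t, state_t = net(frame, prev_state)            # one forward propagation
    logits = F.interpolate(o_t, size=label.shape[-2:],
                           mode='bilinear', align_corners=False)  # upsample
    # Objective: softmax (cross-entropy) loss plus the Lovász loss.
    loss = F.cross_entropy(logits, label) + lovasz_loss(logits, label)
    optimizer.zero_grad()
    loss.backward()                                  # compute gradients
    optimizer.step()                                 # update weight parameters
    # Detach so gradients do not propagate across frame boundaries.
    return tuple(s.detach() for s in state_t)
```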
And S43, processing the first deep neural network to obtain a second deep neural network.
The design criterion of the second deep neural network is to reduce the computational load of the network as much as possible while ensuring segmentation performance, thereby reducing the time consumed by the semantic segmentation process. Compared with the first deep neural network, the second deep neural network appropriately prunes the number of layers or the number of channels of each layer. Fig. 7 is a schematic structural diagram of the second deep neural network provided in the embodiment of the present invention. As shown in fig. 7, one setting adopted by the second deep neural network is to halve the number of channels of the first 4 convolution modules of the first full convolution network to obtain a second full convolution network, and then obtain the second deep neural network based on the second full convolution network and a second time sequence memory unit, where the second time sequence memory unit is specifically ConvLSTM.
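Continuing the earlier configuration sketch, halving the channel counts of the first four modules could look as follows; the resulting channel numbers follow from the figures above, while the code itself is an illustrative assumption:

```python
# Halve the channels of the first 4 modules; module 5 keeps its channels.
SECOND_FCN_CFG = [
    (k, ch // 2 if idx < 4 else ch, stride, n_res)
    for idx, (k, ch, stride, n_res) in enumerate(FIRST_FCN_CFG)
]
# -> channels 32, 64, 64, 96 for modules 1-4; module 5 stays at 192

second_fcn = build_backbone(SECOND_FCN_CFG)
```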
And S44, obtaining the trained second deep neural network according to the second deep neural network, the sample non-key frame image and the labeling information of the semantic segmentation of the sample non-key frame image.
Firstly, inputting a sample memory unit state of a second sample image and the sample non-key frame image into the second deep neural network to obtain a second sample memory unit state and a second sample segmentation feature of the sample non-key frame image, wherein the second sample image is a previous frame image of the sample non-key frame image, and the sample memory unit state of the second sample image is used for indicating a difference feature between the second sample image and the previous frame image of the second sample image;
obtaining a second loss function according to the labeling information of the semantic segmentation of the sample non-key frame image and the second sample segmentation characteristic;
and adjusting the weight parameters of the second deep neural network according to the second loss function to obtain the trained second deep neural network.
Similarly, for the non-key frame image t+1, one forward propagation is performed using the second deep neural network. The ConvLSTM unit takes as input the sample memory cell state (H_t, C_t) of the second sample image at the previous time and the segmentation feature x_{t+1}, and outputs the second sample memory cell state (H_{t+1}, C_{t+1}) and the second sample segmentation feature o_{t+1} obtained via the memory unit. The parameters of the second deep neural network are updated in a manner similar to that of the first deep neural network.
An upsampling operation is added to each of the first deep neural network and the second deep neural network to obtain segmentation maps whose sizes are consistent with the original images, thereby obtaining end-to-end first and second deep neural networks.
The video semantic segmentation method provided by the embodiment of the invention comprises the steps of firstly obtaining an image sequence according to a video image, dividing the image sequence into a key frame image and a non-key frame image, then inputting the key frame image into a trained first deep neural network to obtain a first semantic segmentation result, inputting the non-key frame image into a trained second deep neural network to obtain a second semantic segmentation result, and finally obtaining the semantic segmentation result of the video image according to the first semantic segmentation result and the second semantic segmentation result. In the training process of the first deep neural network and the second deep neural network, the last time state of the long and short time memory unit buffer image sequence is added on the basis of the full convolution network, and partial features of the previous frame image are transmitted to the next frame image, so that the segmentation performance of the single frame image can be improved by utilizing the sequence relation of the previous frame and the next frame. Meanwhile, the first deep neural network is adopted to extract the segmentation characteristic information of the key frame image, the time sequence memory unit is used for assisting the second deep neural network in segmenting the non-key frame image, and the first deep neural network and the second deep neural network are combined, so that the overall calculation amount of a video semantic segmentation task is reduced, and the time consumption is reduced.
The solution of the present application is illustrated below with a specific example.
Fig. 8 is a schematic flow chart of a video semantic segmentation method according to another embodiment of the present invention, as shown in fig. 8, including:
and S81, acquiring the video image and converting the video image into an image sequence.
Existing image semantic segmentation is generally directed at a single-frame image. In a video image, each frame is related to its preceding and succeeding frames, and some features between frames of an image sequence are the same; the embodiment of the invention is based on this similarity of features between preceding and succeeding frame images. A minimal frame-extraction sketch is shown below.
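A minimal OpenCV sketch of converting a video file into an ordered image sequence; the file path is illustrative:

```python
import cv2

def video_to_frames(path: str):
    cap = cv2.VideoCapture(path)
    frames = []
    while True:
        ok, frame = cap.read()   # read frames in order until exhausted
        if not ok:
            break
        frames.append(frame)
    cap.release()
    return frames

frames = video_to_frames("road_scene.mp4")  # illustrative path
```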
S82, the image sequence is divided into key frame images and non-key frame images.
Specifically, for the image sequence, one key frame image t may be set every fixed number of frames k, and a non-key frame image may be set between every two adjacent key frame images t.
S83, inputting the first memory cell state of the first image and the key frame image into the trained first deep neural network to obtain a first semantic segmentation result of the key frame image.
The trained first deep neural network comprises a first full convolutional network and a first timing memory unit, and the first timing memory unit is ConvLSTM. And inputting a first memory unit state of a first image and the key frame image into the trained first deep neural network to obtain a first semantic segmentation result of the key frame image, wherein the first image is a previous frame image of the key frame image, and the first memory unit state is used for indicating a difference characteristic between the first image and the previous frame image of the first image.
Specifically, the key frame image is input into the first full convolution network to obtain a corresponding first semantic segmentation feature;
and inputting the first semantic segmentation feature and the first memory unit state of the first image into the first convolution long-short term memory network to obtain a first semantic segmentation result of the key frame image.
And S84, inputting the second memory cell state of the second image and the non-key frame image into the trained second deep neural network to obtain a second semantic segmentation result of the non-key frame image.
And inputting a second memory unit state of a second image and the non-key frame image into the trained second deep neural network to obtain a second semantic segmentation result of the non-key frame image, wherein the second image is a previous frame image of the non-key frame image, and the second memory unit state is used for indicating a difference characteristic between the second image and the previous frame image of the second image.
Specifically, the non-key frame image is input into the second full convolution network to obtain a corresponding second semantic segmentation feature;
and inputting the second semantic segmentation characteristic and the second memory unit state of the second image into the second convolution long-short term memory network to obtain a second semantic segmentation result of the non-key frame image.
S85, upsampling the first semantic segmentation result to obtain a first segmentation image, wherein the first segmentation image is consistent with the size of the key frame image, and upsampling the second semantic segmentation result to obtain a second segmentation image, wherein the size of the second segmentation image is consistent with that of the non-key frame image.
In the embodiment of the invention, the first semantic segmentation result and the second semantic segmentation result are up-sampled, so that a first segmentation graph with the size consistent with that of the key frame image and a second segmentation graph with the size consistent with that of the non-key frame image are obtained. And obtaining a semantic segmentation result of the video image according to the first segmentation image and the second segmentation image.
Fig. 9 is a video semantic segmentation framework diagram provided in an embodiment of the present invention. As shown in fig. 9, a video image is divided into an image sequence (t, t+1, t+2, t+3, ...), where image t is a key frame image and images t+1, t+2 and t+3 are non-key frame images. The key frame image t is input into the first full convolution network to obtain a first semantic segmentation feature; the first semantic segmentation feature and the first memory unit state of the first image are then input into the first convolution long-short term memory network to obtain the first semantic segmentation result of the key frame image t and the memory unit state of the first convolution long-short term memory network, which is propagated forward once. After the non-key frame image t+1 is input into the second full convolution network to obtain a second semantic segmentation feature, the memory unit state of the first convolution long-short term memory network and the second semantic segmentation feature are input into the second convolution long-short term memory network to obtain the second semantic segmentation result of the non-key frame image t+1 and the memory unit state of the second convolution long-short term memory network, which is likewise propagated forward. The convolution long-short term memory network caches the features of the corresponding image sequence. Some features of preceding and succeeding frame images in a video sequence are consistent, and the semantic segmentation context information between them can be linked through the convolution long-short term memory network and continuously propagated forward, so that partial features of a preceding frame image can be transmitted to the succeeding frame image. On this premise, the number of layers of the first deep neural network or the number of channels of each layer is pruned to obtain the second deep neural network; fewer features are extracted, but each frame image receives partial features of its preceding frame image. Therefore, the method provided by the embodiment of the invention can reduce the computational load and time consumption of the network while still ensuring the performance of video semantic segmentation. A sketch of this inference loop follows.
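A sketch of the inference loop in the framework of fig. 9, assuming both networks return (segmentation feature, new memory state) and that key frames occur every k frames; all names are illustrative:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def segment_video(frames, first_net, second_net, init_state, k: int = 5):
    state = init_state
    results = []
    for t, frame in enumerate(frames):
        net = first_net if t % k == 0 else second_net  # key vs non-key frame
        seg, state = net(frame, state)                 # state carries (H, C)
        seg = F.interpolate(seg, size=frame.shape[-2:],
                            mode='bilinear', align_corners=False)  # original size
        results.append(seg.argmax(dim=1))              # per-pixel class labels
    return results
```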
Fig. 10 is a schematic diagram of a segmentation result of a trained first deep neural network on a key frame image according to an embodiment of the present invention, and fig. 11 is a schematic diagram of a segmentation result of a trained second deep neural network on a non-key frame image according to an embodiment of the present invention, where in fig. 10 and 11, the key frame image is a previous frame image of the non-key frame image, and during semantic segmentation, the segmentation result of the trained first deep neural network on the key frame image is transmitted to the trained second deep neural network to help the second deep neural network perform semantic segmentation on the non-key frame image.
The video semantic segmentation method provided by the embodiment of the invention comprises the steps of firstly obtaining an image sequence according to a video image, dividing the image sequence into a key frame image and a non-key frame image, then inputting the key frame image into a trained first deep neural network to obtain a first semantic segmentation result, inputting the non-key frame image into a trained second deep neural network to obtain a second semantic segmentation result, and finally obtaining the semantic segmentation result of the video image according to the first semantic segmentation result and the second semantic segmentation result. In the training process of the first deep neural network and the second deep neural network, the last time state of the long and short time memory unit buffer image sequence is added on the basis of the full convolution network, and partial features of the previous frame image are transmitted to the next frame image, so that the segmentation performance of the single frame image can be improved by utilizing the sequence relation of the previous frame and the next frame. Meanwhile, the first deep neural network is adopted to extract the segmentation characteristic information of the key frame image, the time sequence memory unit is used for assisting the second deep neural network in segmenting the non-key frame image, and the first deep neural network and the second deep neural network are combined, so that the overall calculation amount of a video semantic segmentation task is reduced, and the time consumption is reduced.
Fig. 12 is a schematic structural diagram of a video semantic segmentation apparatus according to an embodiment of the present invention, as shown in fig. 12, including an obtaining module 121, a processing module 122, and a segmentation module 123, where:
the obtaining module 121 is configured to obtain an image sequence according to a video image, where the image sequence includes a key frame image and a non-key frame image;
the processing module 122 is configured to input the key frame image into a trained first deep neural network to obtain a first semantic segmentation result, and input the non-key frame image into a trained second deep neural network to obtain a second semantic segmentation result, where the trained first deep neural network includes a first full convolution network, the trained second deep neural network includes a second full convolution network, and the number of channels of the first full convolution network is greater than the number of channels of the second full convolution network;
the segmentation module 123 is configured to obtain a semantic segmentation result of the video image according to the first semantic segmentation result and the second semantic segmentation result.
In a possible implementation manner, the trained first deep neural network further includes a first time-series memory unit, and the trained second deep neural network further includes a second time-series memory unit; the processing module 122 is specifically configured to:
inputting a first memory unit state of a first image and the key frame image into the trained first deep neural network to obtain a first semantic segmentation result of the key frame image, wherein the first image is a previous frame image of the key frame image, and the first memory unit state is used for indicating a difference characteristic between the first image and the previous frame image of the first image;
and inputting a second memory unit state of a second image and the non-key frame image into the trained second deep neural network to obtain a second semantic segmentation result of the non-key frame image, wherein the second image is a previous frame image of the non-key frame image, and the second memory unit state is used for indicating a difference characteristic between the second image and the previous frame image of the second image.
In a possible implementation manner, the first time-series memory unit is specifically a first convolution long short-term memory network, and the processing module 122 is further specifically configured to:
inputting the key frame image into the first full convolution network to obtain a corresponding first semantic segmentation feature;
inputting the first semantic segmentation feature and the first memory unit state of the first image into the first convolution long-short term memory network to obtain a first semantic segmentation result of the key frame image;
the second time-series memory unit is specifically a second convolution long short-term memory network, and the processing module 122 is further specifically configured to:
inputting the non-key frame image into the second full convolution network to obtain a corresponding second semantic segmentation feature;
and inputting the second semantic segmentation characteristic and the second memory unit state of the second image into the second convolution long-short term memory network to obtain a second semantic segmentation result of the non-key frame image.
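The internals of the convolution long short-term memory network are not detailed in the specification. The following is a minimal sketch of one common ConvLSTM cell formulation in PyTorch, in which the hidden state h serves as the segmentation output features and the cell state c plays the role of the memory unit state; kernel size and gating layout are assumptions of the sketch.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Minimal convolutional LSTM cell (a common formulation; the patent
    does not specify kernel sizes or gating details)."""
    def __init__(self, in_ch, hid_ch, kernel=3):
        super().__init__()
        self.hid_ch = hid_ch
        # one convolution produces all four gates at once
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, kernel,
                               padding=kernel // 2)

    def forward(self, x, state=None):
        if state is None:  # first frame of the sequence: zero-initialize
            h = torch.zeros(x.size(0), self.hid_ch, x.size(2), x.size(3),
                            device=x.device)
            c = torch.zeros_like(h)
        else:
            h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)  # memory unit state
        h = torch.sigmoid(o) * torch.tanh(c)   # output features for segmentation
        return h, (h, c)
```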
In a possible implementation manner, the segmentation module 123 is specifically configured to:
upsampling the first semantic segmentation result to obtain a first segmentation image, where the first segmentation image is consistent in size with the key frame image;

upsampling the second semantic segmentation result to obtain a second segmentation image, where the second segmentation image is consistent in size with the non-key frame image;
and obtaining the semantic segmentation result of the video image according to the first segmentation image and the second segmentation image.
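A minimal sketch of this upsampling step, assuming the segmentation results are class logits at reduced resolution (the function and argument names are illustrative, not taken from the specification):

```python
import torch.nn.functional as F

def to_segmentation_map(logits, image):
    """Bilinearly upsample low-resolution logits to the input image size
    and take the per-pixel argmax as the segmentation label map."""
    full = F.interpolate(logits, size=image.shape[-2:], mode='bilinear',
                         align_corners=False)
    return full.argmax(dim=1)  # (N, H, W) label map, same size as the image
```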
In one possible implementation manner, the apparatus further includes a training module, where the training module is configured to:
acquiring a sample image sequence and a sample annotation result, wherein the sample image sequence comprises a sample key frame image and a sample non-key frame image, and the sample annotation result comprises annotation information of semantic segmentation of the sample key frame image and the sample non-key frame image;
obtaining the trained first deep neural network according to a first deep neural network, the sample key frame image, and labeling information of the semantic segmentation of the sample key frame image, where the first deep neural network is constructed from a first full convolution network and a first time-series memory unit;
processing the first deep neural network to obtain a second deep neural network;
and obtaining a trained second deep neural network according to the second deep neural network, the sample non-key frame image and the labeling information of the semantic segmentation of the sample non-key frame image.
In a possible implementation manner, the training module is further specifically configured to:
inputting a sample memory unit state of a first sample image and the sample key frame image into the first deep neural network to obtain a first sample memory unit state and a first sample segmentation feature of the sample key frame image, where the first sample image is the previous frame image of the sample key frame image, and the sample memory unit state of the first sample image is used for indicating a difference feature between the first sample image and the previous frame image of the first sample image;
obtaining a first loss function according to the labeling information of semantic segmentation of the sample key frame image and the first sample segmentation feature;
and adjusting the weight parameters of the first deep neural network according to the first loss function to obtain the trained first deep neural network.
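By way of illustration, this training step might look as follows. The cross-entropy loss, the SGD optimizer, the sample layout, and the model's call signature are all assumptions of the sketch; the specification speaks only of a first loss function and adjustment of weight parameters.

```python
import torch
import torch.nn as nn

def train_first_network(model, samples, epochs=10, lr=1e-3):
    """Fit the first deep neural network on (previous memory unit state,
    sample key frame image, annotation) triples."""
    criterion = nn.CrossEntropyLoss()               # assumed first loss function
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):
        for prev_state, key_frame, annotation in samples:
            logits, new_state = model(key_frame, prev_state)
            loss = criterion(logits, annotation)    # compare features with labels
            optimizer.zero_grad()
            loss.backward()                         # adjust the weight parameters
            optimizer.step()
    return model
```

Training the second deep neural network on the sample non-key frame images (described next) would follow the same pattern with the second loss function.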
In a possible implementation manner, the training module is further specifically configured to:
inputting a sample memory unit state of a second sample image and the sample non-key frame image into the second deep neural network to obtain a second sample memory unit state and a second sample segmentation feature of the sample non-key frame image, where the second sample image is the previous frame image of the sample non-key frame image, and the sample memory unit state of the second sample image is used for indicating a difference feature between the second sample image and the previous frame image of the second sample image;
obtaining a second loss function according to the labeling information of the semantic segmentation of the sample non-key frame image and the second sample segmentation characteristic;
and adjusting the weight parameters of the second deep neural network according to the second loss function to obtain the trained second deep neural network.
In a possible implementation manner, the training module is further specifically configured to:
cutting the number of channels of the first full convolution network and/or the number of convolution layers of the first full convolution network to obtain a second full convolution network;
and obtaining the second deep neural network based on the second full convolution network and the second time sequence memory unit.
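For illustration, the cutting can be pictured as building the same fully convolutional design twice with different channel lists. The specific widths, depth, and class count below are arbitrary examples, not values from the specification.

```python
import torch.nn as nn

def build_fcn(channels, num_classes):
    """Build a toy fully convolutional backbone from a list of channel widths."""
    layers, in_ch = [], 3
    for ch in channels:
        layers += [nn.Conv2d(in_ch, ch, kernel_size=3, padding=1),
                   nn.ReLU(inplace=True)]
        in_ch = ch
    layers.append(nn.Conv2d(in_ch, num_classes, kernel_size=1))
    return nn.Sequential(*layers)

# First full convolution network: wider and deeper.
first_fcn = build_fcn([64, 128, 256, 256], num_classes=19)
# Second full convolution network: channel counts halved and one layer cut,
# as the embodiment describes.
second_fcn = build_fcn([32, 64, 128], num_classes=19)
```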
The apparatus provided in the embodiment of the present invention may be used to implement the technical solutions of the above method embodiments, and the implementation principles and technical effects are similar, which are not described herein again.
Fig. 13 is a schematic diagram of a hardware structure of a video semantic segmentation apparatus according to an embodiment of the present invention. As shown in Fig. 13, the video semantic segmentation apparatus includes at least one processor 131 and a memory 132, where the processor 131 and the memory 132 are connected by a bus 133.
Optionally, the video semantic segmentation apparatus further includes a communication component. For example, the communication component may include a receiver and/or a transmitter.
In a specific implementation, the at least one processor 131 executes computer-executable instructions stored by the memory 132 to cause the at least one processor 131 to perform the video semantic segmentation method as described above.
For the specific implementation process of the processor 131, reference may be made to the above method embodiments; the implementation principles and technical effects are similar and are not described here again.
In the embodiment shown in Fig. 13, it should be understood that the processor may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The steps of the method disclosed in connection with the present invention may be embodied directly in a hardware processor for execution, or executed by a combination of hardware and software modules within the processor.
The memory may comprise high speed RAM memory and may also include non-volatile storage NVM, such as at least one disk memory.
The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, the bus in the figures of the present application is shown as a single line, but this does not mean that there is only one bus or only one type of bus.
The present application further provides a computer-readable storage medium, in which computer-executable instructions are stored, and when a processor executes the computer-executable instructions, the video semantic segmentation method as described above is implemented.
The computer-readable storage medium may be implemented by any type of volatile or non-volatile memory device or combination thereof, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk. Readable storage media can be any available media that can be accessed by a general purpose or special purpose computer.
An exemplary readable storage medium is coupled to the processor such that the processor can read information from, and write information to, the readable storage medium. Of course, the readable storage medium may also be an integral part of the processor. The processor and the readable storage medium may reside in an application-specific integrated circuit (ASIC). Of course, the processor and the readable storage medium may also reside as discrete components in the apparatus.
The division of the units is only a logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
Those of ordinary skill in the art will understand that: all or a portion of the steps of implementing the above-described method embodiments may be performed by hardware associated with program instructions. The program may be stored in a computer-readable storage medium. When executed, the program performs steps comprising the method embodiments described above; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (11)

1. A video semantic segmentation method is characterized by comprising the following steps:
acquiring an image sequence according to a video image, wherein the image sequence comprises a key frame image and a non-key frame image;
inputting the key frame image into a trained first deep neural network to obtain a first semantic segmentation result, and inputting the non-key frame image into a trained second deep neural network to obtain a second semantic segmentation result, wherein the trained first deep neural network comprises a first full convolution network, the trained second deep neural network comprises a second full convolution network, and the number of channels of the first full convolution network is greater than that of channels of the second full convolution network;
and obtaining a semantic segmentation result of the video image according to the first semantic segmentation result and the second semantic segmentation result.
2. The method of claim 1, wherein the trained first deep neural network further comprises a first time-series memory unit, and the trained second deep neural network further comprises a second time-series memory unit; the inputting the key frame image into a trained first deep neural network to obtain a first semantic segmentation result, and inputting the non-key frame image into a trained second deep neural network to obtain a second semantic segmentation result, including:
inputting a first memory unit state of a first image and the key frame image into the trained first deep neural network to obtain a first semantic segmentation result of the key frame image, wherein the first image is a previous frame image of the key frame image, and the first memory unit state is used for indicating a difference characteristic between the first image and the previous frame image of the first image;
and inputting a second memory unit state of a second image and the non-key frame image into the trained second deep neural network to obtain a second semantic segmentation result of the non-key frame image, wherein the second image is a previous frame image of the non-key frame image, and the second memory unit state is used for indicating a difference characteristic between the second image and the previous frame image of the second image.
3. The method according to claim 2, wherein the first time-series memory unit is specifically a first convolution long short-term memory network, and the inputting the first memory unit state of the first image and the key frame image into the trained first deep neural network to obtain the first semantic segmentation result of the key frame image comprises:
inputting the key frame image into the first full convolution network to obtain a corresponding first semantic segmentation feature;
inputting the first semantic segmentation feature and the first memory unit state of the first image into the first convolution long-short term memory network to obtain a first semantic segmentation result of the key frame image;
the second time-series memory unit is specifically a second convolution long short-term memory network, and the inputting the second memory unit state of the second image and the non-key frame image into the trained second deep neural network to obtain the second semantic segmentation result of the non-key frame image comprises:
inputting the non-key frame image into the second full convolution network to obtain a corresponding second semantic segmentation feature;
and inputting the second semantic segmentation characteristic and the second memory unit state of the second image into the second convolution long-short term memory network to obtain a second semantic segmentation result of the non-key frame image.
4. The method according to claim 1, wherein obtaining the semantic segmentation result of the video image according to the first semantic segmentation result and the second semantic segmentation result comprises:
performing upsampling on the first semantic segmentation result to obtain a first segmentation image, wherein the first segmentation image is consistent with the key frame image in size;
performing upsampling on the second semantic segmentation result to obtain a second segmentation image, wherein the size of the second segmentation image is consistent with that of the non-key frame image;
and obtaining the semantic segmentation result of the video image according to the first segmentation image and the second segmentation image.
5. The method of claim 2, wherein the trained first deep neural network and the trained second deep neural network are obtained by:
acquiring a sample image sequence and a sample annotation result, wherein the sample image sequence comprises a sample key frame image and a sample non-key frame image, and the sample annotation result comprises annotation information of semantic segmentation of the sample key frame image and the sample non-key frame image;
obtaining the trained first deep neural network according to a first deep neural network, the sample key frame image, and labeling information of the semantic segmentation of the sample key frame image, wherein the first deep neural network is constructed from a first full convolution network and a first time-series memory unit;
processing the first deep neural network to obtain a second deep neural network;
and obtaining a trained second deep neural network according to the second deep neural network, the sample non-key frame image and the labeling information of the semantic segmentation of the sample non-key frame image.
6. The method of claim 5, wherein the obtaining the trained first deep neural network according to the first deep neural network, the sample key frame image, and the labeling information of the semantic segmentation of the sample key frame image comprises:
inputting a sample memory unit state of a first sample image and the sample key frame image into the first deep neural network to obtain a first sample memory unit state and a first sample segmentation feature of the sample key frame image, wherein the first sample image is the previous frame image of the sample key frame image, and the sample memory unit state of the first sample image is used for indicating a difference feature between the first sample image and the previous frame image of the first sample image;
obtaining a first loss function according to the labeling information of semantic segmentation of the sample key frame image and the first sample segmentation feature;
and adjusting the weight parameters of the first deep neural network according to the first loss function to obtain the trained first deep neural network.
7. The method of claim 5, wherein the obtaining the trained second deep neural network according to the second deep neural network, the sample non-key frame image, and the labeling information of the semantic segmentation of the sample non-key frame image comprises:
inputting a sample memory unit state of a second sample image and the sample non-key frame image into the second deep neural network to obtain a second sample memory unit state and a second sample segmentation feature of the sample non-key frame image, wherein the second sample image is the previous frame image of the sample non-key frame image, and the sample memory unit state of the second sample image is used for indicating a difference feature between the second sample image and the previous frame image of the second sample image;
obtaining a second loss function according to the labeling information of the semantic segmentation of the sample non-key frame image and the second sample segmentation characteristic;
and adjusting the weight parameters of the second deep neural network according to the second loss function to obtain the trained second deep neural network.
8. The method of claim 5, wherein the processing the first deep neural network to obtain a second deep neural network comprises:
cutting the number of channels of the first full convolution network and/or the number of convolution layers of the first full convolution network to obtain a second full convolution network;
and obtaining the second deep neural network based on the second full convolution network and the second time sequence memory unit.
9. A video semantic segmentation apparatus, comprising:
the device comprises an acquisition module, a processing module and a display module, wherein the acquisition module is used for acquiring an image sequence according to a video image, and the image sequence comprises a key frame image and a non-key frame image;
the processing module is used for inputting the key frame images into a trained first deep neural network to obtain a first semantic segmentation result, and inputting the non-key frame images into a trained second deep neural network to obtain a second semantic segmentation result, wherein the trained first deep neural network comprises a first full convolution network, the trained second deep neural network comprises a second full convolution network, and the number of channels of the first full convolution network is greater than that of channels of the second full convolution network;
and the segmentation module is used for obtaining the semantic segmentation result of the video image according to the first semantic segmentation result and the second semantic segmentation result.
10. A video semantic segmentation apparatus, comprising: at least one processor and memory;
the memory stores computer-executable instructions;
the at least one processor executing the computer-executable instructions stored by the memory causes the at least one processor to perform the video semantic segmentation method of any of claims 1 to 8.
11. A computer-readable storage medium having computer-executable instructions stored thereon, which when executed by a processor, implement the video semantic segmentation method according to any one of claims 1 to 8.