CN112465826B - Video semantic segmentation method and device

Video semantic segmentation method and device

Info

Publication number
CN112465826B
CN112465826B (application CN201910840038.2A)
Authority
CN
China
Prior art keywords
image
frame image
key frame
sample
semantic segmentation
Prior art date
Legal status
Active
Application number
CN201910840038.2A
Other languages
Chinese (zh)
Other versions
CN112465826A (en)
Inventor
吴长虹
张明
邝宏武
Current Assignee
Shanghai Goldway Intelligent Transportation System Co Ltd
Original Assignee
Shanghai Goldway Intelligent Transportation System Co Ltd
Priority date
Application filed by Shanghai Goldway Intelligent Transportation System Co Ltd
Priority to CN201910840038.2A
Publication of CN112465826A
Application granted
Publication of CN112465826B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/10 Segmentation; Edge detection
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10016 Video; Image sequence
    • G06T 2207/20081 Training; Learning
    • G06T 2207/20084 Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The embodiment of the invention provides a video semantic segmentation method and a device, wherein the method comprises the following steps: acquiring an image sequence according to a video image, wherein the image sequence comprises a key frame image and a non-key frame image; inputting the key frame image into a trained first deep neural network to obtain a first semantic segmentation result, and inputting the non-key frame image into a trained second deep neural network to obtain a second semantic segmentation result, wherein the trained first deep neural network comprises a first full convolution network, the trained second deep neural network comprises a second full convolution network, and the number of channels of the first full convolution network is larger than that of the second full convolution network; and obtaining the semantic segmentation result of the video image according to the first semantic segmentation result and the second semantic segmentation result. The video semantic segmentation method and device provided by the embodiment of the invention can solve the problems of high time consumption and heavy computation in the prior art.

Description

Video semantic segmentation method and device
Technical Field
The embodiment of the invention relates to the technical field of image processing, in particular to a video semantic segmentation method and device.
Background
Semantic segmentation refers to grouping image pixels according to the differences in semantic meaning expressed in an image. In the field of automatic driving, after images of the road are captured by a vehicle-mounted camera or a laser radar, the images can be segmented and classified by semantic segmentation so that obstacles such as pedestrians and vehicles can be avoided. Semantic segmentation of road scenes is therefore of great significance for automatic driving.
The existing semantic segmentation methods for road scenes divide the detected video into individual frames, acquire global and local context information for each frame, and perform semantic segmentation on each frame separately. In practice, the collected road scene is usually presented as a video, and a video may contain many frames, each of which requires semantic segmentation. Therefore, when multiple frames are segmented, the large number of images means that, even while segmentation performance is guaranteed, these methods are generally time-consuming and their models require a large amount of computation.
Therefore, a video semantic segmentation method is needed to solve the problems of high time consumption and heavy computation in the prior art.
Disclosure of Invention
The embodiment of the invention provides a video semantic segmentation method and device, which are used for solving the problems of high time consumption and heavy computation in the prior art.
In a first aspect, an embodiment of the present invention provides a video semantic segmentation method, including:
acquiring an image sequence according to a video image, wherein the image sequence comprises a key frame image and a non-key frame image;
inputting the key frame image into a trained first deep neural network to obtain a first semantic segmentation result, and inputting the non-key frame image into a trained second deep neural network to obtain a second semantic segmentation result, wherein the trained first deep neural network comprises a first full convolution network, the trained second deep neural network comprises a second full convolution network, and the number of channels of the first full convolution network is larger than that of the second full convolution network;
and obtaining the semantic segmentation result of the video image according to the first semantic segmentation result and the second semantic segmentation result.
In one possible implementation, the trained first deep neural network further includes a first time sequence memory unit, and the trained second deep neural network further includes a second time sequence memory unit; inputting the key frame image into a trained first deep neural network to obtain a first semantic segmentation result, and inputting the non-key frame image into a trained second deep neural network to obtain a second semantic segmentation result, includes the following steps:
inputting a first memory unit state of a first image and the key frame image into the trained first deep neural network to obtain a first semantic segmentation result of the key frame image, wherein the first image is a previous frame image of the key frame image, and the first memory unit state is used for indicating the difference features between the first image and the previous frame image of the first image;
and inputting a second memory unit state of a second image and the non-key frame image into the trained second deep neural network to obtain a second semantic segmentation result of the non-key frame image, wherein the second image is a previous frame image of the non-key frame image, and the second memory unit state is used for indicating the difference features between the second image and the previous frame image of the second image.
In one possible implementation manner, the first time sequence memory unit is specifically a first convolution long-short-term memory network, and the inputting the first memory unit state of the first image and the key frame image into the trained first deep neural network to obtain a first semantic segmentation result of the key frame image includes:
Inputting the key frame image into the first full convolution network to obtain a corresponding first semantic segmentation feature;
inputting the first semantic segmentation feature and the first memory unit state of the first image into the first convolution long-short-term memory network to obtain a first semantic segmentation result of the key frame image;
the second time sequence memory unit is specifically a second convolution long-short-term memory network, and the inputting the second memory unit state of the second image and the non-key frame image into the trained second deep neural network to obtain a second semantic segmentation result of the non-key frame image includes:
inputting the non-key frame image into the second full convolution network to obtain a corresponding second semantic segmentation feature;
and inputting the second semantic segmentation feature and the second memory unit state of the second image into the second convolution long-short-term memory network to obtain a second semantic segmentation result of the non-key frame image.
In a possible implementation manner, the obtaining of the semantic segmentation result of the video image according to the first semantic segmentation result and the second semantic segmentation result includes:
Upsampling the first semantic segmentation result to obtain a first segmentation map, wherein the first segmentation map is consistent with the key frame image in size;
upsampling the second semantic segmentation result to obtain a second segmentation map, wherein the second segmentation map is consistent with the non-key frame image in size;
and obtaining the video image semantic segmentation result according to the first segmentation map and the second segmentation map.
In one possible implementation, the trained first deep neural network and the trained second deep neural network are obtained by:
obtaining a sample image sequence and a sample labeling result, wherein the sample image sequence comprises a sample key frame image and a sample non-key frame image, and the sample labeling result comprises labeling information of semantic segmentation of the sample key frame image and the sample non-key frame image;
obtaining the trained first deep neural network according to the first deep neural network, the sample key frame and the labeling information of semantic segmentation of the sample key frame, wherein the first deep neural network is constructed by a first full convolution network and a first time sequence memory unit;
Processing the first deep neural network to obtain a second deep neural network;
and obtaining the trained second deep neural network according to the second deep neural network, the sample non-key frame image and the labeling information of the semantic segmentation of the sample non-key frame image.
In one possible implementation manner, the obtaining of the trained first deep neural network according to the sample key frame image and the labeling information of the semantic segmentation of the sample key frame image includes:
inputting a sample memory unit state of a first sample image and the sample key frame image into the first deep neural network to obtain a first sample memory unit state of the sample key frame image and a first sample segmentation feature, wherein the first sample image is a previous frame image of the sample key frame image, and the sample memory unit state of the first sample image is used for indicating a difference feature between the first sample image and the previous frame image of the first sample image;
obtaining a first loss function according to the labeling information of the semantic segmentation of the sample key frame image and the first sample segmentation feature;
And adjusting the weight parameter of the first deep neural network according to the first loss function to obtain the trained first deep neural network.
In one possible implementation manner, the obtaining of the trained second deep neural network according to the second deep neural network, the sample non-key frame image and the labeling information of the semantic segmentation of the sample non-key frame image includes:
inputting a sample memory unit state of a second sample image and the sample non-key frame image into the second deep neural network to obtain a second sample memory unit state of the sample non-key frame image and a second sample segmentation feature, wherein the second sample image is a previous frame image of the sample non-key frame image, and the sample memory unit state of the second sample image is used for indicating a difference feature between the second sample image and the previous frame image of the second sample image;
obtaining a second loss function according to the labeling information of the semantic segmentation of the sample non-key frame image and the second sample segmentation feature;
and adjusting the weight parameter of the second deep neural network according to the second loss function to obtain the trained second deep neural network.
In one possible implementation manner, the processing of the first deep neural network to obtain a second deep neural network includes:
cutting the channel number of the first full convolution network and/or the convolution layer number of the first full convolution network to obtain the second full convolution network;
and obtaining the second deep neural network based on the second full convolution network and the second time sequence memory unit.
In a second aspect, an embodiment of the present invention provides a video semantic segmentation apparatus, including:
the acquisition module is used for acquiring an image sequence according to the video image, wherein the image sequence comprises a key frame image and a non-key frame image;
the processing module is used for inputting the key frame image into a trained first deep neural network to obtain a first semantic segmentation result, and inputting the non-key frame image into a trained second deep neural network to obtain a second semantic segmentation result, wherein the trained first deep neural network comprises a first full convolution network, the trained second deep neural network comprises a second full convolution network, and the number of channels of the first full convolution network is larger than that of the second full convolution network;
The segmentation module is used for obtaining the semantic segmentation result of the video image according to the first semantic segmentation result and the second semantic segmentation result.
In one possible implementation, the trained first deep neural network further includes a first time sequence memory unit, and the trained second deep neural network further includes a second time sequence memory unit; the processing module is specifically configured to:
inputting a first memory unit state of a first image and the key frame image into the trained first deep neural network to obtain a first semantic segmentation result of the key frame image, wherein the first image is a previous frame image of the key frame image, and the first memory unit state is used for indicating the difference features between the first image and the previous frame image of the first image;
and inputting a second memory unit state of a second image and the non-key frame image into the trained second deep neural network to obtain a second semantic segmentation result of the non-key frame image, wherein the second image is a previous frame image of the non-key frame image, and the second memory unit state is used for indicating the difference features between the second image and the previous frame image of the second image.
In a possible implementation manner, the first time sequence memory unit is specifically a first convolution long-short-term memory network, and the processing module is specifically further configured to:
inputting the key frame image into the first full convolution network to obtain a corresponding first semantic segmentation feature;
inputting the first semantic segmentation feature and the first memory unit state of the first image into the first convolution long-short-term memory network to obtain a first semantic segmentation result of the key frame image;
the second time sequence memory unit is specifically a second convolution long-short term memory network, and the processing module is specifically further configured to:
inputting the non-key frame image into the second full convolution network to obtain a corresponding second semantic segmentation feature;
and inputting the second semantic segmentation feature and the second memory unit state of the second image into the second convolution long-short-term memory network to obtain a second semantic segmentation result of the non-key frame image.
In one possible implementation manner, the segmentation module is specifically configured to:
upsampling the first semantic segmentation result to obtain a first segmentation map, wherein the first segmentation map is consistent with the key frame image in size;
Upsampling the second semantic segmentation result to obtain a second segmentation map, wherein the second segmentation map is consistent with the non-key frame image in size;
and obtaining the video image semantic segmentation result according to the first segmentation map and the second segmentation map.
In one possible implementation manner, the device further comprises a training module, wherein the training module is used for:
obtaining a sample image sequence and a sample labeling result, wherein the sample image sequence comprises a sample key frame image and a sample non-key frame image, and the sample labeling result comprises labeling information of semantic segmentation of the sample key frame image and the sample non-key frame image;
obtaining the trained first deep neural network according to the first deep neural network, the sample key frame and the labeling information of semantic segmentation of the sample key frame, wherein the first deep neural network is constructed by a first full convolution network and a first time sequence memory unit;
processing the first deep neural network to obtain a second deep neural network;
and obtaining the trained second deep neural network according to the second deep neural network, the sample non-key frame image and the labeling information of the semantic segmentation of the sample non-key frame image.
In a possible implementation manner, the training module is specifically further configured to:
inputting a sample memory unit state of a first sample image and the sample key frame image into the first deep neural network to obtain a first sample memory unit state of the sample key frame image and a first sample segmentation feature, wherein the first sample image is a previous frame image of the sample key frame image, and the sample memory unit state of the first sample image is used for indicating a difference feature between the first sample image and the previous frame image of the first sample image;
obtaining a first loss function according to the labeling information of the semantic segmentation of the sample key frame image and the first sample segmentation feature;
and adjusting the weight parameter of the first deep neural network according to the first loss function to obtain the trained first deep neural network.
In a possible implementation manner, the training module is specifically further configured to:
inputting a sample memory unit state of a second sample image and the sample non-key frame image into the second deep neural network to obtain a second sample memory unit state of the sample non-key frame image and a second sample segmentation feature, wherein the second sample image is a previous frame image of the sample non-key frame image, and the sample memory unit state of the second sample image is used for indicating a difference feature between the second sample image and the previous frame image of the second sample image;
Obtaining a second loss function according to the labeling information of the semantic segmentation of the sample non-key frame image and the second sample segmentation feature;
and adjusting the weight parameter of the second deep neural network according to the second loss function to obtain the trained second deep neural network.
In a possible implementation manner, the training module is specifically further configured to:
cutting the channel number of the first full convolution network and/or the convolution layer number of the first full convolution network to obtain the second full convolution network;
and obtaining the second deep neural network based on the second full convolution network and the second time sequence memory unit.
In a third aspect, an embodiment of the present invention provides a video semantic segmentation apparatus, including: at least one processor and memory;
the memory stores computer-executable instructions;
the at least one processor executes the computer-executable instructions stored in the memory, causing the at least one processor to perform the video semantic segmentation method as set forth in any one of the first aspects.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium, where computer-executable instructions are stored, and when executed by a processor, implement the video semantic segmentation method according to any one of the first aspects.
According to the video semantic segmentation method and device, firstly, an image sequence is obtained according to a video image and divided into key frame images and non-key frame images; then the key frame images are input into a trained first deep neural network to obtain a first semantic segmentation result, and the non-key frame images are input into a trained second deep neural network to obtain a second semantic segmentation result; finally, the semantic segmentation result of the video image is obtained according to the first semantic segmentation result and the second semantic segmentation result. Because the non-key frame images in the image sequence are input into the trained second deep neural network, whose second full convolution network has fewer channels than the first full convolution network in the trained first deep neural network, the method reduces time consumption compared with the prior art and requires less computation from the model.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the drawings needed in the description of the embodiments or of the prior art are briefly introduced below. Obviously, the drawings in the following description show some embodiments of the present invention, and a person skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a schematic diagram of image semantic segmentation provided by an embodiment of the present invention;
fig. 2 is a schematic view of an application scenario of video semantic segmentation according to an embodiment of the present invention;
fig. 3 is a flow chart of a video semantic segmentation method according to an embodiment of the present invention;
fig. 4 is a schematic flow chart of a training process of a first deep neural network and a second deep neural network according to an embodiment of the present invention;
fig. 5 is a schematic diagram of an internal structure of a ConvLSTM according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of a first deep neural network according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of a second deep neural network according to an embodiment of the present invention;
fig. 8 is a flowchart of a video semantic segmentation method according to another embodiment of the present invention;
FIG. 9 is a diagram of a video semantic segmentation framework provided by an embodiment of the present invention;
FIG. 10 is a schematic diagram of a segmentation result of a key frame image by a trained first deep neural network according to an embodiment of the present invention;
FIG. 11 is a schematic diagram of a segmentation result of a trained second deep neural network on a non-key frame image according to an embodiment of the present invention;
fig. 12 is a schematic structural diagram of a video semantic segmentation device according to an embodiment of the present invention;
Fig. 13 is a schematic hardware structure of a video semantic segmentation device according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
For ease of understanding, the terminology referred to in this application will be explained first.
Image semantic segmentation: the image pixels are grouped according to differences in the semantic meaning expressed in the image.
The object of image semantic segmentation is to label the class to which each pixel of the image belongs. In the actual segmentation, instances of the same class are not separated; only the class of each pixel in the image is of concern, and if two objects of the same class exist in the image, image semantic segmentation does not divide them into separate objects. Fig. 1 is a schematic diagram of image semantic segmentation. As shown in fig. 1, the left side is an image to be segmented, showing a person riding a motorcycle; there are three categories in the image, namely a person 10, a motorcycle 20 and a background 30, and the purpose of image semantic segmentation is to distinguish the person 10, the motorcycle 20 and the background 30. In the actual segmentation, the image is segmented pixel by pixel. As shown on the right of fig. 1, the result of the image semantic segmentation includes the segmented person 100, the segmented motorcycle 200, and the segmented background 300.
Video semantic segmentation: semantic segmentation is performed on the image sequence.
A video is composed of a series of images; video semantic segmentation first converts the video into the corresponding image sequence and then performs image semantic segmentation on each image.
Fig. 2 is a schematic view of an application scenario of video semantic segmentation provided in an embodiment of the present invention. As shown in fig. 2, the scenario includes a monitoring device 21 and a server 22, which are connected by wire or wirelessly. There may be one or more monitoring devices 21, which are mainly used for acquiring video images and sending them to the server 22. The server 22 obtains an image sequence from the video image, wherein the image sequence includes key frame images and non-key frame images. The server 22 inputs the key frame images into the trained first deep neural network to obtain a first semantic segmentation result, inputs the non-key frame images into the trained second deep neural network to obtain a second semantic segmentation result, and then obtains the semantic segmentation result of the video image according to the first semantic segmentation result and the second semantic segmentation result.
The system provided by the embodiment of the invention can be applied to various scenarios, including advanced driver assistance and automatic driving. In the field of automatic driving, the monitoring device 21 may specifically be an on-board camera, a sensor or the like. The monitoring device 21 acquires a video image of the road scene and sends it to the server 22. The server 22 performs semantic segmentation on the video image, which can be applied to the recognition of obstacles on the road, helping vehicles avoid various obstacles, drive safely, and so on.
The following describes the technical scheme of the present invention and how the technical scheme of the present application solves the above technical problems in detail with specific embodiments. The following embodiments may be combined with each other, and the same or similar concepts or processes may not be described in detail in some embodiments. Embodiments of the present invention will be described below with reference to the accompanying drawings.
Fig. 3 is a flow chart of a video semantic segmentation method according to an embodiment of the present invention, as shown in fig. 3, including:
s31, acquiring an image sequence according to the video image, wherein the image sequence comprises a key frame image and a non-key frame image.
A video is composed of frame-by-frame images. For a segment of video to be segmented, the video image is first transformed into an image sequence, obtained by arranging the images of the video in order. Key frame images and non-key frame images may be set according to actual needs. For example, one possible setting is to designate a key frame every fixed number of frames, with the non-key frames lying between two key frames; another possible setting is to manually select some of the images in the image sequence as key frame images, with the remaining images as non-key frame images, in which case the number of images between key frame images may be arbitrary. The specific setting method is not limited here.
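For illustration only, a minimal sketch of the fixed-interval setting is given below; the interval k, the function name and the return convention are assumptions made for this example, not part of the patent.

```python
# Illustrative only: fixed-interval key frame selection (interval k assumed).
def split_image_sequence(num_frames, k=5):
    """Return (key_frame_indices, non_key_frame_indices) for a sequence."""
    key = [i for i in range(num_frames) if i % k == 0]
    non_key = [i for i in range(num_frames) if i % k != 0]
    return key, non_key

# Example: 10 frames with a key frame every 5 frames.
# split_image_sequence(10) -> ([0, 5], [1, 2, 3, 4, 6, 7, 8, 9])
```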
S32, inputting the key frame image into a trained first deep neural network to obtain a first semantic segmentation result, and inputting the non-key frame image into a trained second deep neural network to obtain a second semantic segmentation result, wherein the trained first deep neural network comprises a first full convolution network, the trained second deep neural network comprises a second full convolution network, and the number of channels of the first full convolution network is larger than that of the second full convolution network.
In the embodiment of the invention, a key frame image is input into the trained first deep neural network, which performs semantic segmentation on the key frame image to obtain a first semantic segmentation result; the trained first deep neural network is obtained by training on sample video images. Similarly, a non-key frame image is input into the trained second deep neural network, which performs semantic segmentation on the non-key frame image to obtain a second semantic segmentation result. The first deep neural network comprises a first full convolution network, the second deep neural network comprises a second full convolution network, and the number of channels of the first full convolution network is larger than that of the second full convolution network. As a result, the trained second deep neural network extracts fewer features from non-key frame images than the trained first deep neural network extracts from key frame images, and the reduced number of channels lowers the time consumption and computation of semantic segmentation for non-key frame images.
S33, obtaining the semantic segmentation result of the video image according to the first semantic segmentation result and the second semantic segmentation result.
After the first semantic segmentation result corresponding to the key frame images and the second semantic segmentation result corresponding to the non-key frame images are obtained, the semantic segmentation result of the video image is obtained from them.
According to the video semantic segmentation method provided by the embodiment of the invention, firstly, an image sequence is acquired according to a video image and divided into key frame images and non-key frame images; then the key frame images are input into a trained first deep neural network to obtain a first semantic segmentation result, and the non-key frame images are input into a trained second deep neural network to obtain a second semantic segmentation result; finally, the semantic segmentation result of the video image is obtained according to the first semantic segmentation result and the second semantic segmentation result. Because the non-key frame images in the image sequence are input into the trained second deep neural network, whose second full convolution network has fewer channels than the first full convolution network in the trained first deep neural network, the method reduces time consumption compared with the prior art and requires less computation from the model.
The training process of the trained first deep neural network and the trained second deep neural network is described in detail below with reference to fig. 4, and with specific embodiments. Fig. 4 is a schematic flow chart of a training process of the first deep neural network and the second deep neural network according to an embodiment of the present invention, as shown in fig. 4, including:
s41, acquiring a sample image sequence and a sample labeling result, wherein the sample image sequence comprises a sample key frame image and a sample non-key frame image, and the sample labeling result comprises labeling information of semantic segmentation of the sample key frame image and the sample non-key frame image.
First, a sample image sequence and a sample labeling result are obtained, where the sample image sequence can be divided into sample key frame images and sample non-key frame images. To distinguish them, a part of the images in the sample image sequence can be set as sample key frame images, with the remaining images serving as sample non-key frame images. The number of image frames between every two adjacent sample key frame images may be the same or different.
After the sample image sequence is divided into a sample key frame image and a sample non-key frame image, a sample labeling result needs to be obtained, wherein the sample labeling result comprises labeling information of semantic segmentation of each frame of sample key frame image and labeling information of semantic segmentation of each frame of sample non-key frame image. The sample labeling result comprises a result of semantic segmentation of each frame of sample image, wherein the sample labeling result can be obtained by manually labeling the sample image or by labeling the sample image by means of a labeling tool.
S42, obtaining the trained first deep neural network according to the first deep neural network, the sample key frame and the labeling information of semantic segmentation of the sample key frame, wherein the first deep neural network is constructed by a first full convolution network and a first time sequence memory unit.
Existing segmentation methods such as K-means clustering and Grab-Cut mainly depend on low-level image pixel features, and their feature expression capability cannot meet the application requirements of complex road scenes. Compared with these methods, embodiments of the present invention employ a full convolution network, such as FCN, DeepLab or SegNet, to extract higher-level, more expressive features. The time sequence memory unit is used to link the semantic segmentation context information between preceding and following frames of the video image sequence. In the embodiment of the invention, the time sequence memory unit adopts a convolution long-short-term memory network (Convolutional Long Short-Term Memory, hereinafter ConvLSTM). ConvLSTM is based on the fully connected long short-term memory network (Fully Connected Long Short-Term Memory, hereinafter FC-LSTM), with the input-to-state and state-to-state parts of the feedforward calculation replaced by convolutions. The feedforward calculation of FC-LSTM flattens features into one-dimensional vectors, so spatial information is lost, whereas ConvLSTM not only has the temporal modeling capability of the LSTM but can also characterize local features like a CNN. Fig. 5 is a schematic diagram of the internal structure of the ConvLSTM according to the embodiment of the present invention; as shown in fig. 5, the working principle of the ConvLSTM is as follows:
$$i_t = \sigma(W_{xi} * X_t + W_{hi} * H_{t-1} + W_{ci} \circ C_{t-1} + b_i)$$
$$f_t = \sigma(W_{xf} * X_t + W_{hf} * H_{t-1} + W_{cf} \circ C_{t-1} + b_f)$$
$$C_t = f_t \circ C_{t-1} + i_t \circ \tanh(W_{xc} * X_t + W_{hc} * H_{t-1} + b_c)$$
$$o_t = \sigma(W_{xo} * X_t + W_{ho} * H_{t-1} + W_{co} \circ C_t + b_o)$$
$$H_t = o_t \circ \tanh(C_t)$$
Where "×" denotes the convolution operation, "o" denotes the Hadamard matrix multiplication, i.e. the multiplication of the corresponding elements, i t For inputting threshold, controlling input of characteristic, f t As a feature threshold, indicating features that do not need to be passed on to the next frame of image, i t And f t Co-determination of C t ,C t Is the continuously memorized part in the image characteristics, namely the information of different characteristic parts in the front and back frame images, o t For the output threshold, σ is the activation function.
The first deep neural network is constructed from the first full convolution network and the first time sequence memory unit; its construction is described below. For example, the first deep neural network may be designed on the basis of the residual network ResNet-34. ResNet-34 includes 5 convolution modules, 1 average pooling layer and 1 fully connected layer. Fig. 6 is a schematic structural diagram of the first deep neural network according to an embodiment of the present invention; as shown in fig. 6, the final average pooling layer and fully connected layer are removed in the embodiment of the present invention. The first convolution module comprises a convolution layer and a ReLU layer, with a 7x7 convolution kernel, 64 channels, and a pooling layer interval of 2; the second convolution module comprises three convolution layers and a ReLU layer, with 3x3 kernels, 128 channels, and a pooling layer interval of 2; the third convolution module comprises 4 residual modules, with 3x3 kernels, 128 channels, and a pooling layer interval of 2; the fourth convolution module comprises 6 residual modules, with 3x3 kernels, 192 channels, and a pooling layer interval of 2; the fifth convolution module contains 3 residual modules, with 3x3 kernels, 192 channels, and a pooling layer interval of 2.
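The module layout just described can be summarized as a small configuration table. This is a hedged restatement of the text above, not code from the patent; the field names are assumptions.

```python
# First full convolution network (ResNet-34 style, pooling/FC head removed).
# Fields: (module, block type, number of blocks, kernel size, channels, stride).
first_fcn_config = [
    ("conv1", "conv+relu", 1, 7, 64,  2),
    ("conv2", "conv+relu", 3, 3, 128, 2),
    ("conv3", "residual",  4, 3, 128, 2),
    ("conv4", "residual",  6, 3, 192, 2),
    ("conv5", "residual",  3, 3, 192, 2),
]
```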
Firstly, inputting a sample memory unit state of a first sample image and the sample key frame image into the first deep neural network to obtain a first sample memory unit state of the sample key frame image and a first sample segmentation feature, wherein the first sample image is a previous frame image of the sample key frame image, and the sample memory unit state of the first sample image is used for indicating a difference feature between the first sample image and the previous frame image of the first sample image.
And obtaining a first loss function according to the labeling information of the semantic segmentation of the sample key frame image and the first sample segmentation feature.
And adjusting the weight parameter of the first deep neural network according to the first loss function to obtain the trained first deep neural network.
In the training process, the sample key frame image $t$ is propagated forward once through the first deep neural network to obtain the semantic segmentation feature $x_t$. The ConvLSTM takes as input the sample memory cell state $(H_{t-1}, C_{t-1})$ and the semantic segmentation feature $x_t$, and outputs the first sample memory cell state $(H_t, C_t)$ and the first sample segmentation feature $o_t$ obtained by the memory unit. $o_t$ is up-sampled to obtain the first sample segmentation map. The labeling information of the semantic segmentation of the sample key frame is compared with the first sample segmentation map, and the network parameters of the first deep neural network are updated by gradient descent, taking the sum of the Lovász loss and the softmax loss as the objective function.
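Under the same assumptions as the ConvLSTM sketch above, one key-frame training step might look like the following; `fcn1` and `convlstm1` stand for the first full convolution network and the first ConvLSTM, `lovasz_loss` stands in for a Lovász-loss implementation, and `F.cross_entropy` plays the role of the softmax loss. All of these names are hypothetical.

```python
import torch
import torch.nn.functional as F

def train_step_key_frame(fcn1, convlstm1, frame, state_prev, label,
                         optimizer, lovasz_loss):
    """Hedged sketch of one training step for the first deep neural network."""
    x_t = fcn1(frame)                          # semantic segmentation feature x_t
    o_t, state_t = convlstm1(x_t, state_prev)  # consumes (H_{t-1}, C_{t-1})
    # Up-sample o_t to label resolution to form the first sample segmentation map.
    logits = F.interpolate(o_t, size=label.shape[-2:], mode="bilinear",
                           align_corners=False)
    loss = F.cross_entropy(logits, label) + lovasz_loss(logits, label)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Detach so the state can cross frames without growing the autograd graph.
    return tuple(s.detach() for s in state_t)
```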
S43, processing the first deep neural network to obtain a second deep neural network.
The design criterion of the second deep neural network is to reduce the computation of the network as much as possible while guaranteeing segmentation performance, thereby reducing the time consumption of the semantic segmentation process. Compared with the first deep neural network, the second deep neural network appropriately cuts the number of layers or the number of channels in each layer. Fig. 7 is a schematic structural diagram of the second deep neural network according to an embodiment of the present invention. As shown in fig. 7, one of the settings adopted for the second deep neural network is to halve the number of channels of the first 4 convolution modules of the first full convolution network, obtaining the second full convolution network; the second deep neural network is then obtained from the second full convolution network and a second time sequence memory unit, which is likewise a ConvLSTM.
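Continuing the hedged configuration sketch from above, the channel halving can be expressed directly; this is an illustration of the derivation, not patent code.

```python
# Halve the channels of the first 4 convolution modules of the first full
# convolution network to obtain the second full convolution network.
second_fcn_config = [
    (name, kind, blocks, kernel, ch // 2 if i < 4 else ch, stride)
    for i, (name, kind, blocks, kernel, ch, stride) in enumerate(first_fcn_config)
]
# Channel counts become 32, 64, 64, 96, 192.
```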
S44, obtaining the trained second deep neural network according to the second deep neural network, the sample non-key frame image and the labeling information of the semantic segmentation of the sample non-key frame image.
Firstly, inputting a sample memory unit state of a second sample image and the sample non-key frame image into the second deep neural network to obtain a second sample memory unit state of the sample non-key frame image and a second sample segmentation feature, wherein the second sample image is a previous frame image of the sample non-key frame image, and the sample memory unit state of the second sample image is used for indicating a difference feature between the second sample image and the previous frame image of the second sample image;
obtaining a second loss function according to the labeling information of the semantic segmentation of the sample non-key frame image and the second sample segmentation feature;
and adjusting the weight parameter of the second deep neural network according to the second loss function to obtain the trained second deep neural network.
Similarly, the non-key frame image $t+1$ is propagated forward once through the second deep neural network. The ConvLSTM unit takes as input the sample memory cell state $(H_t, C_t)$ and the segmentation feature $x_{t+1}$, and outputs the second sample memory cell state $(H_{t+1}, C_{t+1})$ and the second sample segmentation feature $o_{t+1}$ obtained by the memory unit. The parameters of the second deep neural network are updated in a manner similar to that of the first deep neural network.
An up-sampling operation is added on top of each of the first deep neural network and the second deep neural network to obtain segmentation maps of the same size as the original images, thereby making both the first deep neural network and the second deep neural network end-to-end.
According to the video semantic segmentation method provided by the embodiment of the invention, firstly, an image sequence is acquired according to a video image and divided into key frame images and non-key frame images; then the key frame images are input into a trained first deep neural network to obtain a first semantic segmentation result, and the non-key frame images are input into a trained second deep neural network to obtain a second semantic segmentation result; finally, the semantic segmentation result of the video image is obtained according to the first semantic segmentation result and the second semantic segmentation result. In the training process of the first deep neural network and the second deep neural network, a long short-term memory unit that caches the state of the image sequence at the previous moment is added on top of the full convolution network, and partial features of the previous frame image are transferred to the next frame image, so the sequential relation between preceding and following frames can be used to improve the segmentation performance on single frames. Meanwhile, the embodiment of the invention uses the first deep neural network to extract the segmentation feature information of key frame images, and the time sequence memory unit assists the second deep neural network in segmenting non-key frame images; by combining the first deep neural network and the second deep neural network, the overall computation of the video semantic segmentation task is reduced and time consumption is lowered.
The following describes the solution of the present application with a specific example.
Fig. 8 is a flow chart of a video semantic segmentation method according to another embodiment of the present invention, as shown in fig. 8, including:
s81, acquiring video images, and converting the video images into an image sequence.
Existing image semantic segmentation is usually aimed at single frame images. In a video image, however, each image is associated with the preceding and following frame images, and within an image sequence some features of these frames are identical; the embodiment of the invention exploits this similarity between the features of preceding and following frame images.
S82, dividing the image sequence into a key frame image and a non-key frame image.
Specifically, for the above image sequence, a key frame image t may be set every fixed number k of frames, with the non-key frame images lying between every two adjacent key frame images t.
S83, inputting the state of the first memory unit of the first image and the key frame image into the trained first deep neural network to obtain a first semantic segmentation result of the key frame image.
The trained first deep neural network comprises a first full convolution network and a first time sequence memory unit, the latter being specifically a ConvLSTM. A first memory unit state of a first image and the key frame image are input into the trained first deep neural network to obtain a first semantic segmentation result of the key frame image, wherein the first image is the previous frame image of the key frame image, and the first memory unit state is used for indicating the difference features between the first image and the previous frame image of the first image.
Specifically, inputting the key frame image into the first full convolution network to obtain a corresponding first semantic segmentation feature;
inputting the first semantic segmentation feature and the first memory unit state of the first image into the first convolution long-short-term memory network to obtain a first semantic segmentation result of the key frame image.
S84, inputting the second memory unit state of the second image and the non-key frame image into the trained second deep neural network to obtain a second semantic segmentation result of the non-key frame image.
A second memory unit state of a second image and the non-key frame image are input into the trained second deep neural network to obtain a second semantic segmentation result of the non-key frame image, wherein the second image is the previous frame image of the non-key frame image, and the second memory unit state is used for indicating the difference features between the second image and the previous frame image of the second image.
Specifically, inputting the non-key frame image into the second full convolution network to obtain a corresponding second semantic segmentation feature;
and inputting the second semantic segmentation feature and the second memory unit state of the second image into the second convolution long-short-term memory network to obtain a second semantic segmentation result of the non-key frame image.
S85, up-sampling is carried out on the first semantic segmentation result to obtain a first segmentation map, the first segmentation map is consistent with the size of the key frame image, up-sampling is carried out on the second semantic segmentation result to obtain a second segmentation map, and the second segmentation map is consistent with the size of the non-key frame image.
The purpose of up-sampling is to enlarge an image so that it can be displayed on a higher resolution display device. In the embodiment of the invention, the first semantic segmentation result and the second semantic segmentation result are up-sampled to obtain a first segmentation map consistent with the size of the key frame image and a second segmentation map consistent with the size of the non-key frame image. The semantic segmentation result of the video image is then obtained from the first segmentation map and the second segmentation map.
Fig. 9 is a schematic diagram of the video semantic segmentation framework provided in an embodiment of the present invention. As shown in fig. 9, the video image is divided into an image sequence (t, t+1, t+2, t+3, ...), where image t is a key frame image and images t+1, t+2 and t+3 are non-key frame images. The key frame image t is input into the first full convolution network to obtain a first semantic segmentation feature; the first semantic segmentation feature and the first memory unit state of the first image are then input into the first convolution long-short-term memory network to obtain the first semantic segmentation result of the key frame image t and the memory unit state of the first convolution long-short-term memory network, which is propagated forward once. The non-key frame image t+1 is input into the second full convolution network to obtain a second semantic segmentation feature; the memory unit state of the first convolution long-short-term memory network and the second semantic segmentation feature are then input into the second convolution long-short-term memory network to obtain the second semantic segmentation result of the non-key frame image t+1 and the memory unit state of the second convolution long-short-term memory network, which is in turn propagated forward once. The convolution long-short-term memory network caches the features of the corresponding image sequence; since some features are shared between preceding and following frame images in a video sequence, the semantic segmentation context information between them can be linked through the convolution long-short-term memory network and continuously propagated forward, so that partial features of a preceding frame image can be transferred to the following frame image. On this premise, the number of layers, or the number of channels per layer, of the first deep neural network is cut to obtain the second deep neural network. Although fewer features are extracted, each frame receives part of the features of the previous frame, so the method provided by the embodiment of the invention can reduce the computation and time consumption of the network while maintaining the performance of video semantic segmentation.
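A minimal sketch of this inference loop is given below, under the same hypothetical names as the earlier sketches; it assumes the two ConvLSTMs share the shape of the memory unit state, which fig. 9 implies since the state is handed from one network to the other.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def segment_video(frames, k, fcn1, convlstm1, fcn2, convlstm2, init_state):
    """Key frames go through the heavy network, non-key frames through the
    light one; the ConvLSTM state (H, C) is propagated across every frame."""
    state, results = init_state, []
    for t, frame in enumerate(frames):
        if t % k == 0:                                  # key frame image
            feat, state = convlstm1(fcn1(frame), state)
        else:                                           # non-key frame image
            feat, state = convlstm2(fcn2(frame), state)
        # Up-sample to the original frame size, then take per-pixel classes.
        seg = F.interpolate(feat, size=frame.shape[-2:], mode="bilinear",
                            align_corners=False).argmax(dim=1)
        results.append(seg)
    return results
```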
Fig. 10 is a schematic diagram of the segmentation result of the trained first deep neural network on a key frame image provided by an embodiment of the present invention, and fig. 11 is a schematic diagram of the segmentation result of the trained second deep neural network on a non-key frame image provided by an embodiment of the present invention. In fig. 10 and fig. 11, the key frame image is the previous frame image of the non-key frame image; during semantic segmentation, the segmentation result of the trained first deep neural network on the key frame image is passed to the trained second deep neural network to help it segment the non-key frame image.
According to the video semantic segmentation method provided by the embodiment of the invention, firstly, an image sequence is acquired according to a video image and divided into key frame images and non-key frame images; then the key frame images are input into a trained first deep neural network to obtain a first semantic segmentation result, and the non-key frame images are input into a trained second deep neural network to obtain a second semantic segmentation result; finally, the semantic segmentation result of the video image is obtained according to the first semantic segmentation result and the second semantic segmentation result. In the training process of the first deep neural network and the second deep neural network, a long short-term memory unit that caches the state of the image sequence at the previous moment is added on top of the full convolution network, and partial features of the previous frame image are transferred to the next frame image, so the sequential relation between preceding and following frames can be used to improve the segmentation performance on single frames. Meanwhile, the embodiment of the invention uses the first deep neural network to extract the segmentation feature information of key frame images, and the time sequence memory unit assists the second deep neural network in segmenting non-key frame images; by combining the first deep neural network and the second deep neural network, the overall computation of the video semantic segmentation task is reduced and time consumption is lowered.
Fig. 12 is a schematic structural diagram of a video semantic segmentation apparatus according to an embodiment of the present invention, as shown in fig. 12, including an acquisition module 121, a processing module 122, and a segmentation module 123, where:
the acquiring module 121 is configured to acquire an image sequence according to a video image, where the image sequence includes a key frame image and a non-key frame image;
the processing module 122 is configured to input the key frame image into a trained first deep neural network to obtain a first semantic segmentation result, and input the non-key frame image into a trained second deep neural network to obtain a second semantic segmentation result, where the trained first deep neural network includes a first full convolutional network, the trained second deep neural network includes a second full convolutional network, and the number of channels of the first full convolutional network is greater than the number of channels of the second full convolutional network;
the segmentation module 123 is configured to obtain a semantic segmentation result of the video image according to the first semantic segmentation result and the second semantic segmentation result.
In one possible implementation, the trained first deep neural network further includes a first time sequence memory unit, and the trained second deep neural network further includes a second time sequence memory unit; the processing module 122 is specifically configured to:
inputting a first memory unit state of a first image and the key frame image into the trained first deep neural network to obtain a first semantic segmentation result of the key frame image, wherein the first image is a previous frame image of the key frame image, and the first memory unit state is used for indicating the difference features between the first image and the previous frame image of the first image;
and inputting a second memory unit state of a second image and the non-key frame image into the trained second deep neural network to obtain a second semantic segmentation result of the non-key frame image, wherein the second image is a previous frame image of the non-key frame image, and the second memory unit state is used for indicating the difference features between the second image and the previous frame image of the second image.
In one possible implementation, the first time sequence memory unit is specifically a first convolution long-short-term memory network, and the processing module 122 is specifically further configured to:
inputting the key frame image into the first full convolution network to obtain a corresponding first semantic segmentation feature;
inputting the first semantic segmentation feature and the first memory unit state of the first image into the first convolution long-short-term memory network to obtain a first semantic segmentation result of the key frame image;
The second time sequence memory unit is specifically a second convolution long-short-term memory network, and the processing module is specifically further configured to:
inputting the non-key frame image into the second full convolution network to obtain a corresponding second semantic segmentation feature;
and inputting the second semantic segmentation feature and the second memory unit state of the second image into the second convolution long-short-term memory network to obtain a second semantic segmentation result of the non-key frame image.
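A minimal convolutional long-short-term memory cell of the kind described above can be sketched in PyTorch as follows. This is an illustrative implementation of the standard ConvLSTM formulation; the class name, kernel size, and channel counts are assumptions of the sketch, not specifics of the embodiment.

    import torch
    import torch.nn as nn

    class ConvLSTMCell(nn.Module):
        # Convolutional LSTM cell: the gates are computed with convolutions,
        # so the spatial layout of the segmentation features is preserved.
        def __init__(self, in_channels, hidden_channels, kernel_size=3):
            super().__init__()
            padding = kernel_size // 2
            # One convolution produces all four gates (input, forget, output, candidate).
            self.gates = nn.Conv2d(in_channels + hidden_channels,
                                   4 * hidden_channels, kernel_size, padding=padding)
            self.hidden_channels = hidden_channels

        def forward(self, x, state=None):
            if state is None:                 # first frame: zero-initialise h and c
                b, _, h, w = x.shape
                zeros = x.new_zeros(b, self.hidden_channels, h, w)
                state = (zeros, zeros)
            h_prev, c_prev = state
            gates = self.gates(torch.cat([x, h_prev], dim=1))
            i, f, o, g = torch.chunk(gates, 4, dim=1)
            c = torch.sigmoid(f) * c_prev + torch.sigmoid(i) * torch.tanh(g)
            h = torch.sigmoid(o) * torch.tanh(c)
            # h plays the role of the semantic segmentation result feature;
            # (h, c) is the memory unit state that is propagated forward.
            return h, (h, c)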
In one possible implementation, the segmentation module 123 is specifically configured to:
upsampling the first semantic segmentation result to obtain a first segmentation map, wherein the first segmentation map is consistent with the key frame image in size;
upsampling the second semantic segmentation result to obtain a second segmentation map, wherein the second segmentation map is consistent with the non-key frame image in size;
and obtaining the video image semantic segmentation result according to the first segmentation map and the second segmentation map.
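The upsampling step described above may be sketched as follows. Bilinear interpolation is an assumption of this sketch; the embodiment only requires that the segmentation result be restored to the size of the input image.

    import torch.nn.functional as F

    def to_segmentation_map(logits, image_size):
        # Restore the low-resolution segmentation result to the frame size,
        # then take the per-pixel argmax as the class label map.
        logits = F.interpolate(logits, size=image_size,
                               mode='bilinear', align_corners=False)
        return logits.argmax(dim=1)           # [B, H, W], same H and W as the frame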
In one possible implementation manner, the device further comprises a training module, wherein the training module is used for:
obtaining a sample image sequence and a sample labeling result, wherein the sample image sequence comprises a sample key frame image and a sample non-key frame image, and the sample labeling result comprises labeling information of semantic segmentation of the sample key frame image and the sample non-key frame image;
obtaining the trained first deep neural network according to the first deep neural network, the sample key frame image and the labeling information of the semantic segmentation of the sample key frame image, wherein the first deep neural network is constructed by a first full convolution network and a first time sequence memory unit;
processing the first deep neural network to obtain a second deep neural network;
and obtaining the trained second deep neural network according to the second deep neural network, the sample non-key frame image and the labeling information of the semantic segmentation of the sample non-key frame image.
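The ordering of these training steps can be summarized in the following sketch; the helper names prune_fn and train_fn, the (frame, label) sample format, and the key frame interval are placeholders of the sketch, not part of the claimed training procedure.

    def train_pipeline(first_net, samples, prune_fn, train_fn, key_interval=4):
        # samples: list of (frame, label) pairs in temporal order.
        key     = [s for i, s in enumerate(samples) if i % key_interval == 0]
        non_key = [s for i, s in enumerate(samples) if i % key_interval != 0]
        train_fn(first_net, key)            # train the first deep neural network
        second_net = prune_fn(first_net)    # cut channels/layers to get the second network
        train_fn(second_net, non_key)       # train the second deep neural network
        return first_net, second_net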
In a possible implementation manner, the training module is specifically further configured to:
inputting a sample memory unit state of a first sample image and the sample key frame image into the first deep neural network to obtain a first sample memory unit state of the sample key frame image and a first sample segmentation feature, wherein the first sample image is a previous frame image of the sample key frame image, and the sample memory unit state of the first sample image is used for indicating a difference feature between the first sample image and the previous frame image of the first sample image;
Obtaining a first loss function according to the labeling information of the semantic segmentation of the sample key frame image and the first sample segmentation feature;
and adjusting the weight parameter of the first deep neural network according to the first loss function to obtain the trained first deep neural network.
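One training step of the first deep neural network might look like the following sketch; cross-entropy is an assumption here, since the patent only requires a loss computed between the sample segmentation feature and the annotation information. The training step for the second deep neural network is analogous.

    import torch.nn.functional as F

    def train_step(net, optimizer, prev_state, frame, label):
        # Forward with the previous frame's memory unit state.
        seg_feature, new_state = net(frame, prev_state)
        logits = F.interpolate(seg_feature, size=label.shape[-2:],
                               mode='bilinear', align_corners=False)
        loss = F.cross_entropy(logits, label)     # first loss function
        optimizer.zero_grad()
        loss.backward()                           # adjust the weight parameters
        optimizer.step()
        # Detach so gradients do not flow across the whole video sequence.
        return loss.item(), tuple(s.detach() for s in new_state)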
In a possible implementation manner, the training module is specifically further configured to:
inputting a sample memory unit state of a second sample image and the sample non-key frame image into the second deep neural network to obtain a second sample memory unit state of the sample non-key frame image and a second sample segmentation feature, wherein the second sample image is a previous frame image of the sample non-key frame image, and the sample memory unit state of the second sample image is used for indicating a difference feature between the second sample image and the previous frame image of the second sample image;
obtaining a second loss function according to the labeling information of the semantic segmentation of the sample non-key frame image and the second sample segmentation feature;
and adjusting the weight parameter of the second deep neural network according to the second loss function to obtain the trained second deep neural network.
In a possible implementation manner, the training module is specifically further configured to:
cutting the channel number of the first full-convolution network and/or the convolution layer number of the first full-convolution network to obtain the second full-convolution network;
and obtaining the second deep neural network based on the second full convolution network and the second time sequence memory unit.
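As an illustration of the cutting step, the second full convolution network can be derived from the first by reducing the channel widths and dropping a layer. This is a toy sketch; the actual backbone depth, widths, and cutting ratio are design choices that the embodiment does not fix.

    import torch.nn as nn

    def make_fcn(channels, in_ch=3):
        # Build a toy fully convolutional backbone from a list of channel widths.
        layers = []
        for out_ch in channels:
            layers += [nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True)]
            in_ch = out_ch
        return nn.Sequential(*layers)

    first_fcn  = make_fcn([64, 128, 256, 256])   # first (key frame) network: full width
    second_fcn = make_fcn([32, 64, 128])         # channels halved and one layer cut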
The device provided by the embodiment of the invention can be used for executing the technical scheme of the embodiment of the method, and the implementation principle and the technical effect are similar, and are not repeated here.
Fig. 13 is a schematic diagram of the hardware structure of a video semantic segmentation device according to an embodiment of the present invention. As shown in Fig. 13, the video semantic segmentation device includes: at least one processor 131 and a memory 132, where the processor 131 and the memory 132 are connected by a bus 133.
Optionally, the video semantic segmentation device further includes a communication component. For example, the communication component may include a receiver and/or a transmitter.
In a specific implementation, at least one processor 131 executes computer-executable instructions stored in the memory 132, so that the at least one processor 131 performs the video semantic segmentation method as described above.
For the specific implementation process of the processor 131, reference may be made to the above method embodiments; the implementation principles and technical effects are similar and are not described here again.
In the embodiment shown in Fig. 13, it should be understood that the processor may be a central processing unit (CPU), another general purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), or the like. A general purpose processor may be a microprocessor, or the processor may be any conventional processor. The steps of the method disclosed in connection with the present invention may be embodied directly as being executed by a hardware processor, or executed by a combination of hardware and software modules in a processor.
The memory may include a high-speed RAM memory, and may further include a non-volatile memory (NVM), such as at least one magnetic disk memory.
The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of representation, the buses in the drawings of the present application are not limited to only one bus or one type of bus.
The application also provides a computer readable storage medium having stored therein computer executable instructions that when executed by a processor implement the video semantic segmentation method as described above.
The computer readable storage medium described above may be implemented by any type of volatile or non-volatile memory device or combination thereof, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk, or optical disk. A readable storage medium can be any available medium that can be accessed by a general purpose or special purpose computer.
An exemplary readable storage medium is coupled to the processor such that the processor can read information from, and write information to, the readable storage medium. Alternatively, the readable storage medium may be integral to the processor. The processor and the readable storage medium may reside in an application specific integrated circuit (ASIC). The processor and the readable storage medium may also reside as discrete components in a device.
The division of the units is merely a division by logical function, and there may be other division manners in actual implementation; for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections via some interfaces, devices, or units, and may be in electrical, mechanical, or other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
Those of ordinary skill in the art will appreciate that: all or part of the steps for implementing the method embodiments described above may be performed by hardware associated with program instructions. The foregoing program may be stored in a computer readable storage medium. The program, when executed, performs steps including the method embodiments described above; and the aforementioned storage medium includes: various media that can store program code, such as ROM, RAM, magnetic or optical disks.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the invention.

Claims (10)

1. A method for semantic segmentation of video, comprising:
acquiring an image sequence according to a video image, wherein the image sequence comprises a key frame image and a non-key frame image;
inputting the key frame image into a trained first deep neural network to obtain a first semantic segmentation result, inputting the non-key frame image into a trained second deep neural network to obtain a second semantic segmentation result, wherein the trained first deep neural network comprises a first full-convolution network, the trained second deep neural network comprises a second full-convolution network, the number of channels of the first full-convolution network is larger than that of the second full-convolution network, the trained first deep neural network further comprises a first time sequence memory unit, the trained second deep neural network further comprises a second time sequence memory unit, the first time sequence memory unit is specifically a first convolution long-short-term memory network, and the second time sequence memory unit is specifically a second convolution long-short-term memory network;
obtaining a semantic segmentation result of the video image according to the first semantic segmentation result and the second semantic segmentation result;
wherein the inputting the key frame image into the trained first deep neural network to obtain the first semantic segmentation result, and inputting the non-key frame image into the trained second deep neural network to obtain the second semantic segmentation result comprises:
inputting a first memory unit state of a first image and the key frame image into the trained first deep neural network to obtain a first semantic segmentation result of the key frame image, wherein the first image is a previous frame image of the key frame image, and the first memory unit state is used for indicating the difference features between the first image and the previous frame image of the first image;
and inputting a second memory unit state of a second image and the non-key frame image into the trained second deep neural network to obtain a second semantic segmentation result of the non-key frame image, wherein the second image is a previous frame image of the non-key frame image, and the second memory unit state is used for indicating the difference features between the second image and the previous frame image of the second image.
2. The method of claim 1, wherein inputting the first memory cell state of the first image and the key frame image into the trained first deep neural network to obtain a first semantic segmentation result of the key frame image comprises:
inputting the key frame image into the first full convolution network to obtain a corresponding first semantic segmentation feature;
Inputting the first semantic segmentation feature and the first memory unit state of the first image into the first convolution long-short-term memory network to obtain a first semantic segmentation result of the key frame image;
wherein the inputting the second memory unit state of the second image and the non-key frame image into the trained second deep neural network to obtain the second semantic segmentation result of the non-key frame image comprises:
inputting the non-key frame image into the second full convolution network to obtain a corresponding second semantic segmentation feature;
and inputting the second semantic segmentation feature and the second memory unit state of the second image into the second convolution long-short-term memory network to obtain a second semantic segmentation result of the non-key frame image.
3. The method of claim 1, wherein the obtaining the semantic segmentation result of the video image from the first semantic segmentation result and the second semantic segmentation result comprises:
upsampling the first semantic segmentation result to obtain a first segmentation map, wherein the first segmentation map is consistent with the key frame image in size;
Upsampling the second semantic segmentation result to obtain a second segmentation map, wherein the second segmentation map is consistent with the non-key frame image in size;
and obtaining the video image semantic segmentation result according to the first segmentation map and the second segmentation map.
4. The method of claim 1, wherein the trained first deep neural network and the trained second deep neural network are obtained by:
obtaining a sample image sequence and a sample labeling result, wherein the sample image sequence comprises a sample key frame image and a sample non-key frame image, and the sample labeling result comprises labeling information of semantic segmentation of the sample key frame image and the sample non-key frame image;
obtaining the trained first deep neural network according to the first deep neural network, the sample key frame image and the labeling information of semantic segmentation of the sample key frame image, wherein the first deep neural network is constructed by a first full convolution network and a first time sequence memory unit;
processing the first deep neural network to obtain a second deep neural network;
and obtaining the trained second deep neural network according to the second deep neural network, the sample non-key frame image and the labeling information of the semantic segmentation of the sample non-key frame image.
5. The method of claim 4, wherein the obtaining the trained first deep neural network according to the first deep neural network, the sample key frame image, and the labeling information of the semantic segmentation of the sample key frame image comprises:
inputting a sample memory unit state of a first sample image and the sample key frame image into the first deep neural network to obtain a first sample memory unit state of the sample key frame image and a first sample segmentation feature, wherein the first sample image is a previous frame image of the sample key frame image, and the sample memory unit state of the first sample image is used for indicating a difference feature between the first sample image and the previous frame image of the first sample image;
obtaining a first loss function according to the labeling information of the semantic segmentation of the sample key frame image and the first sample segmentation feature;
and adjusting the weight parameter of the first deep neural network according to the first loss function to obtain the trained first deep neural network.
6. The method of claim 4, wherein the obtaining the trained second deep neural network according to the second deep neural network, the sample non-key frame image, and the labeling information of the semantic segmentation of the sample non-key frame image comprises:
inputting a sample memory unit state of a second sample image and the sample non-key frame image into the second deep neural network to obtain a second sample memory unit state of the sample non-key frame image and a second sample segmentation feature, wherein the second sample image is a previous frame image of the sample non-key frame image, and the sample memory unit state of the second sample image is used for indicating a difference feature between the second sample image and the previous frame image of the second sample image;
obtaining a second loss function according to the labeling information of the semantic segmentation of the sample non-key frame image and the second sample segmentation feature;
and adjusting the weight parameter of the second deep neural network according to the second loss function to obtain the trained second deep neural network.
7. The method of claim 4, wherein the processing the first deep neural network to obtain a second deep neural network comprises:
Cutting the channel number of the first full-convolution network and/or the convolution layer number of the first full-convolution network to obtain the second full-convolution network;
and obtaining the second deep neural network based on the second full convolution network and the second time sequence memory unit.
8. A video semantic segmentation apparatus, comprising:
the acquisition module is used for acquiring an image sequence according to the video image, wherein the image sequence comprises a key frame image and a non-key frame image;
the processing module is used for inputting the key frame image into a trained first deep neural network to obtain a first semantic segmentation result, inputting the non-key frame image into a trained second deep neural network to obtain a second semantic segmentation result, wherein the trained first deep neural network comprises a first full-convolution network, the trained second deep neural network comprises a second full-convolution network, the channel number of the first full-convolution network is larger than that of the second full-convolution network, the trained first deep neural network further comprises a first time sequence memory unit, the trained second deep neural network further comprises a second time sequence memory unit, the first time sequence memory unit is a first convolution long-short-term memory network, and the second time sequence memory unit is a second convolution long-short-term memory network;
The segmentation module is used for obtaining the semantic segmentation result of the video image according to the first semantic segmentation result and the second semantic segmentation result;
the processing module is specifically configured to:
inputting a first memory unit state of a first image and the key frame image into the trained first deep neural network to obtain a first semantic segmentation result of the key frame image, wherein the first image is a previous frame image of the key frame image, and the first memory unit state is used for indicating the difference features between the first image and the previous frame image of the first image;
and inputting a second memory unit state of a second image and the non-key frame image into the trained second deep neural network to obtain a second semantic segmentation result of the non-key frame image, wherein the second image is a previous frame image of the non-key frame image, and the second memory unit state is used for indicating the difference features between the second image and the previous frame image of the second image.
9. A video semantic segmentation device, comprising: at least one processor and memory;
The memory stores computer-executable instructions;
the at least one processor executes the computer-executable instructions stored in the memory, causing the at least one processor to perform the video semantic segmentation method of any one of claims 1 to 7.
10. A computer readable storage medium having stored therein computer executable instructions which, when executed by a processor, implement the video semantic segmentation method of any one of claims 1 to 7.
CN201910840038.2A 2019-09-06 2019-09-06 Video semantic segmentation method and device Active CN112465826B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910840038.2A CN112465826B (en) 2019-09-06 2019-09-06 Video semantic segmentation method and device

Publications (2)

Publication Number Publication Date
CN112465826A CN112465826A (en) 2021-03-09
CN112465826B true CN112465826B (en) 2023-05-16

Family

ID=74806777

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107480726A (en) * 2017-08-25 2017-12-15 电子科技大学 A kind of Scene Semantics dividing method based on full convolution and shot and long term mnemon
CN108235116A (en) * 2017-12-27 2018-06-29 北京市商汤科技开发有限公司 Feature propagation method and device, electronic equipment, program and medium
CN108921196A (en) * 2018-06-01 2018-11-30 南京邮电大学 A kind of semantic segmentation method for improving full convolutional neural networks
CN109711413A (en) * 2018-12-30 2019-05-03 陕西师范大学 Image, semantic dividing method based on deep learning
CN110147763A (en) * 2019-05-20 2019-08-20 哈尔滨工业大学 Video semanteme dividing method based on convolutional neural networks

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Andreas Pfeuffer, "Semantic Segmentation of Video Sequences with Convolutional LSTMs", arXiv:1905.01058v1 [cs.CV], 2019-05-03, pp. 1-7 *
Mohsen Fayyaz et al., "STFCN: Spatio-Temporal FCN", arXiv:1608.05971v2 [cs.CV], 2016-09-02, pp. 1-17 *

Legal Events

PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant