CN109919044A - Video semantic segmentation method and device for performing feature propagation based on prediction - Google Patents

Video semantic segmentation method and device for performing feature propagation based on prediction

Info

Publication number
CN109919044A
CN109919044A (application CN201910120021.XA)
Authority
CN
China
Prior art keywords
frame
feature
order
semantic
key
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910120021.XA
Other languages
Chinese (zh)
Inventor
鲁继文
周杰
朱文成
饶永铭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN201910120021.XA priority Critical patent/CN109919044A/en
Publication of CN109919044A publication Critical patent/CN109919044A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a video semantic segmentation method and device that perform feature propagation based on prediction. The method comprises: predicting the semantic difference between video frames with a shallow neural network to obtain a plurality of key frames and a plurality of non-key frames among the video frames; obtaining the high-order semantic features of the plurality of key frames with a picture semantic segmentation network, and predicting the high-order semantic features of the plurality of non-key frames according to the temporal information of the high-order semantic features; and classifying the high-order semantic features of the key frames and non-key frames and sampling them to a preset size to generate the video semantic segmentation result. The method makes no assumptions relating high-order and low-order features; video semantic segmentation is obtained through prediction and fine-tuning, which reduces the time complexity of the algorithm while guaranteeing segmentation accuracy.

Description

Video semantic segmentation method and device for performing feature propagation based on prediction
Technical field
The present invention relates to the technical field of video frame feature propagation, and in particular to a video semantic segmentation method and device that perform feature propagation based on prediction.
Background technique
Feature propagation technology is of crucial importance for real-time tasks. It reuses previously computed features and, taking the temporal continuity of sequence data into account, propagates them to subsequent time steps in order to obtain the features of those steps. Feature propagation can therefore significantly reduce the time complexity of feature extraction for sequence data, while the temporal information it exploits keeps the obtained features highly accurate. The technology can be applied to sequence data tasks such as video and audio. In this patent, the semantic segmentation task on video data is taken as an example to illustrate the proposed video semantic segmentation technique that performs feature propagation based on prediction.
The main goal of the semantic segmentation task is to divide the scene in a picture into different semantic regions. Semantic segmentation has been widely applied in many practical tasks: in autonomous driving, to segment foreground objects such as roads, pedestrians, trees, sky and buildings as well as background information; in medical imaging, to segment tumor regions; in robotics, to provide thorough perception of the scene; and so on.
Semantic segmentation plays an important role in deep scene understanding. Compared with picture-based semantic segmentation, video-based semantic segmentation is more widespread and natural in practical applications, because real applications are mostly based on video rather than still pictures.
The goal of the video semantic segmentation task is to obtain the semantic segmentation of every video frame. Directly applying picture semantic segmentation to every frame of the video is a straightforward approach, but it leads to an excessively high time complexity that cannot satisfy real-time requirements, and it also ignores the temporal information between video frames. For video semantic segmentation, both the processing speed and the accuracy of each frame matter, whereas picture-based semantic segmentation focuses only on segmentation accuracy. Another difficulty of video semantic segmentation is the lack of sufficient labeled data, so that the segmentation of every video frame cannot be supervised. It is therefore necessary to consider semantic segmentation techniques designed specifically for video.
Current research on video semantic segmentation can be roughly divided into two categories: non-propagation-based methods and propagation-based methods. Both exploit the continuity of video frames. Methods of the first category use the temporal information in video data to improve the segmentation accuracy of each frame; methods of the second category propagate high-order features using temporal information to reduce time complexity while maintaining segmentation accuracy. In the related art, considering that adding motion information in three-dimensional space makes it harder to discover the pixel correspondences between adjacent frames, temporal and spatial constraints were introduced to optimize the mapping from pixels to a Euclidean feature space, and a conditional random field was used for modeling to obtain the semantic segmentation. Other related work extracts the spatial information of video frames with a convolutional neural network, models the temporal information between frames with an LSTM (Long Short-Term Memory) network, and finally obtains the semantic segmentation with a classifier and a deconvolution neural network. Yet other related work first computes the optical flow between adjacent frames, uses it to estimate the corresponding high-order information, and combines this with the segmentation computed on the current frame to obtain a more accurate result. Methods of the first category mostly consider how to improve segmentation accuracy and do not consider how to reduce the time complexity of the algorithm. Methods of the second category mainly reduce the time complexity while preserving the performance of the algorithm as much as possible.
For video semantic segmentation, obtaining the high-order information of a video frame through a multi-layer deep convolutional network is particularly time-consuming. Avoiding the computation of high-order information for every video frame is therefore the key to reducing the time complexity of the algorithm and meeting real-time requirements. Meanwhile, within a video the change between adjacent frames is very small, so the corresponding high-order semantic information differs little. Propagation-based methods were therefore proposed; such methods mainly exploit existing high-order features and propagate or reuse them, avoiding the repeated computation of high-order features. One related technique divides a fully convolutional neural network into different sub-modules and reuses the features of certain sub-modules through adaptive scheduling; although this approach can reuse features, it copies them directly and ignores the differences between frames. Another related technique proposes an optical-flow-based method that propagates features rather than reusing them; this method assumes that low-order and high-order features share the same optical flow. It first computes the low-order optical flow between two adjacent frames and then applies it to the corresponding high-order features to obtain the high-order features of the next frame. Although this avoids computing high-order features, it introduces the complexity of computing optical flow. A further related technique proposes an approach based on a spatially invariant kernel; this method assumes that low-order and high-order features directly share a spatially invariant kernel rather than optical flow, thereby avoiding the complexity of computing optical flow. In this technique a spatially invariant kernel is computed from the low-order features of the current frame and the next frame and then applied to the high-order features of the current frame to obtain the high-order features of the next frame; in addition, an adaptive key frame selection strategy is proposed.
Besides the two main categories above, there are also methods that trade off computational complexity against accuracy. One related technique performs inference with a conditional random field on image-level semantic segmentations and can jointly reason over video frames; inferring the semantic segmentation of video frames by joint inference has been verified to be a relatively effective approach. Another related technique proposes a cost-sensitive semantic segmentation framework, which uses a visual attention model to select a subset of video frames and labels the remaining frames with an interpolation model. Both methods attempt to label all video frames simultaneously, but they require the entire video in advance and therefore cannot handle online video data well.
In addition, for the video semantic segmentation task there are two public datasets for training and testing: Cityscapes and CamVid. The Cityscapes dataset was established to improve semantic understanding of urban scenes. It contains 30 categories, of which 19 are used to evaluate segmentation performance, with a total of 5000 finely labeled images and 20000 coarsely labeled images. Each picture has a resolution of 1024*2048. The finely labeled data comprises 2975 training images, 500 validation images and 1525 test images. Each labeled image is the 20th frame selected from a continuous 30-frame video clip. The CamVid dataset is a driving-oriented segmentation and recognition dataset containing 701 pictures, each with a resolution of 720*960. It contains 32 categories, of which 11 are used for evaluation. The dataset is divided into a training set, a validation set and a test set containing 367, 100 and 233 pictures respectively.
In the related art, picture-based semantic segmentation methods were first migrated directly to video processing. Such methods achieve high accuracy, but they must process every frame of the video and cannot exploit its temporal information, so the time complexity of the algorithm is often rather high. Although non-propagation methods can achieve excellent performance, most of them focus only on the accuracy of the algorithm and ignore its efficiency. Propagation-based methods can reduce algorithm complexity while maintaining performance, but they rely on a strong prior assumption, namely that high-order and low-order features share structural information such as optical flow or kernels, thereby ignoring the semantic gap that exists between high-order and low-order features. Other video segmentation methods, such as those that label all frames simultaneously, need to obtain the entire video in advance for mutual supervision, but in practical applications the entire video is often unavailable.
Summary of the invention
The present invention aims to solve at least one of the technical problems in the related art.
To this end, one object of the present invention is to propose a video semantic segmentation method that performs feature propagation based on prediction. The method makes no assumptions relating high-order and low-order features; video semantic segmentation is obtained through prediction and fine-tuning.
Another object of the present invention is to propose a video semantic segmentation device that performs feature propagation based on prediction.
To achieve the above objects, an embodiment of one aspect of the present invention proposes a video semantic segmentation method that performs feature propagation based on prediction, comprising: predicting the semantic difference between video frames with a shallow neural network to obtain a plurality of key frames and a plurality of non-key frames among the video frames; obtaining the high-order semantic features of the plurality of key frames with a picture semantic segmentation network, and predicting the high-order semantic features of the plurality of non-key frames according to the temporal information of the high-order semantic features; and classifying the high-order semantic features of the plurality of key frames and the plurality of non-key frames and sampling them to a preset size to generate the video semantic segmentation result.
In the video semantic segmentation method of the embodiment of the present invention, an adaptive key frame selection method first judges whether the current frame is a new key frame. The selection of key frames ensures that subsequent frames are semantically similar to the key frame, and since the prediction of high-order features always propagates from the high-order information of this key frame, the accuracy of the video semantic segmentation obtained by prediction is ensured. Second, the prediction method exploits the temporal information of high-order features as well as the similarity of high-order features between successive video frames, so high-order features can be propagated effectively. Although the high-order features obtained by prediction capture the main semantic information of a video frame, they lack the fine spatial position relationships and edge information carried by low-order features. By fusing low-order features, fine-tuning the predicted high-order features helps them achieve finer results, improving the efficiency of the algorithm while guaranteeing accuracy.
In addition, the video semantic segmentation method of the above embodiment of the present invention may also have the following additional technical features:
Further, in one embodiment of the present invention, the method further comprises: generating and training the picture semantic segmentation network to perform semantic segmentation on each of the video frames.
Further, in one embodiment of the present invention, predicting the semantic difference between video frames with a shallow neural network to obtain the plurality of key frames and the plurality of non-key frames specifically comprises:
judging whether the semantic difference between the current frame and the previous key frame is greater than a preset threshold;
if it is greater than the preset threshold, the current frame is a new key frame; otherwise, the current frame is a non-key frame.
Further, in one embodiment of the present invention, obtaining the high-order semantic features of the plurality of key frames with the picture semantic segmentation network and predicting the high-order semantic features of the plurality of non-key frames according to the temporal information of the high-order semantic features specifically comprises:
if the current frame is a key frame, directly obtaining the high-order semantic features of the current frame with the semantic segmentation technique of the picture semantic segmentation network;
if the current frame is a non-key frame and the previous frame of the current frame is a key frame, predicting the high-order semantic features of the current frame from the high-order semantic features of that key frame;
if the current frame is a non-key frame and the previous frame of the current frame is also a non-key frame, predicting the high-order semantic features of the current frame from the high-order semantic features of the previous two frames.
Further, in one embodiment of the present invention, the method further comprises: generating the low-order features of the plurality of non-key frames with the spatial branch of the picture semantic segmentation network, and adjusting the predicted high-order semantic features of the plurality of non-key frames according to the low-order features to obtain accurate high-order semantic features for the plurality of non-key frames.
To achieve the above objects, an embodiment of another aspect of the present invention proposes a video semantic segmentation device that performs feature propagation based on prediction, comprising: a first obtaining module, for predicting the semantic difference between video frames with a shallow neural network to obtain a plurality of key frames and a plurality of non-key frames among the video frames; a second obtaining module, for obtaining the high-order semantic features of the plurality of key frames with a picture semantic segmentation network and predicting the high-order semantic features of the plurality of non-key frames according to the temporal information of the high-order semantic features; and a segmentation module, for classifying the high-order semantic features of the plurality of key frames and the plurality of non-key frames and sampling them to a preset size to generate the video semantic segmentation result.
In the video semantic segmentation device of the embodiment of the present invention, an adaptive key frame selection method first judges whether the current frame is a new key frame. The selection of key frames ensures that subsequent frames are semantically similar to the key frame, and since the prediction of high-order features always propagates from the high-order information of this key frame, the accuracy of the video semantic segmentation obtained by prediction is ensured. Second, the prediction method exploits the temporal information of high-order features as well as the similarity of high-order features between successive video frames, so high-order features can be propagated effectively. Although the high-order features obtained by prediction capture the main semantic information of a video frame, they lack the fine spatial position relationships and edge information carried by low-order features. By fusing low-order features, fine-tuning the predicted high-order features helps them achieve finer results, improving the efficiency of the algorithm while guaranteeing accuracy.
In addition, the video semantic segmentation device of the above embodiment of the present invention may also have the following additional technical features:
Further, in one embodiment of the present invention, the device further comprises a training module for generating and training the picture semantic segmentation network to perform semantic segmentation on each of the video frames.
Further, in one embodiment of the present invention, the first obtaining module comprises:
a judging unit, for judging whether the semantic difference between the current frame and the previous key frame is greater than a preset threshold;
a confirmation unit, for confirming that the current frame is a new key frame when the semantic difference between the current frame and the previous key frame is greater than the preset threshold; otherwise, the current frame is a non-key frame.
Further, in one embodiment of the present invention, the second obtaining module is specifically configured to:
if the current frame is a key frame, directly obtain the high-order semantic features of the current frame with the semantic segmentation technique of the picture semantic segmentation network;
if the current frame is a non-key frame and the previous frame of the current frame is a key frame, predict the high-order semantic features of the current frame from the high-order semantic features of that key frame;
if the current frame is a non-key frame and the previous frame of the current frame is also a non-key frame, predict the high-order semantic features of the current frame from the high-order semantic features of the previous two frames.
Further, in one embodiment of the present invention, the device further comprises an adjustment module for generating the low-order features of the plurality of non-key frames with the spatial branch of the picture semantic segmentation network, and adjusting the predicted high-order semantic features of the plurality of non-key frames according to the low-order features to obtain accurate high-order semantic features for the plurality of non-key frames.
Additional aspects and advantages of the present invention will be set forth in part in the following description, and in part will become apparent from the following description or be learned through practice of the invention.
Brief description of the drawings
The above and/or additional aspects and advantages of the present invention will become apparent and easily understood from the following description of the embodiments in conjunction with the accompanying drawings, in which:
Fig. 1 is a comparison between one embodiment of the invention and traditional propagation methods;
Fig. 2 is a flowchart of the video semantic segmentation method that performs feature propagation based on prediction according to one embodiment of the invention;
Fig. 3 is a temporal flowchart according to one embodiment of the invention;
Fig. 4 is a flowchart of high-order features over time according to one embodiment of the invention;
Fig. 5 shows key frame selection results according to one embodiment of the invention;
Fig. 6 is a structure diagram of the video semantic segmentation device that performs feature propagation based on prediction according to one embodiment of the invention.
Specific embodiment
The embodiments of the present invention are described in detail below, and examples of the embodiments are shown in the accompanying drawings, where the same or similar labels throughout denote the same or similar elements or elements with the same or similar functions. The embodiments described below with reference to the drawings are exemplary and intended to explain the present invention; they are not to be construed as limiting the invention.
The video semantic segmentation method and device that perform feature propagation based on prediction, proposed according to embodiments of the present invention, are described below with reference to the accompanying drawings, starting with the method.
As shown in Fig. 1, a comparison between the embodiment of the present invention and traditional propagation methods is illustrated. Fig. 1(a) is picture semantic segmentation based on a fully convolutional network; Fig. 1(b) and Fig. 1(c) show feature propagation methods based on optical flow and on a spatially invariant kernel, respectively; Fig. 1(d) is the method proposed by the embodiment of the present invention. The embodiment of the present invention first uses the temporal information of high-order semantic features to obtain the high-order semantic features of the current frame by prediction, and then uses low-order position and texture features to adjust the predicted high-order semantic features, so that the final high-order semantic features not only carry abstract semantics but also fuse spatial information such as position and texture. The advantage of this method is that it makes no assumptions relating high-order and low-order features; video semantic segmentation is obtained through prediction and fine-tuning.
Fig. 2 is the video semanteme dividing method process that feature propagation is carried out based on prediction according to one embodiment of the invention Figure.
As shown in Fig. 2, the video semantic segmentation method that performs feature propagation based on prediction comprises the following steps:
In step S101, the semantic difference between video frames is predicted with a shallow neural network to obtain a plurality of key frames and a plurality of non-key frames among the video frames.
Further, in one embodiment of the present invention, predicting the semantic difference between video frames with a shallow neural network to obtain the plurality of key frames and the plurality of non-key frames specifically comprises:
judging whether the semantic difference between the current frame and the previous key frame is greater than a preset threshold;
if it is greater than the preset threshold, the current frame is a new key frame; otherwise, the current frame is a non-key frame.
Specifically, the selection of key frames guarantees the compactness between video frames, i.e., the video frames between key frames are highly similar, which guarantees the accuracy of high-order features obtained by prediction or propagation. The traditional key frame selection method chooses a key frame every fixed number of frames; this cannot reflect the semantic differences between video frames.
Further, an adaptive key frame selection method is used to choose key frames, while the high-order semantic information of a non-key frame is obtained by propagating the high-order semantic information of the previous key frame. This ensures the accuracy of the proposed video semantic segmentation and reduces the time complexity of the algorithm while guaranteeing segmentation accuracy.
The embodiment of the present invention uses a shallow neural network to predict the abstract semantic difference between video frames and dynamically select key frames. The network consists of two convolutional layers with 256 channels, a global average pooling layer and a fully connected layer, and regresses the deviation of high-order semantic information between video frames. The input of the network is the low-order features of two video frames obtained by the spatial branch of the base network; the output of the network is the semantic deviation between the two video frames.
If the deviation of high-order semantic information between the previous key frame and the current frame predicted by the network exceeds a threshold, the current frame is judged to be a new key frame; otherwise it is judged not to be a key frame. Fig. 5 illustrates key frame selection results, where frames whose deviation exceeds the threshold are all judged to be key frames. As the frame-number gap between the current frame and the previous key frame increases, the deviation of high-order semantic information also gradually increases; when the semantic deviation between the two is greater than the threshold, the current frame is judged to be a new key frame, and when it is less than the threshold, the current frame is judged to be a non-key frame. The selection of key frames and non-key frames directly affects the acquisition of high-order features. This key frame selection method guarantees that the video frames between key frames are highly similar, which also ensures that obtaining the high-order semantic information of video frames by prediction is reasonable.
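The adaptive selection rule above can be sketched as follows; the `deviation` callback and the threshold value here are stand-ins for the trained shallow network and its tuned threshold, which the patent does not specify numerically:

```python
def select_key_frames(frames, deviation, threshold):
    """Adaptive key-frame selection: a frame becomes a new key frame when
    its predicted semantic deviation from the previous key frame exceeds
    the threshold; otherwise it is a non-key frame.

    `deviation(a, b)` stands in for the shallow network (two 256-channel
    conv layers, global average pooling, one fully connected layer) that
    regresses the high-order semantic deviation between two frames."""
    if not frames:
        return []
    labels = ["key"]          # the first frame is always a key frame
    last_key = frames[0]
    for frame in frames[1:]:
        if deviation(last_key, frame) > threshold:
            labels.append("key")
            last_key = frame  # deviation is now measured from this frame
        else:
            labels.append("non-key")
    return labels


# Toy run: frames are scalars and "semantic deviation" is their absolute
# difference, mimicking deviation that grows with distance from the key frame.
frames = [0.0, 0.1, 0.2, 0.6, 0.7, 1.3]
labels = select_key_frames(frames, lambda a, b: abs(a - b), threshold=0.5)
print(labels)  # ['key', 'non-key', 'non-key', 'key', 'non-key', 'key']
```

Note how the reference point resets at every new key frame, so deviation is always measured against the most recent key frame rather than the immediately preceding frame.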
In step S102, the high-order semantic features of the plurality of key frames are obtained with the picture semantic segmentation network, and the high-order semantic features of the plurality of non-key frames are predicted according to the temporal information of the high-order semantic features.
Further, in one embodiment of the present invention, the method further comprises: generating and training the picture semantic segmentation network to perform semantic segmentation on each video frame.
Specifically, the video semantic segmentation algorithm needs to propagate accurate high-order features in order to obtain accurate prediction results. The method of the embodiment of the present invention requires a pre-trained network capable of performing semantic segmentation on every video frame, based on which the high-order features of key frames are extracted. In the embodiment of the present invention the BiSeNet framework is chosen as the base picture semantic segmentation network. This network has two branches: the first is the spatial branch, which extracts the low-order spatial and position features of a picture; the other is the context semantic branch, which obtains the contextual semantic information of the picture. The final high-order features are obtained by fusing the spatial branch and the context semantic branch. The BiSeNet framework is chosen because the features extracted by the two branches are complementary: the high-order semantic information of video frames is similar across frames, so the main semantic information of a video frame can be obtained by prediction, while the specific position and spatial information can be obtained by fusing low-order features. Thanks to this complementarity, computing high-order features can be avoided while both contextual semantic information and spatial information are retained. Moreover, the two branches can be effectively parallelized on hardware devices.
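As a rough illustration of this two-branch design, the sketch below fuses a spatial-branch output with a context-branch output into a combined feature. The branch bodies are placeholder transforms (the raw values and a global average), not the actual BiSeNet layers:

```python
def spatial_branch(image):
    """Placeholder for the spatial branch: preserves fine low-order
    position/texture information (here, simply the raw values)."""
    return [float(v) for v in image]

def context_branch(image):
    """Placeholder for the context semantic branch: produces abstract
    contextual semantics (here, a crude global summary per element)."""
    mean = sum(image) / len(image)
    return [mean] * len(image)

def fuse(spatial, context):
    """Fusion of the two complementary branches into the high-order
    feature used for segmentation (here, element-wise pairing)."""
    return list(zip(spatial, context))

image = [1.0, 2.0, 3.0, 6.0]
high_order = fuse(spatial_branch(image), context_branch(image))
print(high_order)  # [(1.0, 3.0), (2.0, 3.0), (3.0, 3.0), (6.0, 3.0)]
```

The point of the structure is that each fused element carries both local detail (first component) and global context (second component), mirroring why the two branches are complementary rather than redundant.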
Further, in one embodiment of the present invention, obtaining the high-order semantic features of the plurality of key frames with the picture semantic segmentation network and predicting the high-order semantic features of the plurality of non-key frames according to the temporal information of the high-order semantic features specifically comprises:
if the current frame is a key frame, directly obtaining the high-order semantic features of the current frame with the semantic segmentation technique of the picture semantic segmentation network;
if the current frame is a non-key frame and the previous frame of the current frame is a key frame, predicting the high-order semantic features of the current frame from the high-order semantic features of that key frame;
if the current frame is a non-key frame and the previous frame of the current frame is also a non-key frame, predicting the high-order semantic features of the current frame from the high-order semantic features of the previous two frames.
Further, predicting the high-order semantic information of the current frame from the high-order semantic information of its previous two frames avoids computing the high-order features of the current frame, and exploits the relationship of high-order features over time, i.e., the motion information of the high-order features.
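The three cases above define, for every frame, where its high-order feature comes from. A minimal scheduling sketch, which takes the frame types as given and uses descriptive labels in place of the actual network and prediction operators:

```python
def feature_sources(frame_types):
    """For each frame, decide how its high-order semantic feature is
    obtained, following the three cases: a key frame runs the full
    segmentation network; a non-key frame immediately after a key frame
    is predicted from that key frame; any later non-key frame is
    predicted from the high-order features of the previous two frames
    (exploiting their motion over time)."""
    sources = []
    for i, t in enumerate(frame_types):
        if t == "key":
            sources.append("full_network")
        elif i > 0 and frame_types[i - 1] == "key":
            sources.append("predict_from_key_frame")
        else:
            sources.append("predict_from_previous_two")
    return sources


print(feature_sources(["key", "non-key", "non-key", "key", "non-key"]))
# ['full_network', 'predict_from_key_frame', 'predict_from_previous_two',
#  'full_network', 'predict_from_key_frame']
```

Only key frames pay the cost of a full network pass, which is exactly where the claimed reduction in time complexity comes from.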
Specifically, as shown in Fig. 1 and Fig. 3, for the video frame at time t, if it is judged to be a key frame, its accurate high-order feature Ft is obtained by the underlying semantic segmentation technique. It should be noted that the high-order feature of a key frame is not obtained by prediction but by a full pass through the base network. If the frame at time t+1 is also judged to be a key frame, its high-order feature is likewise obtained accurately through the full base network, as at time t. Otherwise, if the frame at time t+1 is not judged to be a key frame, its high-order feature must be predicted from the high-order feature at time t, while the low-order feature obtained from the spatial branch of the base network is used to fuse and fine-tune the predicted high-order feature, giving it finer spatial and positional information. The high-order feature Ft+1 at time t+1 is finally obtained. Predicting the high-order features of non-key frames with this model exploits the temporal continuity of video data.
The feature-propagation-based method assumes that if two frames are similar, their corresponding high-order features are also similar:
|f(x1) - f(x2)| < K|x1 - x2|
where x1 and x2 are two given pictures, f(x1) and f(x2) are their corresponding high-order features, and K is a constant. Given a video frame, if the previous stage S0 judges it to be a key frame, the picture passes through the entire base semantic segmentation network to obtain its high-order feature Ft.
Further, one embodiment of the invention further includes: generating the low-order features of the multiple non-key frames through the spatial branch of the picture semantic segmentation network, and adjusting the predicted high-order semantic features of the multiple non-key frames according to the low-order features, so as to obtain accurate high-order semantic features of the multiple non-key frames.
The low-order features include the low-order position and texture features of the non-key frame; they carry more spatial-position information and fine edge information, which helps to obtain a more accurate semantic segmentation.
Further, although the high-order semantic information obtained by prediction for the current frame captures its main semantic features, it lacks spatial information such as position and texture. The method of this embodiment therefore fuses the low-order features of the current frame to adjust the predicted high-order semantic feature, so that the final high-order feature of the current frame contains more spatial-position information and fine edge information, which helps to obtain a more accurate semantic segmentation.
If a video frame is not judged to be a key frame, the high-order feature of the current frame is first predicted from the high-order features of its two preceding frames; at the same time the frame passes through the spatial branch of the base network to obtain its low-order feature. The low-order feature is used to fuse and fine-tune the predicted high-order feature, yielding a high-order feature that carries both contextual information and spatial information.
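A minimal sketch of this fusion fine-tuning, assuming (for illustration only) a fixed blending weight in place of the learned fusion of the embodiment:

```python
import numpy as np

def refine(pred_high, low_order, w=0.2):
    # Blend the predicted high-order feature with the current frame's
    # low-order feature as a residual correction, restoring spatial and
    # edge detail. The real network learns this fusion; the fixed weight
    # w is only an illustrative stand-in.
    assert pred_high.shape == low_order.shape
    return (1 - w) * pred_high + w * low_order

F_pred = np.ones((8, 8, 4))   # predicted high-order feature (toy values)
L_cur = np.zeros((8, 8, 4))   # low-order feature from the spatial branch
F_out = refine(F_pred, L_cur)
print(float(F_out[0, 0, 0]))  # 0.8
```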
Fig. 4 illustrates the flow of high-order features over time: for key frame t, the high-order feature can be passed to the non-key frames at times t+1 and t+2 to predict their respective high-order features. A high-order feature is obtained by prediction in two ways: in the first, the current frame immediately follows a key frame, so only the key frame's high-order feature can be used for prediction; in the second, the current frame is not a key frame and does not immediately follow one, so the high-order features of its two preceding frames are used for prediction. The similarity between video frames reduces the difficulty of predicting high-order features. The invention uses a shallow network to predict them; this shallow network contains 2 blocks, each block containing 2 convolutional layers, one BN layer and one ReLU layer, combined in a ResBlock structure.
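The layer composition of this shallow prediction network can be enumerated as follows; the labels are hypothetical names, and the sketch only counts structure rather than implementing the convolutions:

```python
def shallow_predictor_spec(blocks=2, convs_per_block=2):
    # Enumerate the layers of the shallow prediction network: each block holds
    # two convolution layers, one BN layer and one ReLU layer, wrapped in a
    # ResBlock-style skip connection.
    layers = []
    for b in range(blocks):
        layers += ["block%d/conv%d" % (b, c) for c in range(convs_per_block)]
        layers += ["block%d/bn" % b, "block%d/relu" % b, "block%d/residual_add" % b]
    return layers

spec = shallow_predictor_spec()
print(len(spec))  # 10
```

With 2 blocks of 2 convolutions each, the predictor stays far cheaper than the full segmentation network, which is what makes per-frame prediction affordable.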
In step S103, the high-order semantic features of the multiple key frames and the high-order semantic features of the multiple non-key frames are classified and upsampled to a preset size to generate the video semantic segmentation result.
Specifically, once the high-order feature of each video frame has been obtained, it needs to be classified: the high-order feature is classified with softmax and then upsampled to the original picture size, finally yielding the video semantic segmentation result.
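The classification-and-upsampling step can be sketched with numpy as follows; nearest-neighbour upsampling and the 19-class feature map are illustrative assumptions, not the embodiment's exact choices:

```python
import numpy as np

def segment(high_order, upscale):
    # Softmax over the channel axis gives per-pixel class probabilities;
    # the argmax label map is then upsampled (nearest neighbour here) back
    # to the original picture size.
    e = np.exp(high_order - high_order.max(axis=-1, keepdims=True))
    prob = e / e.sum(axis=-1, keepdims=True)
    labels = prob.argmax(axis=-1)
    return labels.repeat(upscale, axis=0).repeat(upscale, axis=1)

feat = np.random.rand(8, 8, 19)   # 19 classes, an illustrative choice
mask = segment(feat, upscale=8)
print(mask.shape)  # (64, 64)
```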
According to the video semantic segmentation method based on prediction-driven feature propagation proposed in the embodiments of the invention, an adaptive key-frame extraction method first judges whether the current frame is a new key frame. The choice of key frames ensures that subsequent frames are semantically similar to the key frame, and since the prediction of high-order features propagates sequentially from this key frame, the accuracy of the predicted video semantic segmentation can be ensured. Second, the prediction method exploits the temporal information of high-order features and takes into account the similarity of high-order features between consecutive video frames, so high-order features can be propagated effectively. Although the predicted high-order feature captures the main semantic information of the video frame, it lacks the fine spatial relationships and edge information carried by low-order information; fusing low-order features to fine-tune the predicted high-order feature helps obtain a finer result, improving the efficiency of the algorithm while guaranteeing accuracy.
Next, the video semantic segmentation device that performs feature propagation based on prediction, proposed according to embodiments of the invention, is described with reference to the accompanying drawings.
Fig. 6 is a schematic structural diagram of a video semantic segmentation device that performs feature propagation based on prediction according to one embodiment of the invention.
As shown in Fig. 6, the video semantic segmentation device 10 includes: a first acquisition module 100, a second acquisition module 200 and a segmentation module 300.
The first acquisition module 100 is configured to predict the semantic difference of video frames with a shallow neural network and obtain multiple key frames and multiple non-key frames in the video frames.
The second acquisition module 200 is configured to obtain the high-order semantic features of the multiple key frames from the picture semantic segmentation network and to predict the high-order semantic features of the multiple non-key frames according to the temporal information of the high-order semantic features.
The segmentation module 300 is configured to classify the high-order semantic features of the multiple key frames and of the multiple non-key frames, and to upsample them to a preset size to generate the video semantic segmentation result.
The video semantic segmentation device 10 needs no assumptions about high-order and low-order features; it obtains the video semantic segmentation by prediction and fine-tuning.
Further, one embodiment of the invention further includes a training module,
the training module being configured to generate and train the picture semantic segmentation network so as to perform semantic segmentation on each frame of the video frames.
Further, in one embodiment of the invention, the first acquisition module includes:
a judging unit, configured to judge whether the semantic difference between the current frame and the previous key frame is greater than a preset threshold;
a confirmation unit, configured to confirm the current frame as a new key frame when the semantic difference between the current frame and the previous key frame is greater than the preset threshold; otherwise the current frame is a non-key frame.
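The threshold test performed by these two units can be sketched as follows, with the semantic-difference function and the threshold value as illustrative placeholders for the shallow network's output:

```python
def select_key_frames(frames, diff, threshold):
    # A frame becomes a new key frame when its semantic difference from the
    # previous key frame exceeds the threshold; frame 0 is taken as key.
    keys = [True]
    last_key = frames[0]
    for f in frames[1:]:
        if diff(f, last_key) > threshold:
            keys.append(True)
            last_key = f
        else:
            keys.append(False)
    return keys

# Toy example: frames as scalars, |a - b| standing in for the semantic difference.
flags = select_key_frames([0.0, 0.1, 0.4, 0.5], lambda a, b: abs(a - b), 0.3)
print(flags)  # [True, False, True, False]
```

Note that the difference is always measured against the most recent key frame, not the immediately preceding frame, so slow drift eventually triggers a new key frame.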
Further, in one embodiment of the invention, the second acquisition module is specifically configured to:
if the current frame is a key frame, obtain its high-order semantic feature directly through the semantic segmentation performed by the picture semantic segmentation network;
if the current frame is a non-key frame and its previous frame is a key frame, predict the high-order semantic feature of the current frame from the high-order semantic feature of that key frame;
if the current frame is a non-key frame and its previous frame is also a non-key frame, predict the high-order semantic feature of the current frame from the high-order semantic features of the two preceding frames.
Further, one embodiment of the invention further includes: an adjustment module, configured to generate the low-order features of the multiple non-key frames through the spatial branch of the picture semantic segmentation network, and to adjust the predicted high-order semantic features of the multiple non-key frames according to the low-order features to obtain accurate high-order semantic features of the multiple non-key frames.
It should be noted that the foregoing explanation of the embodiment of the video semantic segmentation method based on prediction-driven feature propagation also applies to the device of this embodiment, and details are not repeated here.
According to the video semantic segmentation device based on prediction-driven feature propagation proposed in the embodiments of the invention, an adaptive key-frame extraction method first judges whether the current frame is a new key frame. The choice of key frames ensures that subsequent frames are semantically similar to the key frame, and since the prediction of high-order features propagates sequentially from this key frame, the accuracy of the predicted video semantic segmentation can be ensured. Second, the prediction method exploits the temporal information of high-order features and takes into account the similarity of high-order features between consecutive video frames, so high-order features can be propagated effectively. Although the predicted high-order feature captures the main semantic information of the video frame, it lacks the fine spatial relationships and edge information carried by low-order information; fusing low-order features to fine-tune the predicted high-order feature helps obtain a finer result, improving the efficiency of the algorithm while guaranteeing accuracy.
In addition, the terms "first" and "second" are used for descriptive purposes only and shall not be understood as indicating or implying relative importance or implicitly indicating the number of technical features referred to. Thus a feature defined with "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the invention, "plurality" means at least two, for example two or three, unless otherwise specifically defined.
In the description of this specification, reference to the terms "one embodiment", "some embodiments", "example", "specific example" or "some examples" means that a specific feature, structure, material or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, schematic uses of these terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, provided they do not contradict each other, those skilled in the art may combine the features of the different embodiments or examples described in this specification.
Although the embodiments of the invention have been shown and described above, it should be understood that the above embodiments are exemplary and shall not be construed as limiting the invention; those skilled in the art may make changes, modifications, substitutions and variations to the above embodiments within the scope of the invention.

Claims (10)

1. A video semantic segmentation method that performs feature propagation based on prediction, characterized by comprising the following steps:
predicting the semantic difference of video frames with a shallow neural network, and obtaining multiple key frames and multiple non-key frames in the video frames;
obtaining the high-order semantic features of the multiple key frames from a picture semantic segmentation network, and predicting the high-order semantic features of the multiple non-key frames according to the temporal information of the high-order semantic features;
classifying the high-order semantic features of the multiple key frames and of the multiple non-key frames, and upsampling to a preset size to generate a video semantic segmentation result.
2. The method according to claim 1, characterized by further comprising:
generating and training the picture semantic segmentation network to perform semantic segmentation on each frame of the video frames.
3. The method according to claim 1, characterized in that predicting the semantic difference of video frames with the shallow neural network and obtaining the multiple key frames and the multiple non-key frames in the video frames specifically comprises:
judging whether the semantic difference between the current frame and the previous key frame is greater than a preset threshold;
if it is greater than the preset threshold, the current frame is a new key frame; otherwise, the current frame is a non-key frame.
4. The method according to claim 1, characterized in that obtaining the high-order semantic features of the multiple key frames from the picture semantic segmentation network and predicting the high-order semantic features of the multiple non-key frames according to the temporal information of the high-order semantic features specifically comprises:
if the current frame is a key frame, obtaining its high-order semantic feature directly through the semantic segmentation performed by the picture semantic segmentation network;
if the current frame is a non-key frame and its previous frame is a key frame, predicting the high-order semantic feature of the current frame from the high-order semantic feature of that key frame;
if the current frame is a non-key frame and its previous frame is also a non-key frame, predicting the high-order semantic feature of the current frame from the high-order semantic features of the two preceding frames.
5. The method according to claim 1, characterized by further comprising:
generating the low-order features of the multiple non-key frames through the spatial branch of the picture semantic segmentation network, and adjusting the predicted high-order semantic features of the multiple non-key frames according to the low-order features to obtain accurate high-order semantic features of the multiple non-key frames.
6. A video semantic segmentation device that performs feature propagation based on prediction, characterized by comprising:
a first acquisition module, configured to predict the semantic difference of video frames with a shallow neural network and obtain multiple key frames and multiple non-key frames in the video frames;
a second acquisition module, configured to obtain the high-order semantic features of the multiple key frames from a picture semantic segmentation network and predict the high-order semantic features of the multiple non-key frames according to the temporal information of the high-order semantic features;
a segmentation module, configured to classify the high-order semantic features of the multiple key frames and of the multiple non-key frames, and to upsample to a preset size to generate a video semantic segmentation result.
7. The device according to claim 6, characterized by further comprising: a training module,
the training module being configured to generate and train the picture semantic segmentation network so as to perform semantic segmentation on each frame of the video frames.
8. The device according to claim 6, characterized in that the first acquisition module comprises:
a judging unit, configured to judge whether the semantic difference between the current frame and the previous key frame is greater than a preset threshold;
a confirmation unit, configured to confirm the current frame as a new key frame when the semantic difference between the current frame and the previous key frame is greater than the preset threshold, and otherwise as a non-key frame.
9. The device according to claim 6, characterized in that the second acquisition module is specifically configured to:
if the current frame is a key frame, obtain its high-order semantic feature directly through the semantic segmentation performed by the picture semantic segmentation network;
if the current frame is a non-key frame and its previous frame is a key frame, predict the high-order semantic feature of the current frame from the high-order semantic feature of that key frame;
if the current frame is a non-key frame and its previous frame is also a non-key frame, predict the high-order semantic feature of the current frame from the high-order semantic features of the two preceding frames.
10. The device according to claim 6, characterized by further comprising: an adjustment module,
the adjustment module being configured to generate the low-order features of the multiple non-key frames through the spatial branch of the picture semantic segmentation network, and to adjust the predicted high-order semantic features of the multiple non-key frames according to the low-order features to obtain accurate high-order semantic features of the multiple non-key frames.
CN201910120021.XA 2019-02-18 2019-02-18 The video semanteme dividing method and device of feature propagation are carried out based on prediction Pending CN109919044A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910120021.XA CN109919044A (en) 2019-02-18 2019-02-18 The video semanteme dividing method and device of feature propagation are carried out based on prediction


Publications (1)

Publication Number Publication Date
CN109919044A true CN109919044A (en) 2019-06-21

Family

ID=66961650

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910120021.XA Pending CN109919044A (en) 2019-02-18 2019-02-18 The video semanteme dividing method and device of feature propagation are carried out based on prediction

Country Status (1)

Country Link
CN (1) CN109919044A (en)


Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107590442A (en) * 2017-08-22 2018-01-16 华中科技大学 A kind of video semanteme Scene Segmentation based on convolutional neural networks
CN108235116A (en) * 2017-12-27 2018-06-29 北京市商汤科技开发有限公司 Feature propagation method and device, electronic equipment, program and medium


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
XIZHOU ZHU ET AL.: "Deep Feature Flow for Video Recognition", 《ARXIV》 *
YULE LI ET AL.: "Low-Latency Video Semantic Segmentation", 《ARXIV》 *

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110796058A (en) * 2019-10-23 2020-02-14 深圳龙岗智能视听研究院 Video behavior identification method based on key frame extraction and hierarchical expression
CN111062395A (en) * 2019-11-27 2020-04-24 北京理工大学 Real-time video semantic segmentation method
CN111161306A (en) * 2019-12-31 2020-05-15 北京工业大学 Video target segmentation method based on motion attention
CN111161306B (en) * 2019-12-31 2023-06-02 北京工业大学 Video target segmentation method based on motion attention
CN111310594A (en) * 2020-01-20 2020-06-19 浙江大学 Video semantic segmentation method based on residual error correction
CN111310594B (en) * 2020-01-20 2023-04-28 浙江大学 Video semantic segmentation method based on residual error correction
CN111523442A (en) * 2020-04-21 2020-08-11 东南大学 Self-adaptive key frame selection method in video semantic segmentation
CN112364822B (en) * 2020-11-30 2022-08-19 重庆电子工程职业学院 Automatic driving video semantic segmentation system and method
CN112364822A (en) * 2020-11-30 2021-02-12 重庆电子工程职业学院 Automatic driving video semantic segmentation system and method
CN112613516A (en) * 2020-12-11 2021-04-06 北京影谱科技股份有限公司 Semantic segmentation method for aerial video data
CN112527993A (en) * 2020-12-17 2021-03-19 浙江财经大学东方学院 Cross-media hierarchical deep video question-answer reasoning framework
CN114143541A (en) * 2021-11-09 2022-03-04 华中科技大学 Cloud edge collaborative video compression uploading method and device for semantic segmentation
CN114143541B (en) * 2021-11-09 2023-02-14 华中科技大学 Cloud edge collaborative video compression uploading method and device for semantic segmentation
CN116883915A (en) * 2023-09-06 2023-10-13 常州星宇车灯股份有限公司 Target detection method and system based on front and rear frame image association
CN116883915B (en) * 2023-09-06 2023-11-21 常州星宇车灯股份有限公司 Target detection method and system based on front and rear frame image association

Similar Documents

Publication Publication Date Title
CN109919044A (en) The video semanteme dividing method and device of feature propagation are carried out based on prediction
CN110322446B (en) Domain self-adaptive semantic segmentation method based on similarity space alignment
CN106157307B (en) A kind of monocular image depth estimation method based on multiple dimensioned CNN and continuous CRF
CN108961327A (en) A kind of monocular depth estimation method and its device, equipment and storage medium
US20210150203A1 (en) Parametric top-view representation of complex road scenes
US20080310717A1 (en) Apparatus and Method for Image Labeling
CN109903310A (en) Method for tracking target, device, computer installation and computer storage medium
CN111767927A (en) Lightweight license plate recognition method and system based on full convolution network
CN110705412A (en) Video target detection method based on motion history image
US20220398737A1 (en) Medical image segmentation method based on u-network
CN110852199A (en) Foreground extraction method based on double-frame coding and decoding model
CN111402126A (en) Video super-resolution method and system based on blocks
CN103279473A (en) Method, system and mobile terminal for searching massive amounts of video content
CN105046689A (en) Method for fast segmenting interactive stereo image based on multilayer graph structure
CN113850900A (en) Method and system for recovering depth map based on image and geometric clue in three-dimensional reconstruction
CN114463492A (en) Adaptive channel attention three-dimensional reconstruction method based on deep learning
CN116485860A (en) Monocular depth prediction algorithm based on multi-scale progressive interaction and aggregation cross attention features
CN112686233B (en) Lane line identification method and device based on lightweight edge calculation
Birchfield Depth and motion discontinuities
CN112164065B (en) Real-time image semantic segmentation method based on lightweight convolutional neural network
CN115578436A (en) Monocular depth prediction method based on multi-level feature parallel interaction fusion
CN112598043B (en) Collaborative saliency detection method based on weak supervised learning
CN117333682A (en) Multi-view three-dimensional reconstruction method based on self-attention mechanism
CN112802026A (en) Deep learning-based real-time traffic scene semantic segmentation method
CN117156078B (en) Video data processing method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190621