CN109919044A - Video semantic segmentation method and apparatus for prediction-based feature propagation - Google Patents
- Publication number
- CN109919044A (publication); CN201910120021.XA (application)
- Authority
- CN
- China
- Prior art keywords
- frame
- feature
- order
- semantic
- key
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Image Analysis (AREA)
Abstract
The invention discloses a video semantic segmentation method and apparatus that propagate features based on prediction. The method comprises: predicting the semantic difference between video frames with a shallow neural network to obtain a plurality of key frames and a plurality of non-key frames among the video frames; obtaining the high-order semantic features of the key frames through a picture semantic segmentation network, and predicting the high-order semantic features of the non-key frames according to the timing information of the high-order semantic features; and classifying the high-order semantic features of the key frames and of the non-key frames, sampling them to a preset size, and generating the video semantic segmentation result. The method makes no assumptions about high-order and low-order features; it obtains the video semantic segmentation through prediction and fine-tuning, and can reduce the time complexity of the algorithm while guaranteeing video segmentation accuracy.
Description
Technical field
The present invention relates to the technical field of video frame feature propagation, and in particular to a video semantic segmentation method and apparatus that carry out feature propagation based on prediction.
Background technique
Feature propagation technology is of crucial importance to real-time tasks. It reuses features that have already been computed and, taking the temporal continuity of sequence data into account, propagates them to the task at the next time step in order to obtain the features of that moment. Feature propagation can therefore significantly reduce the time complexity of feature extraction for sequence data, while the use of timing information keeps the obtained features highly accurate. The technology can be applied to sequence data tasks such as video and audio. In this patent, the semantic segmentation task on video data is taken as the example to illustrate the proposed video semantic segmentation technique that carries out feature propagation based on prediction.
The main goal of the semantic segmentation task is to divide the scene in a picture into different semantic regions. Semantic segmentation has been widely applied in practical tasks: in automatic driving, foreground objects such as roads, pedestrians, trees, sky and buildings are segmented from the background; in medical imaging, tumor regions are segmented; in robotics, it supports thorough perception of the scene; and so on.
Semantic segmentation plays an important role in deep scene understanding. Compared with picture-based semantic segmentation, video-based semantic segmentation is broader and more natural in practical applications, because most real applications are based on video rather than still pictures.
The goal of the video semantic segmentation task is to obtain the semantic segmentation of each video frame. Directly applying picture semantic segmentation to every frame of a video is a straightforward method, but it leads to excessively high time complexity and cannot satisfy real-time requirements; it also ignores the timing information of the video frames. For video semantic segmentation, both the processing speed and the accuracy of each frame matter, whereas picture-based semantic segmentation focuses only on segmentation accuracy. Another difficulty of video semantic segmentation is the lack of sufficient labeled data, so the semantic segmentation of every video frame cannot be supervised. It is therefore necessary to consider semantic segmentation techniques designed specifically for video.
Current research on video semantic segmentation can be roughly divided into two classes: non-propagation-based methods and propagation-based methods. Both exploit the continuity of video frames. The first class uses the timing information in video data to improve the segmentation precision of each frame; the second class propagates high-order features through timing information to reduce time complexity while maintaining segmentation precision. In the related art, considering that adding motion information in three-dimensional space makes it harder to discover pixel matching relationships between adjacent frames, temporal and spatial constraints are introduced to optimize the mapping of pixels to a Euclidean feature space, and a conditional random field is used for modeling to obtain the semantic segmentation. The related art also uses a convolutional neural network to extract the spatial information of video frames, then models the temporal information between frames with an LSTM (Long Short-Term Memory) network, and finally obtains the semantic segmentation with a classifier and a deconvolution neural network. Another related technique first computes the optical flow between adjacent frames, uses it to estimate the corresponding high-order information, and combines this with the semantic segmentation computed on the current frame to obtain a more accurate result. The first class of methods mostly considers how to improve segmentation precision without considering how to reduce the time complexity of the algorithm; the second class mainly reduces time complexity while preserving performance as much as possible.
For video semantic segmentation, obtaining the high-order information of a video frame through a multi-layer deep convolutional network is especially time-consuming. How to avoid computing the high-order information of every video frame is therefore the key to reducing the time complexity of the algorithm and meeting real-time requirements. At the same time, within one video the change between adjacent frames is very small, so the difference between their high-order semantic information is correspondingly small. Propagation-based methods were thus proposed: they mainly exploit existing high-order features and propagate or reuse them, avoiding repeated computation. The related art divides a fully connected convolutional neural network into different sub-modules and reuses the features of certain sub-modules through adaptive scheduling; although this recycles features, it copies them directly and ignores the differences between frames. The related art also proposes an optical-flow-based method that propagates features rather than reusing them. This method assumes that low-order and high-order features share optical flow information: it first computes the low-order optical-flow features between two adjacent frames and then applies them to the corresponding high-order features to obtain the high-order features of the next frame. Although this avoids computing high-order features, it introduces the complexity of computing optical flow. The related art further proposes a scheme based on spatially invariant kernels, which assumes that low-order and high-order features directly share a spatially invariant kernel rather than optical flow, thereby avoiding the complexity of optical flow computation: the spatially invariant kernel is computed from the low-order features of the current frame and the next frame and then applied to the high-order features of the current frame to obtain the high-order features of the next frame. In addition, an adaptive key frame selection strategy was proposed.
Besides the above two major classes, there are other methods that trade off computational complexity against accuracy. The related art applies conditional random fields to image-level semantic segmentation and can perform joint inference over video frames; inferring the semantic segmentation of video frames through joint inference has been verified to be a relatively effective approach. The related art also proposes a cost-sensitive semantic segmentation framework that uses a visual attention model to select a subset of the video frames and labels the remaining frames with an interpolation model. Both methods attempt to label all video frames simultaneously, but they require the entire video in advance and therefore cannot handle online video data well.
In addition, for the video semantic segmentation task, the data used for training and testing in the experiments come from two public datasets, Cityscapes and CamVid. The Cityscapes dataset was built to improve the semantic understanding of urban scenes. It contains 30 categories, 19 of which are used to evaluate semantic segmentation performance, with a total of 5000 finely labeled images and 20000 coarsely labeled images. The resolution of each picture is 1024*2048. The finely labeled data comprises 2975 training images, 500 validation images and 1525 test images; each labeled image is the 20th frame selected from a continuous 30-frame video clip. The CamVid dataset is a driving-based segmentation and recognition dataset containing 701 pictures, each with a resolution of 720*960. It contains 32 categories, 11 of which are used for evaluation, and it is divided into a training set, a validation set and a test set of 367, 101 and 233 pictures respectively.
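The dataset statistics above can be collected into a small configuration sketch with sanity checks on the totals. The field names are illustrative, not from any official dataset API; note that the standard CamVid split is 367/101/233, since the three parts must sum to the stated 701 pictures.

```python
# Benchmark dataset statistics as described in the text (illustrative
# field names; not an official API of either dataset).
DATASETS = {
    "Cityscapes": {
        "resolution": (1024, 2048),
        "classes_total": 30,
        "classes_evaluated": 19,
        "fine_annotated": {"train": 2975, "val": 500, "test": 1525},
        "coarse_annotated": 20000,
    },
    "CamVid": {
        "resolution": (720, 960),
        "classes_total": 32,
        "classes_evaluated": 11,
        "split": {"train": 367, "val": 101, "test": 233},
    },
}

# Sanity checks: split sizes sum to the stated totals.
fine = DATASETS["Cityscapes"]["fine_annotated"]
assert sum(fine.values()) == 5000
assert sum(DATASETS["CamVid"]["split"].values()) == 701
```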
In the related art, picture-based semantic segmentation methods are migrated directly to video processing. These methods have high accuracy but must process every frame of the video; they cannot exploit its timing information, so the time complexity of the algorithm is often high. Although non-propagation methods can achieve excellent performance, most of them focus only on accuracy and ignore efficiency. Propagation-based methods can reduce algorithmic complexity while maintaining performance, but they rest on strong prior assumptions, namely that high-order and low-order features share structural information such as optical flow or kernels, and thereby ignore the semantic gap between high-order and low-order features. Other video segmentation methods, such as those that label all frames simultaneously, need the entire video in advance for mutual supervision, but in practical applications the entire video is often unavailable.
Summary of the invention
The present invention aims to solve at least one of the technical problems in the related art.
To this end, one object of the present invention is to propose a video semantic segmentation method that carries out feature propagation based on prediction. The method makes no assumptions about high-order and low-order features and obtains the video semantic segmentation through prediction and fine-tuning.
Another object of the present invention is to propose a video semantic segmentation apparatus that carries out feature propagation based on prediction.
To achieve the above objects, an embodiment of one aspect of the present invention proposes a video semantic segmentation method that carries out feature propagation based on prediction, comprising: predicting the semantic difference between video frames with a shallow neural network to obtain a plurality of key frames and a plurality of non-key frames among the video frames; obtaining the high-order semantic features of the plurality of key frames through a picture semantic segmentation network, and predicting the high-order semantic features of the plurality of non-key frames according to the timing information of the high-order semantic features; and classifying the high-order semantic features of the plurality of key frames and of the plurality of non-key frames, sampling them to a preset size, and generating the video semantic segmentation result.
In the video semantic segmentation method of the embodiment of the present invention, an adaptive key frame selection method first judges whether the current frame is a new key frame. The selection of key frames ensures that subsequent frames are semantically similar to the key frame; moreover, the prediction of high-order features is always propagated from the high-order information of this key frame, which guarantees the accuracy of the video semantic segmentation obtained by prediction. Second, the prediction makes use of the timing information of the high-order features as well as the similarity of high-order features between consecutive video frames, so the high-order features can be propagated effectively. The high-order features obtained by prediction capture the main semantic information of a video frame but lack the fine spatial position relationships and edge information carried by low-order information; by fusing the low-order features to fine-tune the predicted high-order features, finer results can be obtained, improving the efficiency of the algorithm while guaranteeing its accuracy.
In addition, the video semantic segmentation method according to the above embodiment of the present invention may also have the following additional technical features.
Further, in one embodiment of the present invention, the method further comprises: generating and training the picture semantic segmentation network to perform semantic segmentation on each of the video frames.
Further, in one embodiment of the present invention, predicting the semantic difference between video frames with the shallow neural network to obtain the plurality of key frames and the plurality of non-key frames specifically comprises:
judging whether the semantic difference between the current frame and the previous key frame is greater than a preset threshold;
if it is greater than the preset threshold, the current frame is a new key frame; otherwise, the current frame is a non-key frame.
Further, in one embodiment of the present invention, obtaining the high-order semantic features of the plurality of key frames through the picture semantic segmentation network and predicting the high-order semantic features of the plurality of non-key frames according to the timing information of the high-order semantic features specifically comprises:
if the current frame is a key frame, directly obtaining its high-order semantic features through the semantic segmentation technique of the picture semantic segmentation network;
if the current frame is a non-key frame and its previous frame is a key frame, predicting the high-order semantic features of the current frame from the high-order semantic features of that key frame;
if the current frame is a non-key frame and its previous frame is also a non-key frame, predicting the high-order semantic features of the current frame from the high-order semantic features of the two preceding frames.
Further, in one embodiment of the present invention, the method further comprises: generating the low-order features of the plurality of non-key frames through the spatial branch of the picture semantic segmentation network, and adjusting the predicted high-order semantic features of the plurality of non-key frames according to the low-order features to obtain accurate high-order semantic features of the plurality of non-key frames.
To achieve the above objects, an embodiment of another aspect of the present invention proposes a video semantic segmentation apparatus that carries out feature propagation based on prediction, comprising: a first obtaining module for predicting the semantic difference between video frames with a shallow neural network to obtain a plurality of key frames and a plurality of non-key frames among the video frames; a second obtaining module for obtaining the high-order semantic features of the plurality of key frames through a picture semantic segmentation network and predicting the high-order semantic features of the plurality of non-key frames according to the timing information of the high-order semantic features; and a segmentation module for classifying the high-order semantic features of the plurality of key frames and of the plurality of non-key frames, sampling them to a preset size, and generating the video semantic segmentation result.
In the video semantic segmentation apparatus of the embodiment of the present invention, an adaptive key frame selection method first judges whether the current frame is a new key frame. The selection of key frames ensures that subsequent frames are semantically similar to the key frame; moreover, the prediction of high-order features is always propagated from the high-order information of this key frame, which guarantees the accuracy of the video semantic segmentation obtained by prediction. Second, the prediction makes use of the timing information of the high-order features as well as the similarity of high-order features between consecutive video frames, so the high-order features can be propagated effectively. The high-order features obtained by prediction capture the main semantic information of a video frame but lack the fine spatial position relationships and edge information carried by low-order information; by fusing the low-order features to fine-tune the predicted high-order features, finer results can be obtained, improving the efficiency of the algorithm while guaranteeing its accuracy.
In addition, the video semantic segmentation apparatus according to the above embodiment of the present invention may also have the following additional technical features.
Further, in one embodiment of the present invention, the apparatus further comprises a training module for generating and training the picture semantic segmentation network to perform semantic segmentation on each of the video frames.
Further, in one embodiment of the present invention, the first obtaining module comprises:
a judging unit for judging whether the semantic difference between the current frame and the previous key frame is greater than a preset threshold;
a confirmation unit for confirming, when the semantic difference between the current frame and the previous key frame is greater than the preset threshold, that the current frame is a new key frame, and otherwise that the current frame is a non-key frame.
Further, in one embodiment of the present invention, the second obtaining module is specifically configured to:
if the current frame is a key frame, directly obtain its high-order semantic features through the semantic segmentation technique of the picture semantic segmentation network;
if the current frame is a non-key frame and its previous frame is a key frame, predict the high-order semantic features of the current frame from the high-order semantic features of that key frame;
if the current frame is a non-key frame and its previous frame is also a non-key frame, predict the high-order semantic features of the current frame from the high-order semantic features of the two preceding frames.
Further, in one embodiment of the present invention, the apparatus further comprises an adjustment module for generating the low-order features of the plurality of non-key frames through the spatial branch of the picture semantic segmentation network, and adjusting the predicted high-order semantic features of the plurality of non-key frames according to the low-order features to obtain accurate high-order semantic features of the plurality of non-key frames.
Additional aspects and advantages of the present invention will be set forth in part in the following description, and in part will become apparent from the description or be learned through practice of the invention.
Brief description of the drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily understood from the following description of the embodiments taken in conjunction with the accompanying drawings, in which:
Fig. 1 is a comparison between one embodiment of the present invention and traditional propagation methods;
Fig. 2 is a flowchart of the video semantic segmentation method that carries out feature propagation based on prediction according to one embodiment of the present invention;
Fig. 3 is a timing flowchart according to one embodiment of the present invention;
Fig. 4 is a flowchart of high-order features over time according to one embodiment of the present invention;
Fig. 5 shows key frame selection results according to one embodiment of the present invention;
Fig. 6 is a structural diagram of the video semantic segmentation apparatus that carries out feature propagation based on prediction according to one embodiment of the present invention.
Specific embodiments
The embodiments of the present invention are described in detail below; examples of the embodiments are shown in the accompanying drawings, in which the same or similar reference numerals throughout denote the same or similar elements or elements with the same or similar functions. The embodiments described below with reference to the drawings are exemplary and intended to explain the present invention; they are not to be construed as limiting the invention.
The video semantic segmentation method and apparatus that carry out feature propagation based on prediction proposed according to the embodiments of the present invention are described below with reference to the accompanying drawings, beginning with the method.
As shown in Fig. 1, which compares the embodiment of the present invention with traditional propagation methods: Fig. 1(a) is picture semantic segmentation based on a fully connected convolutional network; Fig. 1(b) and Fig. 1(c) show feature propagation methods based on optical flow and on spatially invariant kernels respectively; and Fig. 1(d) is the method proposed by the embodiment of the present invention. The embodiment first uses the timing information of high-order semantic features to obtain the high-order semantic features of the current frame by prediction, and then uses low-order position and texture features to adjust the predicted high-order semantic features, so that the final high-order semantic features not only carry abstract semantics but also fuse in spatial information such as position and texture. The advantage of this method is that it makes no assumptions about high-order and low-order features; the video semantic segmentation is obtained through prediction and fine-tuning.
Fig. 2 is a flowchart of the video semantic segmentation method that carries out feature propagation based on prediction according to one embodiment of the present invention. As shown in Fig. 2, the method comprises the following steps.
In step S101, the semantic difference between video frames is predicted with a shallow neural network to obtain a plurality of key frames and a plurality of non-key frames among the video frames.
Further, in one embodiment of the present invention, this specifically comprises: judging whether the semantic difference between the current frame and the previous key frame is greater than a preset threshold; if it is greater than the preset threshold, the current frame is a new key frame; otherwise, the current frame is a non-key frame.
Specifically, the selection of key frames guarantees compactness between video frames: the video frames between two key frames are highly similar, which guarantees the accuracy of high-order features obtained by prediction or propagation. The traditional key frame selection method fixes a key frame every certain number of frames, which cannot reflect the semantic differences between video frames.
Further, key frames are chosen with an adaptive key frame selection method, and the high-order semantic information of the non-key frames is obtained by propagating the high-order semantic information of the previous key frame, which ensures the accuracy of the proposed video semantic segmentation. The method can reduce the time complexity of the algorithm while guaranteeing video segmentation accuracy.
The embodiment of the present invention uses a shallow neural network to predict the abstract semantic difference between video frames and to dynamically select key frames. Two convolutional layers with 256 channels, a global average pooling layer and one fully connected layer regress the deviation of high-order semantic information between two video frames. The input of the network is the low-order features of the two video frames obtained through the spatial branch of the base network, and the output is the semantic deviation between the two frames.
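A minimal numerical sketch of this deviation predictor follows. It is not the patent's network: the 3x3 convolutions are replaced by 1x1 (per-pixel linear) maps so the example stays self-contained in NumPy, and all weights and feature shapes are illustrative. What it shows is the stated pipeline: concatenate the two frames' low-order features, two conv layers, global average pooling, one fully connected layer, scalar deviation out.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1x1(x, w):
    # x: (C_in, H, W); w: (C_out, C_in). A 1x1 convolution as a
    # per-pixel linear map, a stand-in for the real conv layers.
    return np.tensordot(w, x, axes=([1], [0]))  # -> (C_out, H, W)

def predict_deviation(feat_a, feat_b, params):
    # Concatenate the two frames' low-order features along channels,
    # apply two (simplified) 256-channel conv layers with ReLU, pool
    # globally, then a fully connected layer producing one scalar.
    x = np.concatenate([feat_a, feat_b], axis=0)
    x = np.maximum(conv1x1(x, params["w1"]), 0.0)
    x = np.maximum(conv1x1(x, params["w2"]), 0.0)
    pooled = x.mean(axis=(1, 2))            # global average pooling
    return float(params["w_fc"] @ pooled)    # scalar semantic deviation

# Toy low-order features: 64 channels on an 8x8 grid (shapes assumed).
fa = rng.standard_normal((64, 8, 8))
fb = fa + 0.1 * rng.standard_normal((64, 8, 8))
params = {
    "w1": rng.standard_normal((256, 128)) * 0.05,
    "w2": rng.standard_normal((256, 256)) * 0.05,
    "w_fc": rng.standard_normal(256) * 0.05,
}
dev = predict_deviation(fa, fb, params)
assert isinstance(dev, float)
```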
If the deviation of high-order semantic information between the previous key frame and the current frame predicted by the network exceeds a threshold, the current frame is judged to be a new key frame; otherwise it is not. Fig. 5 shows key frame selection results, in which every frame exceeding the threshold is judged to be a key frame. As the number of frames between the current frame and the previous key frame increases, the deviation of high-order semantic information gradually grows; when the semantic deviation between the two is greater than the threshold, the current frame is judged to be a new key frame, and when it is less than the threshold, the current frame is judged to be a non-key frame. The choice of key frames and non-key frames directly affects the acquisition of high-order features. This key frame selection method ensures that the video frames between key frames are highly similar, which in turn makes it reasonable to obtain the high-order semantic information of video frames by prediction.
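The selection loop described above can be sketched as follows. The `deviation` callable stands in for the shallow prediction network described in the text; here it is an arbitrary function, and the toy "frames" are plain numbers, both illustrative assumptions.

```python
def select_key_frames(deviation, threshold):
    # deviation(frame_a, frame_b) -> predicted high-order semantic
    # deviation between two frames. Frame 0 is always the first key
    # frame; a frame becomes a new key frame once its deviation from
    # the most recent key frame exceeds the threshold.
    def choose(frames):
        key_indices = [0]
        for i in range(1, len(frames)):
            if deviation(frames[key_indices[-1]], frames[i]) > threshold:
                key_indices.append(i)
        return key_indices
    return choose

# Toy 1-D "frames": deviation is just absolute difference.
frames = [0.0, 0.1, 0.2, 0.9, 1.0, 1.8]
choose = select_key_frames(lambda a, b: abs(a - b), threshold=0.5)
print(choose(frames))  # -> [0, 3, 5]
```

Note that the deviation is always measured against the most recent key frame, not the immediately preceding frame, matching the accumulation behavior described for Fig. 5.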
In step S102, the high-order semantic features of the plurality of key frames are obtained through the picture semantic segmentation network, and the high-order semantic features of the plurality of non-key frames are predicted according to the timing information of the high-order semantic features.
Further, in one embodiment of the present invention, the method further comprises: generating and training the picture semantic segmentation network to perform semantic segmentation on each of the video frames.
Specifically, the video semantic segmentation algorithm needs to propagate accurate high-order features in order to obtain accurate prediction results. The method of the embodiment of the present invention requires a pre-trained network that can perform semantic segmentation on each video frame, on which the high-order features of the key frames are extracted. In the embodiments of the present invention, the BiSeNet framework is chosen as the base picture semantic segmentation network. This network has two branches: the first is a spatial branch for extracting the low-order spatial and position features of a picture; the other is a context semantic branch for obtaining the contextual semantic information of the picture. The final high-order features are obtained by fusing the spatial branch and the context semantic branch. The BiSeNet framework is selected because the features extracted by the two branches are complementary: since the high-order semantic information of video frames is similar, the main semantic information of a video frame can be obtained by prediction, while concrete position and spatial information can be obtained by fusing low-order features. Thanks to this complementarity, recomputing high-order features can be avoided while retaining both context semantic information and spatial information. Furthermore, the two branches can be effectively parallelized on hardware devices.
Further, in one embodiment of the present invention, obtaining the high-order semantic features of the plurality of key frames through the picture semantic segmentation network and predicting the high-order semantic features of the plurality of non-key frames according to the timing information of the high-order semantic features specifically comprises:
if the current frame is a key frame, directly obtaining its high-order semantic features through the semantic segmentation technique of the picture semantic segmentation network;
if the current frame is a non-key frame and its previous frame is a key frame, predicting the high-order semantic features of the current frame from the high-order semantic features of that key frame;
if the current frame is a non-key frame and its previous frame is also a non-key frame, predicting the high-order semantic features of the current frame from the high-order semantic features of the two preceding frames.
Further, predicting the high-order semantic information of the current frame from the high-order semantic information of its two preceding frames avoids computing the high-order features of the current frame; that is, the relationship of the high-order features over time, namely the motion information of the high-order features, is exploited.
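The patent does not give the predictor's exact form, so the following is only one minimal instantiation of "using the relationship of high-order features over time": linear extrapolation, F_t ≈ F_{t-1} + (F_{t-1} - F_{t-2}), which treats the frame-to-frame feature change as constant short-term "motion". The feature shapes and values are toy data.

```python
import numpy as np

def predict_high_order(f_prev2, f_prev1):
    # Linear extrapolation along time (an assumed, minimal predictor):
    # F_t = F_{t-1} + (F_{t-1} - F_{t-2}) = 2*F_{t-1} - F_{t-2}.
    return 2.0 * f_prev1 - f_prev2

# Toy features drifting at a constant rate: extrapolation is exact.
f0 = np.zeros((4, 4))
f1 = np.ones((4, 4)) * 0.5
f2_pred = predict_high_order(f0, f1)
print(float(f2_pred[0, 0]))  # 1.0
```

When the feature change between frames is not constant, the extrapolated prediction drifts, which is exactly why the method fine-tunes the prediction with low-order features, as described below in the text.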
Specifically, as shown in Fig. 1 and Fig. 3, for the video frame at time t, if it is judged to be a key frame, its accurate high-order feature Ft is obtained through the underlying semantic segmentation technique; note that the high-order features of key frames are not obtained by prediction but by passing through the entire base network. If the video frame at time t+1 is also judged to be a key frame, its high-order feature is likewise obtained accurately through the entire base network, as at time t. Otherwise, if the video frame at time t+1 is not judged to be a key frame, its high-order feature must be predicted from the high-order feature at time t, while the low-order feature obtained through the spatial branch of the base network is fused with the predicted high-order feature for fine-tuning, giving it finer spatial and position information, to finally obtain the high-order feature Ft+1 at time t+1. Predicting the high-order features of non-key frames with this model exploits the temporal continuity of video data.
The feature-propagation approach assumes that if two frames are similar, their corresponding high-order features are also similar:
|f(x1) - f(x2)| < K|x1 - x2|
where x1 and x2 are two given pictures, f(x1) and f(x2) are their corresponding high-order features, and K is a constant.
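As a toy numerical illustration of this Lipschitz-style assumption (the map `f`, the inputs and the constant K below are fabricated for illustration and do not come from the embodiment):

```python
import numpy as np

def lipschitz_holds(x1, x2, f, K):
    """Check |f(x1) - f(x2)| < K * |x1 - x2| using L2 norms."""
    lhs = np.linalg.norm(f(x1) - f(x2))
    rhs = K * np.linalg.norm(x1 - x2)
    return lhs < rhs

# A toy "feature extractor": f halves its input, so it is Lipschitz
# with any constant K > 0.5; similar inputs give similar features.
f = lambda x: 0.5 * x
x1 = np.array([1.0, 2.0])
x2 = np.array([1.1, 2.1])
print(lipschitz_holds(x1, x2, f, K=1.0))  # True
```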
Given a video frame, if the previous stage S0 considers it a key frame, the picture is passed through the entire base semantic segmentation network to obtain its high-order feature Ft.
Further, in one embodiment of the invention, the method further includes: generating the low-order features of the multiple non-key frames through the spatial branch of the image semantic segmentation network, and adjusting the predicted high-order semantic features of the multiple non-key frames according to the low-order features, so as to obtain accurate high-order semantic features of the multiple non-key frames.
The low-order features include the low-order position and texture features of a non-key frame; they carry more spatial position information and fine edge information, which helps to obtain a more accurate semantic segmentation.
Further, although the high-order semantic information obtained by prediction captures the main semantic features of the current frame, it lacks spatial information such as position and texture. The method of this embodiment therefore fuses the low-order features of the current frame to adjust the predicted high-order semantic features, so that the final high-order feature of the current frame contains more spatial position information and fine edge information, which helps to obtain a more accurate semantic segmentation.
If a video frame is not considered a key frame, its high-order feature is first predicted from the high-order features of its two preceding frames; at the same time, the frame is passed through the spatial branch of the base network to obtain its low-order feature. The predicted high-order feature is then fused and fine-tuned with the low-order feature, yielding a high-order feature that carries both contextual information and spatial information.
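A minimal PyTorch sketch of this fusion fine-tuning step, assuming channel concatenation followed by a 1x1 convolution as the fusion operator (the embodiment does not specify the exact operator, so this design and the channel counts are assumptions):

```python
import torch
import torch.nn as nn

class FuseRefine(nn.Module):
    """Fuse a predicted high-order feature with the spatial branch's
    low-order feature to recover position and edge detail."""
    def __init__(self, high_ch, low_ch):
        super().__init__()
        # 1x1 conv mixes the concatenated channels back to high_ch
        self.mix = nn.Conv2d(high_ch + low_ch, high_ch, kernel_size=1)

    def forward(self, high_pred, low):
        # the low-order map is assumed to share spatial size with the
        # high-order map; otherwise it would need resizing first
        fused = torch.cat([high_pred, low], dim=1)
        return self.mix(fused)

m = FuseRefine(high_ch=64, low_ch=16)
out = m(torch.randn(1, 64, 32, 32), torch.randn(1, 16, 32, 32))
print(out.shape)  # torch.Size([1, 64, 32, 32])
```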
Fig. 4 illustrates the flow of high-order features over time: for key frame t, the high-order feature can be propagated to the non-key frames at times t+1 and t+2 to predict their respective high-order features. There are two ways to obtain a high-order feature by prediction. In the first, the current frame immediately follows a key frame, so only the high-order feature of that key frame can be used for prediction. In the second, the current frame is not a key frame and does not immediately follow a key frame, so the high-order features of its two preceding frames are used. The similarity between video frames reduces the difficulty of predicting high-order features. The present invention uses a shallow network to predict the high-order features. This shallow network contains 2 blocks, each consisting of 2 convolutional layers, one BN layer and one ReLU layer, combined in a ResBlock structure.
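The described shallow prediction network could be sketched in PyTorch as follows; the channel count, kernel sizes, and the way the two preceding features are merged are assumptions, since the text specifies only the block structure (2 blocks, each with 2 convolutional layers, one BN layer and one ReLU layer, wired as residual blocks):

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """One block: 2 conv layers, 1 BN, 1 ReLU, with a residual connection."""
    def __init__(self, ch):
        super().__init__()
        self.conv1 = nn.Conv2d(ch, ch, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(ch, ch, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm2d(ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        y = self.conv2(self.conv1(x))
        return self.relu(self.bn(y) + x)  # residual add, as in ResBlock

class ShallowPredictor(nn.Module):
    """Shallow network predicting the current high-order feature from
    the concatenated high-order features of the two preceding frames."""
    def __init__(self, ch=64):
        super().__init__()
        self.reduce = nn.Conv2d(2 * ch, ch, kernel_size=1)  # merge inputs
        self.blocks = nn.Sequential(ResBlock(ch), ResBlock(ch))

    def forward(self, f_prev2, f_prev1):
        x = self.reduce(torch.cat([f_prev2, f_prev1], dim=1))
        return self.blocks(x)

net = ShallowPredictor(ch=64)
pred = net(torch.randn(1, 64, 16, 16), torch.randn(1, 64, 16, 16))
print(pred.shape)  # torch.Size([1, 64, 16, 16])
```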
In step S103, the high-order semantic features of the multiple key frames and of the multiple non-key frames are classified and upsampled to a preset size to generate the video semantic segmentation result.
Specifically, after the high-order feature of each video frame has been obtained, it needs to be classified: the high-order feature is classified with softmax and then upsampled to the original picture size, finally yielding the video semantic segmentation result.
According to the video semantic segmentation method with prediction-based feature propagation proposed in the embodiments of the present invention, an adaptive key frame extraction method first judges whether the current frame is a new key frame. The selection of key frames ensures that subsequent frames are semantically similar to the key frame, and since the prediction of high-order features propagates sequentially from this key frame, the accuracy of the predicted video semantic segmentation can be ensured. Secondly, the prediction method exploits the temporal information of high-order features as well as the similarity of high-order features between consecutive video frames, so high-order features can be propagated effectively. Although the predicted high-order features capture the main semantic information of a video frame, they lack the fine spatial position and edge information carried by low-order features. Fusing the low-order features to fine-tune the predicted high-order features therefore helps to obtain finer results, improving the efficiency of the algorithm while guaranteeing accuracy.
The video semantic segmentation device with prediction-based feature propagation proposed according to the embodiments of the present invention is described next with reference to the accompanying drawings.
Fig. 6 is a schematic structural diagram of a video semantic segmentation device that carries out feature propagation based on prediction according to one embodiment of the invention.
As shown in Fig. 6, the video semantic segmentation device 10 includes: a first acquisition module 100, a second acquisition module 200 and a segmentation module 300.
The first acquisition module 100 predicts the semantic difference of video frames with a shallow neural network and obtains the multiple key frames and multiple non-key frames among the video frames.
The second acquisition module 200 obtains the high-order semantic features of the multiple key frames with the image semantic segmentation network and predicts the high-order semantic features of the multiple non-key frames according to the temporal information of the high-order semantic features.
The segmentation module 300 classifies the high-order semantic features of the multiple key frames and of the multiple non-key frames, and upsamples them to a preset size to generate the video semantic segmentation result.
The video semantic segmentation device 10 makes no assumptions about high-order and low-order features; it obtains the video semantic segmentation by prediction and fine-tuning.
Further, one embodiment of the invention further includes a training module, which generates and trains the image semantic segmentation network to perform semantic segmentation on each of the video frames.
Further, in one embodiment of the invention, the first acquisition module includes:
a judging unit, for judging whether the semantic difference between the current frame and the previous key frame is greater than a preset threshold;
a confirmation unit, for confirming the current frame as a new key frame when its semantic difference from the previous key frame is greater than the preset threshold, and otherwise confirming it as a non-key frame.
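The judging and confirmation units amount to thresholding a semantic-difference score against the most recent key frame. A toy sketch (the difference function below is a scalar stand-in for the shallow network's semantic-difference prediction, which the embodiment does not give as a formula):

```python
def select_key_frames(frames, diff, threshold):
    """Mark each frame key/non-key by comparing its semantic difference
    to the most recent key frame against a preset threshold."""
    is_key = [True]          # the first frame starts as a key frame
    last_key = frames[0]
    for frame in frames[1:]:
        if diff(frame, last_key) > threshold:
            is_key.append(True)   # large semantic change: new key frame
            last_key = frame
        else:
            is_key.append(False)  # still similar to the current key frame
    return is_key

# Toy run with scalar "frames" and absolute difference as the score
flags = select_key_frames([0, 1, 2, 10, 11],
                          diff=lambda a, b: abs(a - b), threshold=5)
print(flags)  # [True, False, False, True, False]
```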
Further, in one embodiment of the invention, the second acquisition module is specifically configured as follows:
if the current frame is a key frame, its high-order semantic feature is obtained directly through the semantic segmentation of the image semantic segmentation network;
if the current frame is a non-key frame and its previous frame is a key frame, the high-order semantic feature of the current frame is predicted from the high-order semantic feature of that key frame;
if the current frame is a non-key frame and its previous frame is also a non-key frame, the high-order semantic feature of the current frame is predicted from the high-order semantic features of the two preceding frames.
Further, one embodiment of the invention further includes an adjustment module, for generating the low-order features of the multiple non-key frames through the spatial branch of the image semantic segmentation network, and adjusting the predicted high-order semantic features of the multiple non-key frames according to the low-order features, so as to obtain accurate high-order semantic features of the multiple non-key frames.
It should be noted that the foregoing explanation of the embodiment of the video semantic segmentation method with prediction-based feature propagation also applies to the device of this embodiment; details are not repeated here.
According to the video semantic segmentation device with prediction-based feature propagation proposed in the embodiments of the present invention, an adaptive key frame extraction method first judges whether the current frame is a new key frame. The selection of key frames ensures that subsequent frames are semantically similar to the key frame, and since the prediction of high-order features propagates sequentially from this key frame, the accuracy of the predicted video semantic segmentation can be ensured. Secondly, the prediction method exploits the temporal information of high-order features as well as the similarity of high-order features between consecutive video frames, so high-order features can be propagated effectively. Although the predicted high-order features capture the main semantic information of a video frame, they lack the fine spatial position and edge information carried by low-order features. Fusing the low-order features to fine-tune the predicted high-order features therefore helps to obtain finer results, improving the efficiency of the algorithm while guaranteeing accuracy.
In addition, the terms "first" and "second" are used for descriptive purposes only and shall not be understood as indicating or implying relative importance, or as implicitly indicating the number of the technical features referred to. Thus, a feature qualified by "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "plurality" means at least two, for example two or three, unless otherwise specifically defined.
In the description of this specification, reference to the terms "one embodiment", "some embodiments", "example", "specific example" or "some examples" means that a particular feature, structure, material or characteristic described in conjunction with that embodiment or example is included in at least one embodiment or example of the invention. In this specification, schematic expressions of the above terms do not necessarily refer to the same embodiment or example. Moreover, the particular features, structures, materials or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, provided they do not conflict with each other, those skilled in the art may combine the features of different embodiments or examples described in this specification.
Although the embodiments of the present invention have been shown and described above, it should be understood that the above embodiments are exemplary and shall not be construed as limiting the invention; those skilled in the art can change, modify, replace and vary the above embodiments within the scope of the invention.
Claims (10)
1. A video semantic segmentation method that carries out feature propagation based on prediction, characterized by comprising the following steps:
predicting the semantic difference of video frames with a shallow neural network, and obtaining multiple key frames and multiple non-key frames among the video frames;
obtaining the high-order semantic features of the multiple key frames with an image semantic segmentation network, and predicting the high-order semantic features of the multiple non-key frames according to the temporal information of the high-order semantic features;
classifying the high-order semantic features of the multiple key frames and of the multiple non-key frames, and upsampling to a preset size, to generate a video semantic segmentation result.
2. The method according to claim 1, characterized by further comprising:
generating and training the image semantic segmentation network to perform semantic segmentation on each of the video frames.
3. The method according to claim 1, wherein predicting the semantic difference of video frames with the shallow neural network and obtaining the multiple key frames and multiple non-key frames among the video frames specifically comprises:
judging whether the semantic difference between the current frame and the previous key frame is greater than a preset threshold;
if it is greater than the preset threshold, the current frame is a new key frame; otherwise, the current frame is a non-key frame.
4. The method according to claim 1, wherein obtaining the high-order semantic features of the multiple key frames with the image semantic segmentation network and predicting the high-order semantic features of the multiple non-key frames according to the temporal information of the high-order semantic features specifically comprises:
if the current frame is a key frame, obtaining its high-order semantic feature directly through the semantic segmentation of the image semantic segmentation network;
if the current frame is a non-key frame and its previous frame is a key frame, predicting the high-order semantic feature of the current frame from the high-order semantic feature of that key frame;
if the current frame is a non-key frame and its previous frame is also a non-key frame, predicting the high-order semantic feature of the current frame from the high-order semantic features of the two preceding frames.
5. The method according to claim 1, characterized by further comprising:
generating the low-order features of the multiple non-key frames through the spatial branch of the image semantic segmentation network, and adjusting the predicted high-order semantic features of the multiple non-key frames according to the low-order features, to obtain accurate high-order semantic features of the multiple non-key frames.
6. A video semantic segmentation device that carries out feature propagation based on prediction, characterized by comprising:
a first acquisition module, for predicting the semantic difference of video frames with a shallow neural network and obtaining multiple key frames and multiple non-key frames among the video frames;
a second acquisition module, for obtaining the high-order semantic features of the multiple key frames with an image semantic segmentation network and predicting the high-order semantic features of the multiple non-key frames according to the temporal information of the high-order semantic features;
a segmentation module, for classifying the high-order semantic features of the multiple key frames and of the multiple non-key frames, and upsampling to a preset size, to generate a video semantic segmentation result.
7. The device according to claim 6, characterized by further comprising: a training module,
the training module being configured to generate and train the image semantic segmentation network to perform semantic segmentation on each of the video frames.
8. The device according to claim 6, wherein the first acquisition module comprises:
a judging unit, for judging whether the semantic difference between the current frame and the previous key frame is greater than a preset threshold;
a confirmation unit, for confirming the current frame as a new key frame when its semantic difference from the previous key frame is greater than the preset threshold, and otherwise confirming it as a non-key frame.
9. The device according to claim 6, wherein the second acquisition module is specifically configured to:
if the current frame is a key frame, obtain its high-order semantic feature directly through the semantic segmentation of the image semantic segmentation network;
if the current frame is a non-key frame and its previous frame is a key frame, predict the high-order semantic feature of the current frame from the high-order semantic feature of that key frame;
if the current frame is a non-key frame and its previous frame is also a non-key frame, predict the high-order semantic feature of the current frame from the high-order semantic features of the two preceding frames.
10. The device according to claim 6, characterized by further comprising: an adjustment module,
the adjustment module being configured to generate the low-order features of the multiple non-key frames through the spatial branch of the image semantic segmentation network, and to adjust the predicted high-order semantic features of the multiple non-key frames according to the low-order features, obtaining accurate high-order semantic features of the multiple non-key frames.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910120021.XA CN109919044A (en) | 2019-02-18 | 2019-02-18 | The video semanteme dividing method and device of feature propagation are carried out based on prediction |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109919044A true CN109919044A (en) | 2019-06-21 |
Family
ID=66961650
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910120021.XA Pending CN109919044A (en) | 2019-02-18 | 2019-02-18 | The video semanteme dividing method and device of feature propagation are carried out based on prediction |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109919044A (en) |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107590442A (en) * | 2017-08-22 | 2018-01-16 | 华中科技大学 | A kind of video semanteme Scene Segmentation based on convolutional neural networks |
CN108235116A (en) * | 2017-12-27 | 2018-06-29 | 北京市商汤科技开发有限公司 | Feature propagation method and device, electronic equipment, program and medium |
Non-Patent Citations (2)
Title |
---|
XIZHOU ZHU ET AL.: "Deep Feature Flow for Video Recognition", 《ARXIV》 * |
YULE LI ET AL.: "Low-Latency Video Semantic Segmentation", 《ARXIV》 * |
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110796058A (en) * | 2019-10-23 | 2020-02-14 | 深圳龙岗智能视听研究院 | Video behavior identification method based on key frame extraction and hierarchical expression |
CN111062395A (en) * | 2019-11-27 | 2020-04-24 | 北京理工大学 | Real-time video semantic segmentation method |
CN111161306A (en) * | 2019-12-31 | 2020-05-15 | 北京工业大学 | Video target segmentation method based on motion attention |
CN111161306B (en) * | 2019-12-31 | 2023-06-02 | 北京工业大学 | Video target segmentation method based on motion attention |
CN111310594A (en) * | 2020-01-20 | 2020-06-19 | 浙江大学 | Video semantic segmentation method based on residual error correction |
CN111310594B (en) * | 2020-01-20 | 2023-04-28 | 浙江大学 | Video semantic segmentation method based on residual error correction |
CN111523442A (en) * | 2020-04-21 | 2020-08-11 | 东南大学 | Self-adaptive key frame selection method in video semantic segmentation |
CN112364822B (en) * | 2020-11-30 | 2022-08-19 | 重庆电子工程职业学院 | Automatic driving video semantic segmentation system and method |
CN112364822A (en) * | 2020-11-30 | 2021-02-12 | 重庆电子工程职业学院 | Automatic driving video semantic segmentation system and method |
CN112613516A (en) * | 2020-12-11 | 2021-04-06 | 北京影谱科技股份有限公司 | Semantic segmentation method for aerial video data |
CN112527993A (en) * | 2020-12-17 | 2021-03-19 | 浙江财经大学东方学院 | Cross-media hierarchical deep video question-answer reasoning framework |
CN114143541A (en) * | 2021-11-09 | 2022-03-04 | 华中科技大学 | Cloud edge collaborative video compression uploading method and device for semantic segmentation |
CN114143541B (en) * | 2021-11-09 | 2023-02-14 | 华中科技大学 | Cloud edge collaborative video compression uploading method and device for semantic segmentation |
CN116883915A (en) * | 2023-09-06 | 2023-10-13 | 常州星宇车灯股份有限公司 | Target detection method and system based on front and rear frame image association |
CN116883915B (en) * | 2023-09-06 | 2023-11-21 | 常州星宇车灯股份有限公司 | Target detection method and system based on front and rear frame image association |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109919044A (en) | The video semanteme dividing method and device of feature propagation are carried out based on prediction | |
CN110322446B (en) | Domain self-adaptive semantic segmentation method based on similarity space alignment | |
CN106157307B (en) | A kind of monocular image depth estimation method based on multiple dimensioned CNN and continuous CRF | |
CN108961327A (en) | A kind of monocular depth estimation method and its device, equipment and storage medium | |
US20210150203A1 (en) | Parametric top-view representation of complex road scenes | |
US20080310717A1 (en) | Apparatus and Method for Image Labeling | |
CN109903310A (en) | Method for tracking target, device, computer installation and computer storage medium | |
CN111767927A (en) | Lightweight license plate recognition method and system based on full convolution network | |
CN110705412A (en) | Video target detection method based on motion history image | |
US20220398737A1 (en) | Medical image segmentation method based on u-network | |
CN110852199A (en) | Foreground extraction method based on double-frame coding and decoding model | |
CN111402126A (en) | Video super-resolution method and system based on blocks | |
CN103279473A (en) | Method, system and mobile terminal for searching massive amounts of video content | |
CN105046689A (en) | Method for fast segmenting interactive stereo image based on multilayer graph structure | |
CN113850900A (en) | Method and system for recovering depth map based on image and geometric clue in three-dimensional reconstruction | |
CN114463492A (en) | Adaptive channel attention three-dimensional reconstruction method based on deep learning | |
CN116485860A (en) | Monocular depth prediction algorithm based on multi-scale progressive interaction and aggregation cross attention features | |
CN112686233B (en) | Lane line identification method and device based on lightweight edge calculation | |
Birchfield | Depth and motion discontinuities | |
CN112164065B (en) | Real-time image semantic segmentation method based on lightweight convolutional neural network | |
CN115578436A (en) | Monocular depth prediction method based on multi-level feature parallel interaction fusion | |
CN112598043B (en) | Collaborative saliency detection method based on weak supervised learning | |
CN117333682A (en) | Multi-view three-dimensional reconstruction method based on self-attention mechanism | |
CN112802026A (en) | Deep learning-based real-time traffic scene semantic segmentation method | |
CN117156078B (en) | Video data processing method and device, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20190621 |