CN107808389A - Unsupervised video segmentation method based on deep learning - Google Patents

Unsupervised video segmentation method based on deep learning

Info

Publication number
CN107808389A
CN107808389A (application CN201711004135.5A)
Authority
CN
China
Prior art keywords
layer
convolutional
convolutional layer
segmentation
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201711004135.5A
Other languages
Chinese (zh)
Other versions
CN107808389B (en)
Inventor
宋利
许经纬
解蓉
张文军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority to CN201711004135.5A priority Critical patent/CN107808389B/en
Publication of CN107808389A publication Critical patent/CN107808389A/en
Application granted granted Critical
Publication of CN107808389B publication Critical patent/CN107808389B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06T7/215 Motion-based segmentation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/088 Non-supervised learning, e.g. competitive learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/10 Segmentation; Edge detection
    • G06T7/194 Segmentation; Edge detection involving foreground-background segmentation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10016 Video; Image sequence

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides an unsupervised video segmentation method based on deep learning, including: establishing an encoder-decoder deep neural network, the encoder-decoder deep neural network comprising a static image segmentation stream network, an inter-frame information segmentation stream network and a fusion network. The static image segmentation stream network performs foreground-background segmentation on the current video frame, and the inter-frame information segmentation stream network performs foreground-background segmentation of moving objects on the optical flow field between the current video frame and the next video frame. After the segmentation images output by the static image segmentation stream network and the inter-frame information segmentation stream network are merged by the fusion network, the video segmentation result is obtained. The static image segmentation stream network of the present invention performs high-quality intra-frame segmentation, the inter-frame information segmentation stream network performs high-quality segmentation of the optical flow field, and the two-stream output is improved by the final fusion operation, so that good segmentation results are obtained from the effective two-stream output and fusion operation.

Description

Unsupervised video segmentation method based on deep learning
Technical field
The present invention relates to the technical field of video processing, and in particular to an unsupervised video segmentation method based on deep learning.
Background art
Video segmentation refers to the process of performing foreground-background segmentation on the objects in every frame of a video to obtain a binary map. Its difficulty lies in ensuring both the density of the segmentation in the spatial domain (within a frame) and the continuity of the segmentation in the time domain (across frames). High-quality video segmentation is the basis of video editing, video object recognition and video semantic analysis, and is therefore of great importance.
Existing video segmentation methods can be roughly divided into the following three classes according to their principles:
1) Unsupervised conventional video segmentation methods
Such methods require no manual annotation of key-frame (e.g. first-frame) information; the general procedure is image segmentation plus inter-frame matching of similar blocks to segment a given video automatically. For example, A. Faktor and M. Irani, in "Video segmentation by non-local consensus voting" (BMVC 2014), process every frame to obtain segmentations that may contain objects (object proposals), then perform inter-frame similarity detection based on these segmentations and select the segmentation with the highest similarity as the result. The advantage of such methods is that no manual intervention is needed, but they must compute a large number of intermediate representations of the segmentation, such as superpixels, which consumes a great deal of time and storage space.
2) Semi-supervised conventional video segmentation methods
Such methods generally require manual annotation of key-frame information (e.g. the first frame or first few frames), and then propagate the annotated segmentation to all subsequent frames by inter-frame propagation. For example, Y.-H. Tsai, M.-H. Yang and M. J. Black, in "Video segmentation via object flow" (CVPR 2016), propose a global-graph method: all frames are placed into one graph whose edges represent inter-frame similarity, and the annotated segmentation of the first frame is propagated to subsequent frames by solving the graph. This method has the highest accuracy among conventional methods, because its optimization considers the information of every frame, but the difficulty of solving the global graph greatly increases the time needed to compute the segmentation. This is the common property of such methods: high segmentation accuracy, but also very high computational complexity.
3) Methods based on deep learning
With the development of deep learning, deep neural networks have achieved rather good results in fields such as image classification, segmentation and recognition, but in the video field, limited by the high redundancy of the time domain, they have not yet fully shown their power. S. Caelles, K.-K. Maninis, J. Pont-Tuset, L. Leal-Taixé, D. Cremers and L. Van Gool, in "One-shot video object segmentation" (CVPR 2017), propose that video segmentation only needs per-frame single-frame segmentation, without relying on inter-frame information. They argue that inter-frame information is redundant and unnecessary, and that in many cases no accurate inter-frame information is available as a reference. The scheme they provide is therefore to train a strong image segmentation network and, when segmenting a given video, to accurately annotate the first frame or first few frames, fine-tune (finetune) the large network with these frames, and finally segment the remaining frames of the video with this network. This method has the possibility of over-fitting, and it is not applicable to large-scale video segmentation scenarios.
Summary of the invention
In view of the above defects in the prior art, the object of the present invention is to provide an unsupervised video segmentation method based on deep learning.
The unsupervised video segmentation method based on deep learning provided by the invention includes:
establishing an encoder-decoder deep neural network, the encoder-decoder deep neural network comprising: a static image segmentation stream network, an inter-frame information segmentation stream network and a fusion network; wherein the static image segmentation stream network performs foreground-background segmentation on the current video frame, and the inter-frame information segmentation stream network performs foreground-background segmentation of moving objects on the optical flow field between the current video frame and the next video frame;
after the segmentation images output by the static image segmentation stream network and the inter-frame information segmentation stream network are merged by the fusion network, the video segmentation result is obtained.
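The two-stream-plus-fusion flow described above can be sketched as follows. The three functions are hypothetical stand-ins for the actual sub-networks (the patent publishes no code), and masks are toy 2D lists of foreground probabilities:

```python
# Sketch of the two-stream fusion pipeline (hypothetical stand-in functions).
# Masks are 2D lists of foreground probabilities in [0, 1].

def static_stream(frame):
    # Stand-in: foreground-background segmentation of a single frame.
    return [[0.9 if px > 0.5 else 0.1 for px in row] for row in frame]

def interframe_stream(flow_field):
    # Stand-in: segmentation of moving objects from the optical flow field.
    return [[0.8 if abs(v) > 0.2 else 0.2 for v in row] for row in flow_field]

def fuse(mask_a, mask_b):
    # Stand-in for the fusion network: combine the two per-pixel scores.
    return [[(a + b) / 2 for a, b in zip(ra, rb)]
            for ra, rb in zip(mask_a, mask_b)]

frame = [[0.7, 0.1], [0.6, 0.2]]   # toy grayscale frame
flow = [[0.5, 0.0], [0.4, 0.1]]    # toy flow magnitudes
result = fuse(static_stream(frame), interframe_stream(flow))
```

In the real method each stand-in is a trained encoder-decoder network; the sketch only shows how the per-frame image, the optical flow field, and the fusion step fit together.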
Optionally, establishing the encoder-decoder deep neural network includes:
establishing the static image segmentation stream network, and training the static image segmentation stream network with images for which static image segmentation has been performed;
establishing the inter-frame information segmentation stream network, and training the inter-frame information segmentation stream network with videos for which inter-frame information segmentation has been performed;
training the encoder-decoder deep neural network with fully annotated video segmentation data.
Optionally, the static image segmentation stream network includes an encoder part and a decoder part formed by fully convolutional networks, wherein:
the encoder fully convolutional network includes five cascaded generalized convolution layers and one dilated convolution layer cascaded after the fifth generalized convolution layer; the dilated convolution layer at the sixth level includes four classes of dilation with different scales, each class forming one output path, and the average of the output results of the four output paths is the output result of the encoder;
the decoder fully convolutional network is a fully convolutional network formed by three recurrent convolution layers and three upsampling layers, and outputs an image segmentation result with the same resolution as the input image.
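The averaging of the four dilated output paths at the sixth encoder level can be written directly; the feature values below are made-up numbers, and only the elementwise-mean operation reflects the text:

```python
# Average the outputs of the four parallel dilated-convolution paths.
# Each path produces a feature map of identical shape; the encoder
# output is their elementwise mean.

def average_branches(branches):
    n = len(branches)
    rows, cols = len(branches[0]), len(branches[0][0])
    return [[sum(b[r][c] for b in branches) / n for c in range(cols)]
            for r in range(rows)]

b1 = [[1.0, 2.0]]; b2 = [[3.0, 4.0]]; b3 = [[5.0, 6.0]]; b4 = [[7.0, 8.0]]
encoder_out = average_branches([b1, b2, b3, b4])  # [[4.0, 5.0]]
```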
Optionally, the five generalized convolution layers in the encoder fully convolutional network include a cascaded first generalized convolution layer, second generalized convolution layer, third generalized convolution layer, fourth generalized convolution layer and fifth generalized convolution layer, wherein:
The first generalized convolution layer includes, in order: convolutional layer A11, activation layer, convolutional layer A12, activation layer, pooling layer;
The second generalized convolution layer includes, in order: convolutional layer A21, activation layer, convolutional layer A22, activation layer, pooling layer;
The third generalized convolution layer includes, in order: convolutional layer A31, activation layer, convolutional layer A32, activation layer, convolutional layer A33, activation layer, pooling layer;
The fourth generalized convolution layer includes, in order: convolutional layer A41, activation layer, convolutional layer A42, activation layer, convolutional layer A43, activation layer, pooling layer;
The fifth generalized convolution layer includes, in order: convolutional layer A51, activation layer, convolutional layer A52, activation layer, convolutional layer A53, activation layer, pooling layer;
The dilated convolution layer cascaded after the fifth generalized convolution layer in the encoder fully convolutional network includes four parallel classes of dilated convolution layers, wherein:
The first class of dilated convolution layer includes, in order: a first-scale dilated convolutional layer, activation layer, dropout layer, convolutional layer, activation layer, dropout layer, convolutional layer;
The second class of dilated convolution layer includes, in order: a second-scale dilated convolutional layer, activation layer, dropout layer, convolutional layer, activation layer, dropout layer, convolutional layer;
The third class of dilated convolution layer includes, in order: a third-scale dilated convolutional layer, activation layer, dropout layer, convolutional layer, activation layer, dropout layer, convolutional layer;
The fourth class of dilated convolution layer includes, in order: a fourth-scale dilated convolutional layer, activation layer, dropout layer, convolutional layer, activation layer, dropout layer, convolutional layer.
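The point of using several dilation scales is that each scale sees a different receptive field at the same parameter cost. Assuming 3x3 kernels (the claims do not state the kernel size) and the dilations 6, 12, 18 and 24 given in the embodiment section, the effective kernel extents are:

```python
# Effective kernel extent of a dilated convolution: d * (k - 1) + 1.
# Kernel size 3 is an assumption; the patent only specifies the dilations
# (6, 12, 18, 24) in the embodiment section.

def effective_extent(kernel_size, dilation):
    return dilation * (kernel_size - 1) + 1

extents = {d: effective_extent(3, d) for d in (6, 12, 18, 24)}
# {6: 13, 12: 25, 18: 37, 24: 49}
```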
Optionally, in the decoder fully convolutional network, each upsampling layer is cascaded with a corresponding recurrent convolution layer, wherein:
the first upsampling layer is cascaded with the third recurrent convolution layer; the first upsampling layer performs twofold upsampling on the output of the previous layer, and the third recurrent convolution layer convolves the output of encoder convolutional layer A33 and performs a recurrent convolution operation with the output of the first upsampling layer;
the second upsampling layer is cascaded with the second recurrent convolution layer; the second upsampling layer performs twofold upsampling on the output of the previous layer, and the second recurrent convolution layer convolves the output of encoder convolutional layer A22 and performs a recurrent convolution operation with the output of the second upsampling layer;
the third upsampling layer is cascaded with the first recurrent convolution layer; the third upsampling layer performs twofold upsampling on the output of the previous layer, and the first recurrent convolution layer convolves the output of encoder convolutional layer A12 and performs a recurrent convolution operation with the output of the third upsampling layer.
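A minimal sketch of the twofold upsampling each decoder stage performs; nearest-neighbour repetition is an assumption, since the patent does not specify the interpolation method:

```python
# Twofold (2x) upsampling of a 2D feature map by nearest-neighbour
# repetition; each decoder upsampling layer doubles height and width
# before the result is combined with the encoder skip connection.

def upsample_2x(fmap):
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]  # repeat each column
        out.append(wide)
        out.append(list(wide))                     # repeat each row
    return out

up = upsample_2x([[1, 2], [3, 4]])
# [[1, 1, 2, 2], [1, 1, 2, 2], [3, 3, 4, 4], [3, 3, 4, 4]]
```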
Optionally, training the static image segmentation stream network with images for which static image segmentation has been performed includes:
selecting sample pictures from the ECSSD image segmentation dataset, the MSRA10K image segmentation dataset and the PASCAL VOC 2012 image segmentation dataset;
expanding the sample pictures to a data quantity on the order of 10^4 by random cropping, mirroring, flipping, zooming and affine transformation;
fixing the decoder part and training the encoder part with 60% of the data until the encoder part converges;
training the static image segmentation stream network with 100% of the training data, wherein the encoder part is initialized with the weights at convergence and the decoder part is randomly initialized.
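The two-phase schedule above can be sketched as follows; the dictionaries and the convergence test are hypothetical stand-ins, not a real training loop:

```python
# Two-phase training schedule: (1) freeze the decoder and train the
# encoder on 60% of the data until convergence; (2) train the full
# network on 100% of the data, encoder warm-started, decoder random.

def train_stream(data, encoder, decoder, converged):
    phase1 = data[: int(0.6 * len(data))]      # 60% subset for phase 1
    decoder["frozen"] = True
    while not converged(encoder):
        encoder["steps"] = encoder.get("steps", 0) + 1  # stand-in update
    decoder["frozen"] = False
    decoder["init"] = "random"                 # decoder re-initialized
    encoder["init"] = "converged_weights"      # encoder warm-started
    return {"phase1_size": len(phase1), "phase2_size": len(data)}

log = train_stream(list(range(100)), {}, {},
                   converged=lambda e: e.get("steps", 0) >= 3)
# {'phase1_size': 60, 'phase2_size': 100}
```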
Optionally, the inter-frame information segmentation stream network includes mutually cascaded encoder and decoder parts formed by fully convolutional networks, wherein:
the encoder fully convolutional network includes five cascaded generalized convolution layers and one dilated convolution layer cascaded after the fifth generalized convolution layer; the dilated convolution layer at the sixth level includes four classes of dilation with different scales, each class forming one output path, and the average of the output results of the four output paths is the output result of the encoder;
the five generalized convolution layers in the encoder fully convolutional network include a cascaded first generalized convolution layer, second generalized convolution layer, third generalized convolution layer, fourth generalized convolution layer and fifth generalized convolution layer, wherein:
The first generalized convolution layer includes, in order: convolutional layer B11, activation layer, convolutional layer B12, activation layer, pooling layer;
The second generalized convolution layer includes, in order: convolutional layer B21, activation layer, convolutional layer B22, activation layer, pooling layer;
The third generalized convolution layer includes, in order: convolutional layer B31, activation layer, convolutional layer B32, activation layer, convolutional layer B33, activation layer, pooling layer;
The fourth generalized convolution layer includes, in order: convolutional layer B41, activation layer, convolutional layer B42, activation layer, convolutional layer B43, activation layer, pooling layer;
The fifth generalized convolution layer includes, in order: convolutional layer B51, activation layer, convolutional layer B52, activation layer, convolutional layer B53, activation layer, pooling layer;
The dilated convolution layer cascaded after the fifth generalized convolution layer in the encoder fully convolutional network includes four parallel classes of dilated convolution layers, wherein:
The first class of dilated convolution layer includes, in order: a first-scale dilated convolutional layer, activation layer, dropout layer, convolutional layer, activation layer, dropout layer, convolutional layer;
The second class of dilated convolution layer includes, in order: a second-scale dilated convolutional layer, activation layer, dropout layer, convolutional layer, activation layer, dropout layer, convolutional layer;
The third class of dilated convolution layer includes, in order: a third-scale dilated convolutional layer, activation layer, dropout layer, convolutional layer, activation layer, dropout layer, convolutional layer;
The fourth class of dilated convolution layer includes, in order: a fourth-scale dilated convolutional layer, activation layer, dropout layer, convolutional layer, activation layer, dropout layer, convolutional layer;
the decoder fully convolutional network is a fully convolutional network formed by three recurrent convolution layers and three upsampling layers, and outputs an image segmentation result with the same resolution as the input image; wherein:
in the decoder fully convolutional network, each upsampling layer is cascaded with a corresponding recurrent convolution layer, wherein:
the first upsampling layer is cascaded with the third recurrent convolution layer; the first upsampling layer performs twofold upsampling on the output of the previous layer, and the third recurrent convolution layer convolves the output of encoder convolutional layer B33 and performs a recurrent convolution operation with the output of the first upsampling layer;
the second upsampling layer is cascaded with the second recurrent convolution layer; the second upsampling layer performs twofold upsampling on the output of the previous layer, and the second recurrent convolution layer convolves the output of encoder convolutional layer B22 and performs a recurrent convolution operation with the output of the second upsampling layer;
the third upsampling layer is cascaded with the first recurrent convolution layer; the third upsampling layer performs twofold upsampling on the output of the previous layer, and the first recurrent convolution layer convolves the output of encoder convolutional layer B12 and performs a recurrent convolution operation with the output of the third upsampling layer.
Optionally, training the inter-frame information segmentation stream network with videos for which inter-frame information segmentation has been performed includes:
collecting the training video set VID of the video object detection task in ILSVRC 2015, wherein the training video set VID carries complete object detection bounding boxes;
performing image segmentation on every frame of the video set VID with the trained static image segmentation stream network to obtain foreground-background segmentation results;
computing the optical flow field between the frames of each video and saving the flow information corresponding to each frame as an RGB image;
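Saving a flow field as an RGB image is typically done by mapping flow direction to hue and magnitude to brightness; the patent does not specify the colour coding, so the HSV mapping below is an assumption:

```python
import colorsys
import math

# Map a single flow vector (dx, dy) to an RGB triple: direction -> hue,
# magnitude (clipped at max_mag) -> value. This HSV coding is a common
# convention, assumed here; the patent does not specify the mapping.

def flow_to_rgb(dx, dy, max_mag=20.0):
    mag = math.hypot(dx, dy)
    hue = (math.atan2(dy, dx) + math.pi) / (2 * math.pi)  # hue in [0, 1)
    val = min(mag / max_mag, 1.0)
    r, g, b = colorsys.hsv_to_rgb(hue, 1.0, val)
    return (round(255 * r), round(255 * g), round(255 * b))

px = flow_to_rgb(10.0, 0.0)  # pure horizontal motion at half max magnitude
```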
filtering out correctly segmented results according to a preset screening strategy, with reference to the bounding boxes in the training video set VID, as the initial training images of the inter-frame information segmentation stream network; wherein the screening strategy satisfies the following conditions:
first: the per-frame image segmentation result occupies 75% to 90% of the object detection bounding box;
second: the mean flow magnitude of the computed optical flow RGB image is between 5 and 100;
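The two conditions can be expressed as a single check; using `mask_area / box_area` as the occupancy measure is an illustrative assumption:

```python
# Screening strategy for auto-generated training frames:
# 1) the segmentation must occupy 75% to 90% of the detection bounding box;
# 2) the mean optical-flow magnitude must lie between 5 and 100.

def frame_is_reliable(mask_area, box_area, mean_flow_magnitude):
    occupancy = mask_area / box_area
    good_segmentation = 0.75 <= occupancy <= 0.90
    good_flow = 5.0 <= mean_flow_magnitude <= 100.0
    return good_segmentation and good_flow

keep = frame_is_reliable(mask_area=8000, box_area=10000,
                         mean_flow_magnitude=12.0)   # True: both conditions hold
drop = frame_is_reliable(mask_area=9500, box_area=10000,
                         mean_flow_magnitude=12.0)   # False: occupancy 95% > 90%
```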
expanding the initial training images to a data quantity on the order of 10^4 by random cropping, mirroring, flipping, zooming and affine transformation;
fixing the decoder part and training the encoder part with 60% of the data until the encoder part converges;
training the inter-frame information segmentation stream network with 100% of the training data, wherein the encoder part is initialized with the weights at convergence and the decoder part is randomly initialized.
Optionally, the fusion network includes: a concatenation layer, a convolutional layer, an activation layer, a convolutional layer and an activation layer; wherein:
the concatenation layer connects the static image segmentation stream network and the inter-frame information segmentation stream network, and the output results of the static image segmentation stream network and the inter-frame information segmentation stream network are merged through the convolutional layer, activation layer, convolutional layer and activation layer to obtain the final video segmentation result.
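A minimal sketch of the fusion step: the two single-channel stream outputs are concatenated per pixel and passed through two convolution-plus-activation stages. The 1x1 weights below are illustrative stand-ins for the learned convolutions:

```python
# Fusion network sketch: concatenate the two single-channel stream
# outputs into a 2-channel input, then apply a 1x1 "convolution"
# (per-pixel weighted sum) + ReLU, twice. Weights are illustrative.

def relu(x):
    return max(0.0, x)

def fuse(static_mask, flow_mask, w1=(0.6, 0.4), w2=1.0):
    fused = []
    for row_s, row_f in zip(static_mask, flow_mask):
        row = []
        for s, f in zip(row_s, row_f):
            h = relu(w1[0] * s + w1[1] * f)  # conv1 + activation on [s, f]
            row.append(relu(w2 * h))         # conv2 + activation
        fused.append(row)
    return fused

out = fuse([[1.0, 0.0]], [[1.0, 0.5]])
```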
Optionally, the static image segmentation stream network and the inter-frame information segmentation stream network update their network parameters in real time during the training process.
Compared with the prior art, the present invention has the following beneficial effects:
In the unsupervised video segmentation method based on deep learning provided by the invention, a two-stream video segmentation network comprising a static image segmentation stream network and an inter-frame information segmentation stream network is built, wherein the static image segmentation stream network performs high-quality intra-frame segmentation, the inter-frame information segmentation stream network performs high-quality segmentation of the optical flow field, and the two-stream output is improved by the final fusion operation. When problems such as occlusion and slow motion arise that conventional methods cannot fully solve, the present invention can still obtain good segmentation results from the effective two-stream output and fusion operation.
Brief description of the drawings
Other features, objects and advantages of the invention will become more apparent by reading the detailed description of non-limiting embodiments made with reference to the following drawings:
Fig. 1 is a schematic diagram of the unsupervised video segmentation method based on deep learning of the present invention;
Fig. 2 is a schematic diagram of the principle of the recurrent convolution layer in the decoding network used by the present invention;
Fig. 3 is a schematic diagram of the effect of the proposed screening strategy for generating the dataset needed to train the inter-frame information segmentation stream network;
Fig. 4 compares the results of the embodiment of the present invention with the current best unsupervised and supervised methods, where Fast Object Segmentation in Unconstrained Video (FST) and Video Segmentation via Object Flow (OFL) are respectively the current best unsupervised and semi-supervised methods.
Embodiments
The present invention is described in detail below with reference to specific embodiments. The following embodiments will help those skilled in the art to further understand the present invention, but do not limit the invention in any way. It should be pointed out that, for those of ordinary skill in the art, several changes and improvements can be made without departing from the inventive concept; these all belong to the protection scope of the present invention.
As shown in Fig. 1, this embodiment provides an unsupervised video segmentation method based on deep learning. The specific implementation details are as follows; for parts not described in detail, refer to the summary of the invention.
First, two stream networks are built: the static image segmentation stream and the inter-frame information segmentation stream network. The two networks have identical structure, both based on an encoder-decoder architecture. The encoder part is a fully convolutional network comprising five generalized convolution layers (the first three contain convolutional, pooling and activation layers; the last two have no pooling layer) and a final dilated convolution layer. The last layer is divided into four classes of dilation with different scales, each class forming one path, and the output result of the encoder is the average of these four path outputs. The decoder part is also a fully convolutional network, connected after the encoder, comprising three recurrent convolution layers and three upsampling layers. The final output of each stream has the same size as the input. The details of the encoder and decoder parts are as follows:
The concrete structure of the encoder is as follows (generalized convolution layers 1 to 5 listed below are cascaded; the four paths of the sixth layer are parallel to each other; the sixth layer is cascaded after the fifth):
Generalized convolution layer 1: convolutional layer 1-1 + activation layer + convolutional layer 1-2 + activation layer + pooling layer;
Generalized convolution layer 2: convolutional layer 2-1 + activation layer + convolutional layer 2-2 + activation layer + pooling layer;
Generalized convolution layer 3: convolutional layer 3-1 + activation layer + convolutional layer 3-2 + activation layer + convolutional layer 3-3 + activation layer + pooling layer;
Generalized convolution layer 4: convolutional layer 4-1 + activation layer + convolutional layer 4-2 + activation layer + convolutional layer 4-3 + activation layer + pooling layer;
Generalized convolution layer 5: convolutional layer 5-1 + activation layer + convolutional layer 5-2 + activation layer + convolutional layer 5-3 + activation layer + pooling layer;
Dilated convolutional layer 6-1: dilated convolutional layer (dilation=6) + activation layer + dropout layer + convolutional layer + activation layer + dropout layer + convolutional layer;
Dilated convolutional layer 6-2: dilated convolutional layer (dilation=12) + activation layer + dropout layer + convolutional layer + activation layer + dropout layer + convolutional layer;
Dilated convolutional layer 6-3: dilated convolutional layer (dilation=18) + activation layer + dropout layer + convolutional layer + activation layer + dropout layer + convolutional layer;
Dilated convolutional layer 6-4: dilated convolutional layer (dilation=24) + activation layer + dropout layer + convolutional layer + activation layer + dropout layer + convolutional layer;
The concrete structure of the decoder is as follows (each upsampling layer + recurrent convolution layer pair listed below, from 3 down to 1, is cascaded):
Upsampling layer + recurrent convolution layer 3: the upsampling layer performs twofold upsampling on the output of the previous layer; recurrent convolution layer 3 convolves the output of encoder convolutional layer 3-3 and performs a recurrent convolution operation with the output of the upsampling layer.
Upsampling layer + recurrent convolution layer 2: the upsampling layer performs twofold upsampling on the output of the previous layer; recurrent convolution layer 2 convolves the output of encoder convolutional layer 2-2 and performs a recurrent convolution operation with the output of the upsampling layer.
Upsampling layer + recurrent convolution layer 1: the upsampling layer performs twofold upsampling on the output of the previous layer; recurrent convolution layer 1 convolves the output of encoder convolutional layer 1-2 and performs a recurrent convolution operation with the output of the upsampling layer.
It should be noted that "+" in this embodiment denotes a cascade connection. Subscript 1-1 denotes the first convolutional layer of generalized convolution layer 1, and subscript 1-2 denotes the second convolutional layer of generalized convolution layer 1; subscript i-j denotes the j-th convolutional layer of generalized convolution layer i, where i is 1 to 5 and j is 1 to 3. Subscripts 6-1, 6-2, 6-3 and 6-4 denote the first, second, third and fourth classes of the dilated convolution layer, respectively.
The details of the recurrent convolution layer are shown in Fig. 2. It can be regarded as a convolutional layer with a recurrent connection added along the time dimension. The advantage is that, as training proceeds, each convolutional layer enlarges its local receptive range over the input without increasing the number of parameters, adaptively capturing and fusing local details. As shown in Fig. 2, the number of recurrent steps is set to 3 in the present invention, which balances computational efficiency against the hardware cost of training. After the above networks are built, the two stream networks are trained separately:
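A one-dimensional sketch of a recurrent convolution with 3 recurrent steps: the same kernels are reused at every step, with the feed-forward response added back in, so the receptive field grows without new parameters. The kernels, ReLU nonlinearity, and exact recurrence form are assumptions; only the weight sharing across steps reflects the text:

```python
# 1-D recurrent convolution sketch: h_t = relu(conv(x) + conv_r(h_{t-1})),
# iterated for 3 steps with the SAME kernels at every step, so the
# receptive field grows while the parameter count stays fixed.

def conv1d(signal, kernel):
    k = len(kernel) // 2
    padded = [0.0] * k + list(signal) + [0.0] * k  # zero-pad the borders
    return [sum(kernel[j] * padded[i + j] for j in range(len(kernel)))
            for i in range(len(signal))]

def recurrent_conv(x, kernel, rec_kernel, steps=3):
    feedforward = conv1d(x, kernel)
    h = [max(0.0, v) for v in feedforward]
    for _ in range(steps - 1):                     # same weights every step
        rec = conv1d(h, rec_kernel)
        h = [max(0.0, a + b) for a, b in zip(feedforward, rec)]
    return h

y = recurrent_conv([1.0, 0.0, 0.0, 1.0], [0.5, 1.0, 0.5], [0.25, 0.5, 0.25])
```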
Still-image segmentation stream network: we select three currently published authoritative image segmentation datasets (ECSSD, MSRA 10K and PASCAL VOC 2012), collecting 21,582 pictures from them, and expand the dataset to the order of 10⁴ through operations such as random cropping, mirroring, flipping, zooming and affine transformation, to mitigate the overfitting that may occur during training. When training this network, the decoder part is first fixed and the encoder part is trained with 60% of the data; after the encoder part converges, the whole network is trained with 100% of the training data, where the encoder part is initialized with its weights at convergence and the decoder part is randomly initialized.
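The augmentation operations listed above can be sketched as a paired image/mask transform (zoom and affine warps are omitted here for brevity); the 0.5 probabilities and the 80% crop ratio are illustrative assumptions, not values from the patent:

```python
import numpy as np

def augment(img, mask, rng):
    """Paired augmentation sketch: image and mask get the SAME
    random mirror, flip, and crop so labels stay aligned."""
    if rng.random() < 0.5:                       # horizontal mirror
        img, mask = img[:, ::-1], mask[:, ::-1]
    if rng.random() < 0.5:                       # vertical flip
        img, mask = img[::-1], mask[::-1]
    h, w = mask.shape
    ch, cw = int(h * 0.8), int(w * 0.8)          # crop to 80% (assumed ratio)
    y = rng.integers(0, h - ch + 1)
    x = rng.integers(0, w - cw + 1)
    return img[y:y + ch, x:x + cw], mask[y:y + ch, x:x + cw]
```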
Inter-frame information segmentation stream network: there is currently no published large-scale video segmentation dataset, so the training set must be produced manually. First, the training video set VID of the video object detection task in ILSVRC2015 is collected; these videos carry complete object-detection bounding boxes that precisely indicate object positions. Then the trained still-image segmentation stream network performs image segmentation on every frame of the video set, yielding foreground-background segmentation results. Next, the optical flow field between frames of each video is computed, and the optical flow information corresponding to every frame is saved as an RGB image. Finally, a screening strategy combined with the existing video detection bounding boxes selects qualified frames and their segmentation results as training data for the inter-frame information segmentation stream network.
The screening strategy has two criteria: 1) reliable segmentation result: the image segmentation result of each video frame must cover between 75% and 90% of the object-detection bounding box; 2) reliable optical flow information: the computed optical-flow RGB image must have a mean flow magnitude between 5 and 100, since motion that is too slow or too fast makes the optical flow information very inaccurate.
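The two screening criteria can be expressed directly in code; the mask, box and flow-magnitude representations below are assumptions for illustration, not the patent's data format:

```python
import numpy as np

def passes_screening(seg_mask, det_box, flow_mag):
    """Keep a frame only if (1) the per-frame segmentation covers
    75%-90% of the detection box and (2) the mean optical-flow
    magnitude lies in [5, 100] (too slow or too fast is unreliable).
    seg_mask: HxW binary foreground mask (assumed format)
    det_box:  (y0, x0, y1, x1) detection bounding box (assumed format)
    flow_mag: HxW array of per-pixel flow magnitudes (assumed format)"""
    y0, x0, y1, x1 = det_box
    box_area = (y1 - y0) * (x1 - x0)
    if box_area == 0:
        return False
    coverage = float(seg_mask[y0:y1, x0:x1].sum()) / box_area
    mean_mag = float(flow_mag.mean())
    return 0.75 <= coverage <= 0.90 and 5 <= mean_mag <= 100
```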
Screening finally yields 24,960 samples available as initial training data (Fig. 3 shows some cases that may occur during screening and how they are handled), and the dataset is expanded to the order of 10⁴ through operations such as random cropping, mirroring, flipping, zooming and affine transformation, to mitigate the overfitting that may occur during training. When training this network, the decoder part is first fixed and the encoder part is trained with 60% of the data; after the encoder part converges, the whole network is trained with 100% of the training data, where the encoder part is initialized with its weights at convergence and the decoder part is randomly initialized.
After the two stream networks are trained, the last part, the fusion network, is built. This network comprises one concatenation layer and two generalized convolutional layers (each comprising a convolutional layer and an activation layer); its concrete structure is: concatenation layer, convolutional layer, activation layer, convolutional layer, activation layer. The concatenation layer directly connects the still-image segmentation stream network with the inter-frame information segmentation stream network, and the fused processing of the two outputs serves as the final segmentation result. Together the three networks constitute the complete video segmentation network.
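A minimal sketch of the fusion structure above (concatenate, then conv, activation, conv, activation). The patent's convolutions are spatial; 1x1 channel-mixing convolutions are used here only to keep the example short, and all weights are hypothetical:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)

def conv1x1(x, w, b):
    """1x1 convolution as channel mixing; x is C_in x H x W, w is C_out x C_in."""
    return np.tensordot(w, x, axes=([1], [0])) + b[:, None, None]

def fuse(static_out, flow_out, w1, b1, w2, b2):
    """Fusion-network sketch: concatenate the two stream outputs along
    the channel axis (the concatenation layer), then conv -> relu -> conv -> relu."""
    x = np.concatenate([static_out, flow_out], axis=0)
    x = relu(conv1x1(x, w1, b1))
    return relu(conv1x1(x, w2, b2))
```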
Finally, the fusion network is trained using part of the fully annotated video segmentation dataset. What participates in this training is the whole formed by the trained still-image segmentation stream network, the trained inter-frame information segmentation stream network, and the fusion network to be trained. During training, the parameters of the still-image and inter-frame information segmentation stream networks are fixed and not updated; part of the training set of the fully annotated video segmentation dataset DAVIS is selected to update the parameters of the fusion network until training converges.
At this point, the deep neural network required by the proposed unsupervised video segmentation method is ready. At test time the network is used directly, with no post-processing of any kind. The testing procedure is as follows: first, compute the optical flow field between video frames and process it to obtain the optical-flow RGB image corresponding to every frame; then feed each video frame and its corresponding optical-flow RGB image synchronously into the still-image segmentation stream network obtained in the second step and the inter-frame information segmentation stream network obtained in the fourth step; finally, the output of the fusion network is the final segmentation result.
To demonstrate the advance of the present invention, the proposed method is compared with currently representative unsupervised and semi-supervised methods. The evaluation metric adopted by most current video segmentation methods is Intersection over Union (IoU), defined as:
IoU = 100 × |S ∩ G| / |S ∪ G|
where S is the segmentation result obtained by each algorithm and G is the corresponding ground-truth segmentation. A larger IoU indicates a better segmentation result.
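The IoU metric above, assuming binary masks for S and G, is a few lines of numpy:

```python
import numpy as np

def iou_score(seg, gt):
    """Intersection over Union between a binary segmentation mask S
    and the ground-truth mask G, scaled to [0, 100] as in the text."""
    seg = seg.astype(bool)
    gt = gt.astype(bool)
    union = np.logical_or(seg, gt).sum()
    if union == 0:
        return 100.0          # both empty: treated here as a perfect match
    inter = np.logical_and(seg, gt).sum()
    return 100.0 * inter / union
```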
Table 1
Table 1 compares the IoU results of the proposed method with other methods on the DAVIS and SegTrack v2 datasets. The DAVIS dataset is currently the most authoritative: its pictures are 480p and 1080p, its object categories are numerous, and its annotations are clean; the objects in SegTrack v2 are all very small and the video resolution is comparatively low. The results show that on DAVIS the proposed method surpasses all unsupervised and semi-supervised methods, improving on the best unsupervised method FST by 14% and on the best semi-supervised method by nearly two points. Note that semi-supervised methods require accurate annotation of the first frame or of several preceding frames, and their processing time is usually long: the OFL method needs close to 2 minutes to process one 480p picture, while the proposed method needs only 0.2 seconds. On SegTrack v2 the proposed method is slightly worse than OFL, possibly because: (1) the video resolution is low and the objects are all small, which hinders the deep-learning method of the present invention from capturing detailed information; (2) OFL is a parametric method whose parameters were optimized for each video in the experiments to obtain the best result, whereas the method of the present invention performed no such per-domain optimization: the network tested on all videos is the same pre-trained one. Fig. 4 gives a visual comparison of the segmentation results of the proposed method against the FST and OFL methods; the proposed method preserves details best and also has the highest segmentation accuracy.
The specific embodiments of the present invention have been described above. It is to be understood that the invention is not limited to the above particular implementations; those skilled in the art can make various changes or modifications within the scope of the claims without affecting the substantive content of the invention. Where no conflict arises, the embodiments of this application and the features in the embodiments may be combined with one another arbitrarily.

Claims (10)

  1. An unsupervised video segmentation method based on deep learning, characterized by comprising:
    establishing an encoder-decoder deep neural network, the encoder-decoder deep neural network comprising: a still-image segmentation stream network, an inter-frame information segmentation stream network and a fusion network; wherein the still-image segmentation stream network performs foreground-background segmentation processing on the current video frame, and the inter-frame information segmentation stream network performs moving-object foreground-background segmentation on the optical flow field information between the current video frame and the next video frame;
    fusing, by the fusion network, the segmentation images output by the still-image segmentation stream network and the inter-frame information segmentation stream network to obtain the video segmentation result.
  2. The unsupervised video segmentation method based on deep learning according to claim 1, characterized in that establishing the encoder-decoder deep neural network comprises:
    establishing the still-image segmentation stream network, and training it with images on which still-image segmentation has been performed;
    establishing the inter-frame information segmentation stream network, and training it with videos on which inter-frame information segmentation has been performed;
    training the encoder-decoder deep neural network with fully annotated video segmentation data.
  3. The unsupervised video segmentation method based on deep learning according to claim 2, characterized in that the still-image segmentation stream network comprises an encoder part and a decoder part formed of fully convolutional networks, wherein:
    the fully convolutional network of the encoder part comprises five cascaded generalized convolutional layers and one dilated convolutional layer cascaded with the fifth generalized convolutional layer; the dilated convolutional layer at the sixth level comprises dilations of four different scales, each class forming one output path, and the average of the output results of the four output paths is the output result of the encoder part;
    the fully convolutional network of the decoder part is a fully convolutional network formed of three recurrent convolutional layers and three up-sampling layers, and outputs a picture segmentation result whose resolution is consistent with the input picture.
  4. The unsupervised video segmentation method based on deep learning according to claim 3, characterized in that the five generalized convolutional layers in the fully convolutional network of the encoder part comprise a cascaded first generalized convolutional layer, second generalized convolutional layer, third generalized convolutional layer, fourth generalized convolutional layer and fifth generalized convolutional layer, wherein:
    the first generalized convolutional layer comprises, in order: convolutional layer A11, an activation layer, convolutional layer A12, an activation layer, and a pooling layer;
    the second generalized convolutional layer comprises, in order: convolutional layer A21, an activation layer, convolutional layer A22, an activation layer, and a pooling layer;
    the third generalized convolutional layer comprises, in order: convolutional layer A31, an activation layer, convolutional layer A32, an activation layer, convolutional layer A33, an activation layer, and a pooling layer;
    the fourth generalized convolutional layer comprises, in order: convolutional layer A41, an activation layer, convolutional layer A42, an activation layer, convolutional layer A43, an activation layer, and a pooling layer;
    the fifth generalized convolutional layer comprises, in order: convolutional layer A51, an activation layer, convolutional layer A52, an activation layer, convolutional layer A53, an activation layer, and a pooling layer;
    the dilated convolutional layer cascaded with the fifth generalized convolutional layer in the fully convolutional network of the encoder part comprises four classes of dilated convolutional layers in parallel, wherein:
    the first class of dilated convolutional layer comprises, in order: a first-scale dilated convolutional layer, an activation layer, a dropout layer, a convolutional layer, an activation layer, a dropout layer, and a convolutional layer;
    the second class of dilated convolutional layer comprises, in order: a second-scale dilated convolutional layer, an activation layer, a dropout layer, a convolutional layer, an activation layer, a dropout layer, and a convolutional layer;
    the third class of dilated convolutional layer comprises, in order: a third-scale dilated convolutional layer, an activation layer, a dropout layer, a convolutional layer, an activation layer, a dropout layer, and a convolutional layer;
    the fourth class of dilated convolutional layer comprises, in order: a fourth-scale dilated convolutional layer, an activation layer, a dropout layer, a convolutional layer, an activation layer, a dropout layer, and a convolutional layer.
  5. The unsupervised video segmentation method based on deep learning according to claim 4, characterized in that in the fully convolutional network of the decoder part each up-sampling layer is cascaded with a corresponding recurrent convolutional layer, wherein:
    the first up-sampling layer is cascaded with the third recurrent convolutional layer; the first up-sampling layer performs a twofold up-sampling of the output of the previous layer; the third recurrent convolutional layer convolves the output of encoder convolutional layer A33 and performs a recurrent convolution operation with the output of the first up-sampling layer;
    the second up-sampling layer is cascaded with the second recurrent convolutional layer; the second up-sampling layer performs a twofold up-sampling of the output of the previous layer; the second recurrent convolutional layer convolves the output of encoder convolutional layer A22 and performs a recurrent convolution operation with the output of the second up-sampling layer;
    the third up-sampling layer is cascaded with the first recurrent convolutional layer; the third up-sampling layer performs a twofold up-sampling of the output of the previous layer; the first recurrent convolutional layer convolves the output of encoder convolutional layer A12 and performs a recurrent convolution operation with the output of the third up-sampling layer.
  6. The unsupervised video segmentation method based on deep learning according to claim 3, characterized in that training the still-image segmentation stream network with images on which still-image segmentation has been performed comprises:
    selecting sample pictures from the ECSSD, MSRA 10K and PASCAL VOC 2012 image segmentation datasets;
    expanding the data quantity to the order of 10⁴ by applying random cropping, mirroring, flipping, zooming and affine transformation to the sample pictures;
    fixing the decoder part and training the encoder part with 60% of the data until the encoder part converges;
    training the still-image segmentation stream network with 100% of the training data, wherein the encoder part is initialized with its weights at convergence and the decoder part is randomly initialized.
  7. The unsupervised video segmentation method based on deep learning according to claim 2, characterized in that the inter-frame information segmentation stream network comprises a mutually cascaded encoder part and decoder part formed of fully convolutional networks, wherein:
    the fully convolutional network of the encoder part comprises five cascaded generalized convolutional layers and one dilated convolutional layer cascaded with the fifth generalized convolutional layer; the dilated convolutional layer at the sixth level comprises dilations of four different scales, each class forming one output path, and the average of the output results of the four output paths is the output result of the encoder part;
    the five generalized convolutional layers in the fully convolutional network of the encoder part comprise a cascaded first generalized convolutional layer, second generalized convolutional layer, third generalized convolutional layer, fourth generalized convolutional layer and fifth generalized convolutional layer, wherein:
    the first generalized convolutional layer comprises, in order: convolutional layer B11, an activation layer, convolutional layer B12, an activation layer, and a pooling layer;
    the second generalized convolutional layer comprises, in order: convolutional layer B21, an activation layer, convolutional layer B22, an activation layer, and a pooling layer;
    the third generalized convolutional layer comprises, in order: convolutional layer B31, an activation layer, convolutional layer B32, an activation layer, convolutional layer B33, an activation layer, and a pooling layer;
    the fourth generalized convolutional layer comprises, in order: convolutional layer B41, an activation layer, convolutional layer B42, an activation layer, convolutional layer B43, an activation layer, and a pooling layer;
    the fifth generalized convolutional layer comprises, in order: convolutional layer B51, an activation layer, convolutional layer B52, an activation layer, convolutional layer B53, an activation layer, and a pooling layer;
    the dilated convolutional layer cascaded with the fifth generalized convolutional layer in the fully convolutional network of the encoder part comprises four classes of dilated convolutional layers in parallel, wherein:
    the first class of dilated convolutional layer comprises, in order: a first-scale dilated convolutional layer, an activation layer, a dropout layer, a convolutional layer, an activation layer, a dropout layer, and a convolutional layer;
    the second class of dilated convolutional layer comprises, in order: a second-scale dilated convolutional layer, an activation layer, a dropout layer, a convolutional layer, an activation layer, a dropout layer, and a convolutional layer;
    the third class of dilated convolutional layer comprises, in order: a third-scale dilated convolutional layer, an activation layer, a dropout layer, a convolutional layer, an activation layer, a dropout layer, and a convolutional layer;
    the fourth class of dilated convolutional layer comprises, in order: a fourth-scale dilated convolutional layer, an activation layer, a dropout layer, a convolutional layer, an activation layer, a dropout layer, and a convolutional layer;
    the fully convolutional network of the decoder part is a fully convolutional network formed of three recurrent convolutional layers and three up-sampling layers, and outputs a picture segmentation result whose resolution is consistent with the input picture; wherein:
    in the fully convolutional network of the decoder part, each up-sampling layer is cascaded with a corresponding recurrent convolutional layer, wherein:
    the first up-sampling layer is cascaded with the third recurrent convolutional layer; the first up-sampling layer performs a twofold up-sampling of the output of the previous layer; the third recurrent convolutional layer convolves the output of encoder convolutional layer B33 and performs a recurrent convolution operation with the output of the first up-sampling layer;
    the second up-sampling layer is cascaded with the second recurrent convolutional layer; the second up-sampling layer performs a twofold up-sampling of the output of the previous layer; the second recurrent convolutional layer convolves the output of encoder convolutional layer B22 and performs a recurrent convolution operation with the output of the second up-sampling layer;
    the third up-sampling layer is cascaded with the first recurrent convolutional layer; the third up-sampling layer performs a twofold up-sampling of the output of the previous layer; the first recurrent convolutional layer convolves the output of encoder convolutional layer B12 and performs a recurrent convolution operation with the output of the third up-sampling layer.
  8. The unsupervised video segmentation method based on deep learning according to claim 7, characterized in that training the inter-frame information segmentation stream network with videos on which inter-frame information segmentation has been performed comprises:
    collecting the training video set VID of the video object detection task in ILSVRC2015, wherein the training video set VID carries complete object-detection bounding boxes;
    performing image segmentation on every frame of the video set VID with the trained still-image segmentation stream network to obtain foreground-background segmentation results;
    computing the optical flow field between frames of each video and saving the optical flow information corresponding to every frame as an RGB image;
    screening out correctly segmented image segmentation results, according to a preset screening strategy combined with the bounding boxes in the training video set VID, as initial training images for the inter-frame information segmentation stream network, wherein the screening strategy satisfies the following conditions:
    first: the per-frame image segmentation result covers 75% to 90% of the object-detection bounding box;
    second: the mean optical-flow magnitude of the computed optical-flow RGB image is between 5 and 100;
    expanding the initial training images to a data quantity of the order of 10⁴ by random cropping, mirroring, flipping, zooming and affine transformation;
    fixing the decoder part and training the encoder part with 60% of the data until the encoder part converges;
    training the inter-frame information segmentation stream network with 100% of the training data, wherein the encoder part is initialized with its weights at convergence and the decoder part is randomly initialized.
  9. The unsupervised video segmentation method based on deep learning according to claim 1, characterized in that the fusion network comprises: a concatenation layer, a convolutional layer, an activation layer, a convolutional layer and an activation layer, wherein:
    the concatenation layer connects the still-image segmentation stream network and the inter-frame information segmentation stream network, and the output results of the still-image segmentation stream network and the inter-frame information segmentation stream network are fused through the convolutional layer, activation layer, convolutional layer and activation layer to obtain the final video segmentation result.
  10. The unsupervised video segmentation method based on deep learning according to any one of claims 2-9, characterized in that the still-image segmentation stream network and the inter-frame information segmentation stream network update their network parameters in real time during the training process.
CN201711004135.5A 2017-10-24 2017-10-24 Unsupervised video segmentation method based on deep learning Active CN107808389B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711004135.5A CN107808389B (en) 2017-10-24 2017-10-24 Unsupervised video segmentation method based on deep learning


Publications (2)

Publication Number Publication Date
CN107808389A true CN107808389A (en) 2018-03-16
CN107808389B CN107808389B (en) 2020-04-17

Family

ID=61585461

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711004135.5A Active CN107808389B (en) 2017-10-24 2017-10-24 Unsupervised video segmentation method based on deep learning

Country Status (1)

Country Link
CN (1) CN107808389B (en)

Cited By (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108712630A (en) * 2018-04-19 2018-10-26 安凯(广州)微电子技术有限公司 A kind of internet camera system and its implementation based on deep learning
CN108805898A (en) * 2018-05-31 2018-11-13 北京字节跳动网络技术有限公司 Method of video image processing and device
CN108876792A (en) * 2018-04-13 2018-11-23 北京迈格威科技有限公司 Semantic segmentation methods, devices and systems and storage medium
CN109034162A (en) * 2018-07-13 2018-12-18 南京邮电大学 A kind of image, semantic dividing method
CN109068174A (en) * 2018-09-12 2018-12-21 上海交通大学 Video frame rate upconversion method and system based on cyclic convolution neural network
CN109086807A (en) * 2018-07-16 2018-12-25 哈尔滨工程大学 A kind of semi-supervised light stream learning method stacking network based on empty convolution
CN109118490A (en) * 2018-06-28 2019-01-01 厦门美图之家科技有限公司 A kind of image segmentation network generation method and image partition method
CN109785327A (en) * 2019-01-18 2019-05-21 中山大学 The video moving object dividing method of the apparent information of fusion and motion information
CN109961095A (en) * 2019-03-15 2019-07-02 深圳大学 Image labeling system and mask method based on non-supervisory deep learning
CN110147763A (en) * 2019-05-20 2019-08-20 哈尔滨工业大学 Video semanteme dividing method based on convolutional neural networks
CN110246142A (en) * 2019-06-14 2019-09-17 深圳前海达闼云端智能科技有限公司 A kind of method, terminal and readable storage medium storing program for executing detecting barrier
WO2019218826A1 (en) * 2018-05-17 2019-11-21 腾讯科技(深圳)有限公司 Image processing method and device, computer apparatus, and storage medium
CN110555805A (en) * 2018-05-31 2019-12-10 杭州海康威视数字技术股份有限公司 Image processing method, device, equipment and storage medium
CN111275518A (en) * 2020-01-15 2020-06-12 中山大学 Video virtual fitting method and device based on mixed optical flow
US10762629B1 (en) 2019-11-14 2020-09-01 SegAI LLC Segmenting medical images
CN112016406A (en) * 2020-08-07 2020-12-01 青岛科技大学 Video key frame extraction method based on full convolution network
CN112085760A (en) * 2020-09-04 2020-12-15 厦门大学 Prospect segmentation method of laparoscopic surgery video
CN112784750A (en) * 2021-01-22 2021-05-11 清华大学 Fast video object segmentation method and device based on pixel and region feature matching
WO2021139625A1 (en) * 2020-01-07 2021-07-15 广州虎牙科技有限公司 Image processing method, image segmentation model training method and related apparatus
CN113469146A (en) * 2021-09-02 2021-10-01 深圳市海清视讯科技有限公司 Target detection method and device
CN114358144A (en) * 2021-12-16 2022-04-15 西南交通大学 Image segmentation quality evaluation method
US11423544B1 (en) 2019-11-14 2022-08-23 Seg AI LLC Segmenting medical images


Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1532812A2 (en) * 2002-04-26 2005-05-25 The Trustees Of Columbia University In The City Of New York Method and system for optimal video transcoding based on utility function descriptors
CN106204597A (en) * 2016-07-13 2016-12-07 西北工业大学 Video object segmentation method based on self-paced weakly supervised learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
KATERINA FRAGKIADAKI et al.: "Learning to Segment Moving Objects in Videos", CVPR 2015 *

Cited By (35)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108876792B (en) * 2018-04-13 2020-11-10 北京迈格威科技有限公司 Semantic segmentation method, device and system and storage medium
CN108876792A (en) * 2018-04-13 2018-11-23 北京迈格威科技有限公司 Semantic segmentation methods, devices and systems and storage medium
CN108712630A (en) * 2018-04-19 2018-10-26 安凯(广州)微电子技术有限公司 A kind of internet camera system and its implementation based on deep learning
WO2019218826A1 (en) * 2018-05-17 2019-11-21 腾讯科技(深圳)有限公司 Image processing method and device, computer apparatus, and storage medium
US11373305B2 (en) 2018-05-17 2022-06-28 Tencent Technology (Shenzhen) Company Limited Image processing method and device, computer apparatus, and storage medium
CN110555805A (en) * 2018-05-31 2019-12-10 杭州海康威视数字技术股份有限公司 Image processing method, device, equipment and storage medium
CN110555805B (en) * 2018-05-31 2022-05-31 杭州海康威视数字技术股份有限公司 Image processing method, device, equipment and storage medium
CN108805898B (en) * 2018-05-31 2020-10-16 北京字节跳动网络技术有限公司 Video image processing method and device
CN108805898A (en) * 2018-05-31 2018-11-13 北京字节跳动网络技术有限公司 Method of video image processing and device
CN109118490A (en) * 2018-06-28 2019-01-01 厦门美图之家科技有限公司 A kind of image segmentation network generation method and image partition method
CN109118490B (en) * 2018-06-28 2021-02-26 厦门美图之家科技有限公司 Image segmentation network generation method and image segmentation method
CN109034162B (en) * 2018-07-13 2022-07-26 南京邮电大学 Image semantic segmentation method
CN109034162A (en) * 2018-07-13 2018-12-18 南京邮电大学 A kind of image, semantic dividing method
CN109086807A (en) * 2018-07-16 2018-12-25 哈尔滨工程大学 A kind of semi-supervised light stream learning method stacking network based on empty convolution
CN109086807B (en) * 2018-07-16 2022-03-18 哈尔滨工程大学 Semi-supervised optical flow learning method based on void convolution stacking network
CN109068174A (en) * 2018-09-12 2018-12-21 上海交通大学 Video frame rate upconversion method and system based on cyclic convolution neural network
CN109785327A (en) * 2019-01-18 2019-05-21 中山大学 The video moving object dividing method of the apparent information of fusion and motion information
CN109961095A (en) * 2019-03-15 2019-07-02 深圳大学 Image labeling system and mask method based on non-supervisory deep learning
CN110147763B (en) * 2019-05-20 2023-02-24 哈尔滨工业大学 Video semantic segmentation method based on convolutional neural network
CN110147763A (en) * 2019-05-20 2019-08-20 哈尔滨工业大学 Video semanteme dividing method based on convolutional neural networks
CN110246142A (en) * 2019-06-14 2019-09-17 深圳前海达闼云端智能科技有限公司 A kind of method, terminal and readable storage medium storing program for executing detecting barrier
US11423544B1 (en) 2019-11-14 2022-08-23 Seg AI LLC Segmenting medical images
US10762629B1 (en) 2019-11-14 2020-09-01 SegAI LLC Segmenting medical images
WO2021139625A1 (en) * 2020-01-07 2021-07-15 广州虎牙科技有限公司 Image processing method, image segmentation model training method and related apparatus
CN111275518B (en) * 2020-01-15 2023-04-21 中山大学 Video virtual fitting method and device based on mixed optical flow
CN111275518A (en) * 2020-01-15 2020-06-12 中山大学 Video virtual fitting method and device based on mixed optical flow
CN112016406B (en) * 2020-08-07 2022-12-02 青岛科技大学 Video key frame extraction method based on full convolution network
CN112016406A (en) * 2020-08-07 2020-12-01 青岛科技大学 Video key frame extraction method based on full convolution network
CN112085760A (en) * 2020-09-04 2020-12-15 厦门大学 Prospect segmentation method of laparoscopic surgery video
CN112085760B (en) * 2020-09-04 2024-04-26 厦门大学 Foreground segmentation method for laparoscopic surgery video
CN112784750B (en) * 2021-01-22 2022-08-09 清华大学 Fast video object segmentation method and device based on pixel and region feature matching
CN112784750A (en) * 2021-01-22 2021-05-11 清华大学 Fast video object segmentation method and device based on pixel and region feature matching
CN113469146A (en) * 2021-09-02 2021-10-01 深圳市海清视讯科技有限公司 Target detection method and device
CN114358144A (en) * 2021-12-16 2022-04-15 西南交通大学 Image segmentation quality evaluation method
CN114358144B (en) * 2021-12-16 2023-09-26 西南交通大学 Image segmentation quality assessment method

Also Published As

Publication number Publication date
CN107808389B (en) 2020-04-17

Similar Documents

Publication Publication Date Title
CN107808389A (en) Unsupervised video segmentation method based on deep learning
CN105869178B (en) Unsupervised segmentation method for complex-target dynamic scenes based on multi-scale combined-feature convex optimization
CN107292247A (en) Human behavior recognition method and device based on residual networks
WO2022011681A1 (en) Method for fusing knowledge graph based on iterative completion
CN103824272B (en) Face super-resolution reconstruction method based on k-nearest-neighbor re-identification
CN110263603A (en) Face recognition method and device based on center loss and a residual visual simulation network
Zhai et al. FPANet: feature pyramid attention network for crowd counting
Liu et al. Multi-stage context refinement network for semantic segmentation
CN106504219B (en) Road enhancement method for high-resolution remote sensing images based on constrained-path morphology
CN113888399B (en) Face age synthesis method based on style fusion and domain selection structure
CN111709443A (en) Calligraphy character style classification method based on a rotation-invariant convolutional neural network
Chen et al. Coupled Global–Local object detection for large VHR aerial images
Cheng et al. A survey on image semantic segmentation using deep learning techniques
CN106846377A (en) Target tracking algorithm based on color attributes and active feature extraction
CN114463340A (en) Edge information guided agile remote sensing image semantic segmentation method
Zhou et al. An enhancement model based on dense atrous and inception convolution for image semantic segmentation
Mo et al. Attention-guided collaborative counting
Fu et al. Cooperative attention generative adversarial network for unsupervised domain adaptation
Zhao et al. Probability-based channel pruning for depthwise separable convolutional networks
Zhang et al. Multi-granularity semantic alignment distillation learning for remote sensing image semantic segmentation
Zhang et al. SiamMBFAN: Siamese tracker with multi-branch feature aggregation network
Li et al. HNSR: Highway networks based deep convolutional neural networks model for single image super-resolution
Cai et al. Rgb road scene material segmentation
Fang et al. Collaborative learning in bounding box regression for object detection
CN113129237A (en) Depth image deblurring method based on multi-scale fusion coding network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant