CN107808389B - Unsupervised video segmentation method based on deep learning - Google Patents

Unsupervised video segmentation method based on deep learning

Info

Publication number
CN107808389B
CN107808389B (application CN201711004135.5A)
Authority
CN
China
Prior art keywords
layer
convolutional
network
segmentation
convolutional layer
Prior art date
Legal status
Active
Application number
CN201711004135.5A
Other languages
Chinese (zh)
Other versions
CN107808389A (en)
Inventor
宋利
许经纬
解蓉
张文军
Current Assignee
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority to CN201711004135.5A priority Critical patent/CN107808389B/en
Publication of CN107808389A publication Critical patent/CN107808389A/en
Application granted granted Critical
Publication of CN107808389B publication Critical patent/CN107808389B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/215Motion-based segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/088Non-supervised learning, e.g. competitive learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/194Segmentation; Edge detection involving foreground-background segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence

Abstract

The invention provides an unsupervised video segmentation method based on deep learning, which comprises the following steps: establishing an encoding-decoding deep neural network comprising a static image segmentation flow network, an interframe information segmentation flow network and a fusion network; the static image segmentation flow network performs foreground and background segmentation on the current video frame, and the interframe information segmentation flow network performs foreground and background segmentation of moving objects on the optical flow field information between the current video frame and the next video frame; the segmentation images output by the static image segmentation flow network and the interframe information segmentation flow network are then fused by the fusion network to obtain the video segmentation result. The static image segmentation flow network provides high-quality intra-frame segmentation, the interframe information segmentation flow network provides high-quality segmentation of optical flow field information, and the two outputs are combined in a final fusion operation, so that an improved segmentation result is obtained from the two effective outputs and the fusion operation.

Description

Unsupervised video segmentation method based on deep learning
Technical Field
The invention relates to the technical field of video processing, in particular to an unsupervised video segmentation method based on deep learning.
Background
Video segmentation refers to the process of separating the foreground object from the background in every frame of a video to obtain a binary mask. Its difficulty lies in ensuring both the accuracy of spatial-domain (intra-frame) segmentation and the continuity of temporal-domain (inter-frame) segmentation. High-quality video segmentation is the basis of video editing, video object recognition and video semantic analysis, and is therefore of great importance.
Existing video segmentation methods can be broadly classified into the following three categories according to their principles:
1) Traditional unsupervised video segmentation methods
These methods do not require manual annotation of key frames (such as the first frame); the general pipeline is image segmentation followed by inter-frame similar-region matching to segment a given video automatically. For example, in "Video Segmentation by Non-Local Consensus Voting", published at BMVC 2014 by A. Faktor and M. Irani, each frame is first processed to obtain a set of candidate segmentations that may contain objects (object proposals), inter-frame similarity detection is then performed on these candidates, and the candidates with the highest similarity are selected as the segmentation result. The advantage of such methods is that no manual intervention is needed, but a large number of intermediate representations such as superpixels must be computed, which consumes a great deal of time and storage space.
2) Traditional semi-supervised video segmentation methods
These methods generally require manual annotation of one or more key frames (such as the first frame or the first few frames), after which the annotated segmentation information is propagated to all subsequent frames. For example, in "Video Segmentation via Object Flow", published at CVPR 2016, Y.-H. Tsai, M.-H. Yang and M. J. Black propose a global-graph approach that places all frames into one graph whose edges represent inter-frame similarity; the annotation of the first frame is then propagated to the following frames by solving the graph. This is the most accurate of the traditional methods because the information of every frame is considered in the optimization, but the difficulty of solving the global graph greatly increases the segmentation time. This is also a common property of this class of methods: high segmentation accuracy comes with high computational complexity.
3) Deep-learning-based methods
With the development of deep learning, deep neural networks have achieved good results in image classification, segmentation and recognition, but in the video domain they are limited by the high temporal redundancy and have not yet fully shown their strength. In "One-Shot Video Object Segmentation", published at CVPR 2017 by S. Caelles, K.-K. Maninis, J. Pont-Tuset, L. Leal-Taixé, D. Cremers and L. Van Gool, it is proposed that video segmentation only requires single-frame segmentation of each video frame and does not depend on inter-frame information. The authors consider inter-frame information redundant, unnecessary and often inaccurate; their solution is to train a strong image segmentation network, accurately annotate the first frame (or first few frames) of a given video, fine-tune (finetune) the large network on these frames, and finally use this network to segment the remaining frames of the video. This method risks overfitting and cannot be applied to large-scale video segmentation scenarios.
Disclosure of Invention
Aiming at the defects in the prior art, the invention aims to provide an unsupervised video segmentation method based on deep learning.
The unsupervised video segmentation method based on deep learning provided by the invention comprises the following steps:
establishing a codec deep neural network, the codec deep neural network comprising: a static image segmentation flow network, an interframe information segmentation flow network and a fusion network; the static image segmentation flow network is used for performing foreground and background segmentation processing on a current video frame, and the inter-frame information segmentation flow network is used for performing foreground and background segmentation on optical flow field information between the current video frame and a next video frame;
and fusing the segmentation images output by the static image segmentation flow network and the interframe information segmentation flow network through the fusion network to obtain a video segmentation result.
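For illustration, the following is a minimal sketch (in PyTorch) of how the three sub-networks described above fit together; the module and argument names are assumptions made for this sketch and are not taken from the patent, and the internal structure of each sub-network is detailed later.

import torch.nn as nn

class DualStreamSegmentationNet(nn.Module):
    # Wrapper combining the two segmentation streams and the fusion network.
    def __init__(self, appearance_stream, motion_stream, fusion_net):
        super().__init__()
        self.appearance_stream = appearance_stream  # static image segmentation stream
        self.motion_stream = motion_stream          # inter-frame information segmentation stream
        self.fusion_net = fusion_net                # fusion network

    def forward(self, frame, flow_rgb):
        # frame: current video frame; flow_rgb: optical flow field between the
        # current frame and the next frame, stored as an RGB image.
        appearance_mask = self.appearance_stream(frame)  # intra-frame foreground/background mask
        motion_mask = self.motion_stream(flow_rgb)       # mask from optical flow field information
        return self.fusion_net(appearance_mask, motion_mask)  # fused video segmentation result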
Optionally, establishing the codec deep neural network includes:
establishing a static image segmentation flow network, and training the static image segmentation flow network through an image subjected to static image segmentation;
establishing an interframe information segmentation flow network, and training the interframe information segmentation flow network through a video subjected to interframe information segmentation;
and training the coding and decoding deep neural network by using the fully labeled video segmentation data.
Optionally, the static image segmentation flow network comprises: an encoding part and a decoding part which are formed by a full convolution network, wherein,
the full convolutional network of the encoding part comprises: five cascaded generalized convolutional layers and an expanded convolutional layer located at the sixth layer, wherein the expanded convolutional layer at the sixth layer comprises four types of expansion with different scales, each type forms an output path, and the average value of the output results of the four output paths is the output result of the encoding part;
the full convolutional network of the decoding part is: a full convolutional network consisting of three cyclic convolutional layers and three upsampling layers; and the full convolutional network of the decoding part is used for outputting a picture segmentation result consistent with the resolution of the input picture.
Optionally, the five generalized convolutional layers in the full convolutional network of the coding part comprise a first generalized convolutional layer, a second generalized convolutional layer, a third generalized convolutional layer, a fourth generalized convolutional layer, and a fifth generalized convolutional layer which are cascaded, wherein:
the first generalized convolutional layer sequentially comprises: convolutional layer A11, active layer, convolutional layer A12, active layer, pooling layer;
the second generalized convolutional layer sequentially includes: convolutional layer A21, active layer, convolutional layer A22, active layer, pooling layer;
the third generalized convolutional layer sequentially includes: convolutional layer A31, active layer, convolutional layer A32, active layer, convolutional layer A33, active layer, pooling layer;
the fourth generalized convolutional layer sequentially includes: convolutional layer A41, active layer, convolutional layer A42, active layer, convolutional layer A43, active layer, pooling layer;
the fifth generalized convolutional layer sequentially comprises: convolutional layer A51, active layer, convolutional layer A52, active layer, convolutional layer A53, active layer, pooling layer;
the expanded convolutional layer cascaded with the fifth generalized convolutional layer in the full convolutional network of the coding part comprises: four types of expanded convolutional layers in parallel, wherein:
the first type of expanded convolutional layer sequentially comprises: a first scale expansion convolutional layer, an active layer, a random discard layer, a convolutional layer;
the second type of expanded convolutional layer sequentially comprises: a second scale expansion convolutional layer, an active layer, a random discard layer, a convolutional layer;
the third type of expansion convolution layer comprises in sequence: a third scale expansion convolutional layer, an active layer, a random discard layer, a convolutional layer;
the fourth type of expansion convolution layer comprises in sequence: a fourth scale expansion convolutional layer, an active layer, a random discard layer, a convolutional layer.
Optionally, in the full convolutional network of the decoding part, each upsampling layer is cascaded with a corresponding cyclic convolutional layer, wherein:
a first upsampling layer is cascaded with a third cyclic convolution layer, and the first upsampling layer is used for performing double upsampling on the output of the previous layer; the third cyclic convolution layer is used for performing convolution processing on the output of the coding part convolution layer A33 and performing cyclic convolution operation on the output of the first up-sampling layer;
the second up-sampling layer is cascaded with the second circulating convolution layer, and the second up-sampling layer is used for performing double up-sampling on the output of the previous layer; the second cyclic convolution layer is used for performing convolution processing on the output of the coding part convolution layer A22 and performing cyclic convolution operation on the output of the second up-sampling layer;
a third upsampling layer is cascaded with the first cyclic convolution layer, and the third upsampling layer is used for performing double upsampling on the output of the previous layer; the first cyclic convolution layer is used for performing convolution processing on the output of the encoding-part convolutional layer A12 and performing a cyclic convolution operation with the output of the third upsampling layer.
Optionally, the training of the static image segmentation flow network by the image that has been subjected to static image segmentation includes:
selecting sample pictures in an ECSSD image segmentation data set, an MSRA 10K image segmentation data set and a PASCAL VOC 2012 image segmentation data set;
expanding the sample pictures to the order of 10^4 by random cropping, mirroring, flipping, scaling and affine transformation;
fixing the decoding part, and training the coding part by using 60% of data until the coding part converges;
training the static image segmentation flow network using 100% of the training data, wherein the encoding part is initialized with the weights obtained at convergence and the decoding part is randomly initialized. A sketch of this two-stage training schedule is given below.
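For illustration, a hedged sketch of this two-stage training schedule follows (in PyTorch). The existence of a decoder attribute on the stream model, the dataset interface, the optimizer and its hyper-parameters, and the way the 60% subset is drawn are all assumptions of this sketch, not details specified above.

import torch
from torch.utils.data import DataLoader, Subset

def train_stream(model, dataset, criterion, epochs_stage1=10, epochs_stage2=20):
    # Stage 1: fix (freeze) the decoding part and train only the encoding part
    # on 60% of the data (fixed epoch counts stand in for a convergence test here).
    for p in model.decoder.parameters():
        p.requires_grad = False
    subset = Subset(dataset, range(int(0.6 * len(dataset))))
    opt = torch.optim.SGD((p for p in model.parameters() if p.requires_grad),
                          lr=1e-3, momentum=0.9)
    for _ in range(epochs_stage1):
        for image, mask in DataLoader(subset, batch_size=8, shuffle=True):
            opt.zero_grad()
            criterion(model(image), mask).backward()
            opt.step()

    # Stage 2: train the whole network on 100% of the data; the encoder keeps
    # the weights it converged to, the decoder keeps its random initialization.
    for p in model.decoder.parameters():
        p.requires_grad = True
    opt = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)
    for _ in range(epochs_stage2):
        for image, mask in DataLoader(dataset, batch_size=8, shuffle=True):
            opt.zero_grad()
            criterion(model(image), mask).backward()
            opt.step()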
Optionally, the inter-frame information segmentation flow network includes:
the full convolutional network of the encoding part comprises: five cascaded generalized convolutional layers and an expanded convolutional layer located at the sixth layer, wherein the expanded convolutional layer at the sixth layer comprises four types of expansion with different scales, each type forms an output path, and the average value of the output results of the four output paths is the output result of the encoding part;
five generalized convolutional layers in the full convolutional network of the coding part comprise a first generalized convolutional layer, a second generalized convolutional layer, a third generalized convolutional layer, a fourth generalized convolutional layer and a fifth generalized convolutional layer which are cascaded, wherein:
the first generalized convolutional layer sequentially comprises: convolutional layer B11, active layer, convolutional layer B12, active layer, pooling layer;
the second generalized convolutional layer sequentially includes: convolutional layer B21, active layer, convolutional layer B22, active layer, pooling layer;
the third generalized convolutional layer sequentially includes: convolutional layer B31, active layer, convolutional layer B32, active layer, convolutional layer B33, active layer, pooling layer;
the fourth generalized convolutional layer sequentially includes: convolutional layer B41, active layer, convolutional layer B42, active layer, convolutional layer B43, active layer, pooling layer;
the fifth generalized convolutional layer sequentially comprises: convolutional layer B51, active layer, convolutional layer B52, active layer, convolutional layer B53, active layer, pooling layer;
the expanded convolutional layer cascaded with the fifth generalized convolutional layer in the full convolutional network of the coding part comprises: four types of expanded convolutional layers in parallel, wherein:
the first type of expanded convolutional layer sequentially comprises: a first scale expansion convolutional layer, an active layer, a random discard layer, a convolutional layer;
the second type of expanded convolutional layer sequentially comprises: a second scale expansion convolutional layer, an active layer, a random discard layer, a convolutional layer;
the third type of expansion convolution layer comprises in sequence: a third scale expansion convolutional layer, an active layer, a random discard layer, a convolutional layer;
the fourth type of expansion convolution layer comprises in sequence: a fourth scale expansion convolutional layer, an active layer, a random discard layer, a convolutional layer;
the full convolutional network of the decoding part is: a full convolutional network consisting of three cyclic convolutional layers and three upsampling layers; the full convolutional network of the decoding part is used for outputting a picture segmentation result consistent with the resolution of an input picture; wherein:
in a full convolutional network of a decoding part, each upsampling layer is cascaded with a corresponding cyclic convolutional layer, wherein:
a first upsampling layer is cascaded with a third cyclic convolution layer, and the first upsampling layer is used for performing double upsampling on the output of the previous layer; the third cyclic convolution layer is used for performing convolution processing on the output of the coding part convolution layer B33 and performing cyclic convolution operation on the output of the first up-sampling layer;
the second up-sampling layer is cascaded with the second circulating convolution layer, and the second up-sampling layer is used for performing double up-sampling on the output of the previous layer; the second cyclic convolution layer is used for performing convolution processing on the output of the coding part convolution layer B22 and performing cyclic convolution operation on the output of the second up-sampling layer;
a third upsampling layer is cascaded with the first cyclic convolution layer, and the third upsampling layer is used for performing double upsampling on the output of the previous layer; the first cyclic convolution layer is used to convolve the output of the encoded partial convolution layer B12 with the output of the third upsampling layer.
Optionally, training the interframe information segmentation stream network with video that has been subjected to interframe information segmentation includes:
collecting the training video set VID for video object detection in ILSVRC2015, wherein all videos in the training video set VID have complete object detection marker frames (bounding boxes);
performing image segmentation on each frame of the video set VID by using a static image segmentation flow network obtained by training to obtain a foreground and background segmentation result;
calculating the optical flow field between adjacent video frames and storing the optical flow field information corresponding to each video frame as an RGB (red, green, blue) image;
screening out an image segmentation result with correct segmentation as an initial training image of the interframe information segmentation flow network according to a preset screening strategy and by combining a mark frame in the training video set VID; wherein the screening strategy satisfies the following conditions:
first: the image segmentation result of each video frame occupies 75% to 90% of the area of the object detection marker frame;
second: the average optical flow magnitude computed from the RGB image of the optical flow field is between 5 and 100;
expanding the initial training images to the order of 10^4 by random cropping, mirroring, flipping, scaling and affine transformation;
fixing the decoding part, and training the coding part by using 60% of data until the coding part converges;
training the interframe information segmentation flow network using 100% of the training data, wherein the encoding part is initialized with the weights obtained at convergence and the decoding part is randomly initialized.
Optionally, the fusion network includes: a connection layer, a convolution layer, an activation layer, a convolution layer, and an activation layer; wherein:
the connection layer is used for connecting the static image segmentation stream network and the interframe information segmentation stream network, and fusing output results of the static image segmentation stream network and the interframe information segmentation stream network through a convolution layer, an activation layer, a convolution layer and an activation layer to obtain a final video segmentation result.
Optionally, the static image segmentation flow network and the interframe information segmentation flow network perform real-time update of network parameters in a training process.
Compared with the prior art, the invention has the following beneficial effects:
the unsupervised video segmentation method based on deep learning provided by the invention comprises the steps of constructing a double-flow video segmentation network comprising a static image segmentation flow network and an interframe information segmentation flow network, wherein the static image segmentation flow network is used for high-quality intraframe segmentation, the interframe information segmentation flow network is used for high-quality optical flow field information segmentation, and two paths of output obtain an improved segmentation result through final fusion operation; when the problems that the traditional methods such as shielding, slow movement and the like cannot completely solve exist, the method can still obtain a better segmentation result according to effective double-path output and fusion operation.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:
FIG. 1 is a schematic diagram of an unsupervised video segmentation method based on deep learning according to the present invention;
FIG. 2 is a schematic diagram of a cyclic convolution layer in a decoding network used in the present invention;
FIG. 3 is a schematic diagram illustrating the effect of the screening strategy for generating a data set required by the interframe information segmentation flow network training according to the present invention;
FIG. 4 is a graph comparing results of embodiments of the present invention with the currently best unsupervised and semi-supervised methods, where the Fast Object Segmentation in Unconstrained Video (FST) method and the Object Flow (OFL) video segmentation method are, respectively, the best unsupervised and semi-supervised methods at present.
Detailed Description
The present invention will be described in detail below with reference to specific examples. The following examples will help those skilled in the art to further understand the invention, but do not limit the invention in any way. It should be noted that a person skilled in the art can make various changes and modifications without departing from the spirit of the invention, all of which fall within the scope of the present invention.
As shown in FIG. 1, the present embodiment provides an unsupervised video segmentation method based on deep learning. The details of the embodiment are as follows; for parts not described in detail, reference may be made to the Summary of the Invention.
Firstly, two networks are built: a static image segmentation flow network and an interframe information segmentation flow network. The two networks have the same structure and are based on an encoding-decoding architecture. The encoding part is a full convolutional network comprising five generalized convolutional layers (the first three each contain convolutional layers, activation layers and a pooling layer, while the last two contain no pooling layer) and a final expanded (dilated) convolutional layer; this last layer is divided into four types of expansion with different scales, each type forming a path, and the output of the encoding part is the average of the four paths' outputs. The decoding part is also a full convolutional network; it is connected after the encoding part and comprises three cyclic convolutional layers and three upsampling layers. The final output of each stream has the same size as its input. The details of the encoding and decoding parts are as follows:
The specific structure of the encoding part is as follows (generalized convolutional layers 1 to 5 listed below are cascaded; the four types of layer 6 are parallel to one another, and layer 6 as a whole is cascaded with layer 5):
Generalized convolutional layer 1: convolutional layer 1-1 + activation layer + convolutional layer 1-2 + activation layer + pooling layer;
Generalized convolutional layer 2: convolutional layer 2-1 + activation layer + convolutional layer 2-2 + activation layer + pooling layer;
Generalized convolutional layer 3: convolutional layer 3-1 + activation layer + convolutional layer 3-2 + activation layer + convolutional layer 3-3 + activation layer + pooling layer;
Generalized convolutional layer 4: convolutional layer 4-1 + activation layer + convolutional layer 4-2 + activation layer + convolutional layer 4-3 + activation layer + pooling layer;
Generalized convolutional layer 5: convolutional layer 5-1 + activation layer + convolutional layer 5-2 + activation layer + convolutional layer 5-3 + activation layer + pooling layer;
"expanded" convolutional layer 6-1: "expand" convolutional layer (displacement ═ 6) + active layer + random discard layer (dropout) + convolutional layer;
"expanded" convolutional layer 6-2: "expand" convolutional layer (displacement ═ 12) + active layer + random discard layer (dropout) + convolutional layer;
"expanded" convolutional layer 6-3: "expand" convolutional layer (displacement ═ 18) + active layer + random discard layer (dropout) + convolutional layer;
"expanded" convolutional layer 6-4: "expand" convolutional layer (displacement ═ 24) + active layer + random discard layer (dropout) + convolutional layer;
the specific structure of the decoding part is as follows: (the upsampling layer + cyclic convolution layers 3 to 1 listed below are all cascade operations)
Upsampling layer + cyclic convolution layer 3: the upsampling layer performs 2× upsampling of the output of the previous layer; cyclic convolution layer 3 convolves the output of encoding-part convolutional layer 3-3 and performs a cyclic convolution operation with the output of the upsampling layer.
Upsampling layer + cyclic convolution layer 2: the upsampling layer performs 2× upsampling of the output of the previous layer; cyclic convolution layer 2 convolves the output of encoding-part convolutional layer 2-2 and performs a cyclic convolution operation with the output of the upsampling layer.
Upsampling layer + cyclic convolution layer 1: the upsampling layer performs 2× upsampling of the output of the previous layer; cyclic convolution layer 1 convolves the output of encoding-part convolutional layer 1-2 and performs a cyclic convolution operation with the output of the upsampling layer.
In this embodiment, "+" indicates a cascade relationship, subscripts 1-1 indicate the first-layer convolutional layer of the generalized convolutional layer 1, and subscripts 1-2 indicate the second-layer convolutional layer of the generalized convolutional layer 1; the subscript i-j represents the jth convolutional layer of the generalized convolutional layer i, wherein i is 1-5 and j is 1-3. Subscript 6-1 represents the first type of "expanded" convolutional layer of the "expanded" convolutional layer, subscript 6-2 represents the second type of "expanded" convolutional layer of the "expanded" convolutional layer, subscript 6-3 represents the third type of "expanded" convolutional layer of the "expanded" convolutional layer, and subscript 6-4 represents the fourth type of "expanded" convolutional layer of the "expanded" convolutional layer.
The details of the cyclic convolutional layer are shown in FIG. 2. It can be viewed as a convolutional layer with recurrent links added along the time dimension, which has the advantage that, as the number of recurrent steps increases, each convolutional layer enlarges its local receptive field on the input without adding parameters, so that local details can be captured and fused. As shown in FIG. 2, the number of recurrent steps is set to 3 in the present invention, which balances computational efficiency against the hardware cost of training. A minimal sketch of such a cyclic convolutional layer is given below.
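The sketch below (PyTorch) illustrates one plausible reading of this cyclic convolutional layer: the encoder skip feature and the upsampled decoder feature are first convolved and combined, and a shared recurrent convolution is then applied for three steps. The combination by addition and the channel sizes are assumptions of the sketch.

import torch.nn as nn

class CyclicConvLayer(nn.Module):
    def __init__(self, skip_ch, dec_ch, out_ch, steps=3):
        super().__init__()
        self.steps = steps                                          # 3 recurrent steps, as in FIG. 2
        self.skip_conv = nn.Conv2d(skip_ch, out_ch, 3, padding=1)   # convolves the encoder skip feature
        self.feed_conv = nn.Conv2d(dec_ch, out_ch, 3, padding=1)    # convolves the upsampled decoder feature
        self.recurrent = nn.Conv2d(out_ch, out_ch, 3, padding=1)    # shared across steps: no extra parameters
        self.relu = nn.ReLU(inplace=True)

    def forward(self, skip_feat, upsampled_feat):
        x = self.relu(self.skip_conv(skip_feat) + self.feed_conv(upsampled_feat))
        state = x
        for _ in range(self.steps):          # recurrent links enlarge the local receptive field
            state = self.relu(x + self.recurrent(state))
        return state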
for still image split stream networks: three authoritative image segmentation data sets (including ECSSD, MSRA 10K and PASCAL VOC 2012) which are disclosed currently are selected, collected to obtain 21582 pictures, and the data sets are expanded to 10 degrees through operations of random cutting, mirroring, turning, zooming in and out, affine transformation and the like4And the magnitude reduces the overfitting possibly occurring in the training process. When training the network, firstly fixing the decoding part, and using 60% of data to train the coding part; and after the coding part converges, training the whole network by using 100% of training data, wherein the coding part is initialized by using the weight value before convergence, and the decoding part is initialized randomly.
For an interframe information split stream network: there is currently no large-scale video segmentation dataset disclosed, so we have to make the training dataset manually. The training video sets VID for video object detection in the ILSVRC2015 are collected first, and all the video sets have complete object detection marker boxes to accurately represent the positions of the objects. And then carrying out image segmentation on each frame of the video set by using the static image segmentation flow network obtained by training to obtain a foreground and background segmentation result. And then calculating the optical flow field between each video frame and storing the optical flow field information corresponding to each video frame into an RGB (red, green and blue) graph. And finally, screening the frames meeting the conditions and the segmentation results thereof by using a set of screening strategies and combining the existing marking frames of the video detection, and summarizing the frames to be used as training data of the training interframe information segmentation flow network.
The screening strategy includes two points. 1) Reliable segmentation result: the image segmentation result of each video frame must occupy between 75% and 90% of the object detection marker frame. 2) Reliable optical flow field information: the computed RGB image of the optical flow field must have an average optical flow magnitude between 5 and 100, since very slow or very fast motion yields very inaccurate optical flow field information. A sketch of these two screening conditions is given below.
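For illustration, the two conditions can be expressed as in the following sketch; how the segmentation mask is compared against the detection box and how the flow magnitude is obtained from the stored flow are assumptions (here the mask is a binary array and the flow is given directly as a two-channel displacement field).

import numpy as np

def reliable_segmentation(mask, box):
    # mask: H x W binary array; box: (x1, y1, x2, y2) object detection box in pixels.
    x1, y1, x2, y2 = box
    box_area = max((x2 - x1) * (y2 - y1), 1)
    coverage = mask[y1:y2, x1:x2].sum() / box_area
    return 0.75 <= coverage <= 0.90          # condition 1: mask covers 75%-90% of the box

def reliable_flow(flow):
    # flow: H x W x 2 array of per-pixel (dx, dy) displacements.
    magnitude = np.linalg.norm(flow, axis=2)
    return 5 <= magnitude.mean() <= 100      # condition 2: average flow magnitude in [5, 100]

def keep_frame(mask, box, flow):
    return reliable_segmentation(mask, box) and reliable_flow(flow)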
Screening yields 24960 training samples (FIG. 3 shows some of the cases and processing that occur during screening), and the data set is expanded to the order of 10^4 through random cropping, mirroring, flipping, scaling, affine transformation and similar operations, which reduces the overfitting that may occur during training. When training the network, the decoding part is first fixed and the encoding part is trained with 60% of the data; after the encoding part converges, the whole network is trained with 100% of the training data, where the encoding part is initialized with the weights from its earlier convergence and the decoding part is randomly initialized.
After the two stream networks are trained, the final part, the fusion network, is built. This network includes a connection layer and two generalized convolutional layers (each consisting of a convolutional layer and an activation layer); its specific structure is: connection layer, convolutional layer, activation layer, convolutional layer, activation layer. The connection layer directly connects the static image segmentation flow network and the interframe information segmentation flow network, and the fused two-stream output is used as the final segmentation result. The three networks together form a complete video segmentation network. A sketch of the fusion network is given below.
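A minimal sketch of this fusion network follows (PyTorch); the hidden channel count and kernel sizes are illustrative assumptions.

import torch
import torch.nn as nn

class FusionNetwork(nn.Module):
    # connection (concatenation) layer + convolution + activation + convolution + activation
    def __init__(self, hidden_ch=16):
        super().__init__()
        self.conv1 = nn.Conv2d(2, hidden_ch, 3, padding=1)  # input: two concatenated single-channel masks
        self.conv2 = nn.Conv2d(hidden_ch, 1, 3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, appearance_mask, motion_mask):
        x = torch.cat([appearance_mask, motion_mask], dim=1)  # connection layer
        x = self.relu(self.conv1(x))
        return self.relu(self.conv2(x))                       # final segmentation result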
Finally, the fusion network is trained with a fully annotated video segmentation data set. Training is performed on the whole formed by the trained static image segmentation flow network, the trained interframe information segmentation flow network and the fusion network to be trained. During training, the parameters of the static image segmentation flow network and the interframe information segmentation flow network are fixed and not updated; part of the training set of the fully annotated video segmentation data set DAVIS is selected to update the parameters of the fusion network until training converges.
At this point, the deep neural network required by the proposed unsupervised video segmentation method is ready. At test time the network can be used directly, without any post-processing. The test flow is as follows: first, the optical flow field between video frames is computed and processed into the optical flow field RGB image corresponding to each frame; then, each video frame and its corresponding optical flow field RGB image are fed synchronously into the trained static image segmentation flow network and the trained interframe information segmentation flow network; finally, the output of the fusion network is the final segmentation result. A sketch of this test-time pipeline is given below.
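For illustration, a hedged sketch of this test-time pipeline follows. OpenCV's Farnebäck optical flow and the HSV flow-to-RGB conversion are stand-ins chosen for the sketch (the description above does not name a specific flow algorithm), and the model is assumed to take a frame and a flow image and return a probability map.

import cv2
import numpy as np
import torch

def flow_to_rgb(flow):
    # Standard HSV encoding of a flow field (hue = direction, value = magnitude).
    mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    hsv = np.zeros((flow.shape[0], flow.shape[1], 3), dtype=np.uint8)
    hsv[..., 0] = ang * 180 / np.pi / 2
    hsv[..., 1] = 255
    hsv[..., 2] = cv2.normalize(mag, None, 0, 255, cv2.NORM_MINMAX)
    return cv2.cvtColor(hsv, cv2.COLOR_HSV2RGB)

def segment_video(frames, model):
    # frames: list of H x W x 3 uint8 RGB frames; model: trained dual-stream network.
    to_tensor = lambda a: torch.from_numpy(a).permute(2, 0, 1).float().unsqueeze(0) / 255
    masks = []
    for t in range(len(frames) - 1):
        prev_gray = cv2.cvtColor(frames[t], cv2.COLOR_RGB2GRAY)
        next_gray = cv2.cvtColor(frames[t + 1], cv2.COLOR_RGB2GRAY)
        flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        flow_rgb = flow_to_rgb(flow)
        with torch.no_grad():
            prob = model(to_tensor(frames[t]), to_tensor(flow_rgb))  # fused output, assumed in [0, 1]
        masks.append((prob.squeeze().numpy() > 0.5).astype(np.uint8))
    return masks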
To demonstrate the advancement of the present invention, the method of the invention is compared with currently representative unsupervised and semi-supervised methods. Most video segmentation methods are currently evaluated with the Intersection over Union (IoU) metric, defined as follows:
IoU = 100 × |S ∩ G| / |S ∪ G|
wherein: s is a segmentation result obtained by each algorithm, and G is a corresponding standard segmentation result. IoU being larger indicates better segmentation results.
TABLE 1
(Table 1 appears as an image in the original publication; it lists the IoU results of the method of the invention and the comparison methods on the DAVIS and SegTrack v2 data sets.)
Table 1 compares the IoU results of the present method and other methods on the DAVIS and SegTrack v2 data sets. The DAVIS data set is currently the most authoritative data set; its pictures are 480p and 1080p, it covers many object categories, and its annotations are clean. The objects in the SegTrack v2 data set are small and the video resolution is low. As seen from the table, on the DAVIS data set the method of the invention outperforms all unsupervised and semi-supervised methods, improving on the best unsupervised method FST by 14% and on the best semi-supervised method by nearly two points. It should be noted that semi-supervised methods generally require a long processing time because they need accurate annotation of the first frame or the first few frames; for example, the OFL method needs approximately 2 minutes to process a 480p picture, while the present method needs only 0.2 seconds. On the SegTrack v2 data set the method of the invention is slightly worse than the OFL method, for the following possible reasons: (1) the video resolution is low and the objects are small, which is unfavourable for a deep learning method trying to capture detailed information; (2) OFL is a parameterized method that is optimized for every video in the experiment to obtain its best result, whereas the method of the invention is not optimized for a specific domain and uses the same pre-trained network on all experimental videos. FIG. 4 is a visual comparison of the segmentation results of the method of the invention, the FST method and the OFL method; it can be seen that the method of the invention preserves details best and has the highest segmentation accuracy.
The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes or modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention. The embodiments and features of the embodiments of the present application may be combined with each other arbitrarily without conflict.

Claims (10)

1. An unsupervised video segmentation method based on deep learning is characterized by comprising the following steps:
establishing a codec deep neural network, the codec deep neural network comprising: a static image segmentation flow network, an interframe information segmentation flow network and a fusion network; the static image segmentation flow network is used for performing foreground and background segmentation processing on a current video frame, and the inter-frame information segmentation flow network is used for performing foreground and background segmentation on optical flow field information between the current video frame and a next video frame;
and fusing the segmentation images output by the static image segmentation flow network and the interframe information segmentation flow network through the fusion network to obtain a video segmentation result.
2. The method of claim 1, wherein the building a codec deep neural network comprises:
establishing a static image segmentation flow network, and training the static image segmentation flow network through an image subjected to static image segmentation;
establishing an interframe information segmentation flow network, and training the interframe information segmentation flow network through a video subjected to interframe information segmentation;
and training the coding and decoding deep neural network by using the fully labeled video segmentation data.
3. The unsupervised video segmentation method based on deep learning of claim 2 wherein the static image segmentation stream network comprises: an encoding part and a decoding part which are formed by a full convolution network, wherein,
the full convolutional network of the encoding part comprises: five cascaded generalized convolutional layers and an expanded convolutional layer located at the sixth layer, wherein the expanded convolutional layer at the sixth layer comprises four types of expansion with different scales, each type forms an output path, and the average value of the output results of the four output paths is the output result of the encoding part;
the full convolutional network of the decoding part is: a full convolutional network consisting of three cyclic convolutional layers and three upsampling layers; and the full convolutional network of the decoding part is used for outputting a picture segmentation result consistent with the resolution of the input picture.
4. The unsupervised video segmentation method based on deep learning of claim 3 wherein five generalized convolutional layers in the full convolutional network of the coding portion comprise a first generalized convolutional layer, a second generalized convolutional layer, a third generalized convolutional layer, a fourth generalized convolutional layer, a fifth generalized convolutional layer which are cascaded, wherein:
the first generalized convolutional layer sequentially comprises: convolutional layer A11, active layer, convolutional layer A12, active layer, pooling layer;
the second generalized convolutional layer sequentially includes: convolutional layer A21, active layer, convolutional layer A22, active layer, pooling layer;
the third generalized convolutional layer sequentially includes: convolutional layer A31, active layer, convolutional layer A32, active layer, convolutional layer A33, active layer, pooling layer;
the fourth generalized convolutional layer sequentially includes: convolutional layer A41, active layer, convolutional layer A42, active layer, convolutional layer A43, active layer, pooling layer;
the fifth generalized convolutional layer sequentially comprises: convolutional layer A51, active layer, convolutional layer A52, active layer, convolutional layer A53, active layer, pooling layer;
the expanded convolutional layer cascaded with the fifth generalized convolutional layer in the full convolutional network of the coding part comprises: four types of expanded convolutional layers in parallel, wherein:
the first type of expanded convolutional layer sequentially comprises: a first scale expansion convolutional layer, an active layer, a random discard layer, a convolutional layer;
the second type of expanded convolutional layer sequentially comprises: a second scale expansion convolutional layer, an active layer, a random discard layer, a convolutional layer;
the third type of expansion convolution layer comprises in sequence: a third scale expansion convolutional layer, an active layer, a random discard layer, a convolutional layer;
the fourth type of expansion convolution layer comprises in sequence: a fourth scale expansion convolutional layer, an active layer, a random discard layer, a convolutional layer.
5. The method of claim 4, wherein each upsampled layer is cascaded with a corresponding cyclic convolutional layer in a full convolutional network of the decoding part, wherein:
a first upsampling layer is cascaded with a third cyclic convolution layer, and the first upsampling layer is used for performing double upsampling on the output of the previous layer; the third cyclic convolution layer is used for performing convolution processing on the output of the coding part convolution layer A33 and performing cyclic convolution operation on the output of the first up-sampling layer;
the second up-sampling layer is cascaded with the second circulating convolution layer, and the second up-sampling layer is used for performing double up-sampling on the output of the previous layer; the second cyclic convolution layer is used for performing convolution processing on the output of the coding part convolution layer A22 and performing cyclic convolution operation on the output of the second up-sampling layer;
a third upsampling layer is cascaded with the first cyclic convolution layer, and the third upsampling layer is used for performing double upsampling on the output of the previous layer; the first cyclic convolution layer is used for performing convolution processing on the output of the encoding-part convolutional layer A12 and performing a cyclic convolution operation with the output of the third upsampling layer.
6. The unsupervised video segmentation method based on deep learning of claim 3 wherein the training of the still image segmentation flow network by images that have been still image segmented comprises:
selecting sample pictures in an ECSSD image segmentation data set, an MSRA 10K image segmentation data set and a PASCAL VOC 2012 image segmentation data set;
expanding the sample pictures to the order of 10^4 by random cropping, mirroring, flipping, scaling and affine transformation, and using the expanded pictures as training data;
fixing the decoding part, and training the coding part by using 60% of training data until the coding part converges;
training the static image segmentation flow network by using 100% of data; the encoding part uses the weight value in convergence to carry out initialization, and the decoding part carries out random initialization.
7. The method of claim 2, wherein the inter-frame information segmentation stream network comprises: the encoding part and the decoding part are cascaded with each other and formed by a full convolution network; wherein:
the full convolutional network of the encoding part comprises: five cascaded generalized convolutional layers and an expanded convolutional layer located at the sixth layer, wherein the expanded convolutional layer at the sixth layer comprises four types of expansion with different scales, each type forms an output path, and the average value of the output results of the four output paths is the output result of the encoding part;
five generalized convolutional layers in the full convolutional network of the coding part comprise a first generalized convolutional layer, a second generalized convolutional layer, a third generalized convolutional layer, a fourth generalized convolutional layer and a fifth generalized convolutional layer which are cascaded, wherein:
the first generalized convolutional layer sequentially comprises: convolutional layer B11, active layer, convolutional layer B12, active layer, pooling layer;
the second generalized convolutional layer sequentially includes: convolutional layer B21, active layer, convolutional layer B22, active layer, pooling layer;
the third generalized convolutional layer sequentially includes: convolutional layer B31, active layer, convolutional layer B32, active layer, convolutional layer B33, active layer, pooling layer;
the fourth generalized convolutional layer sequentially includes: convolutional layer B41, active layer, convolutional layer B42, active layer, convolutional layer B43, active layer, pooling layer;
the fifth generalized convolutional layer sequentially comprises: convolutional layer B51, active layer, convolutional layer B52, active layer, convolutional layer B53, active layer, pooling layer;
the expanded convolutional layer cascaded with the fifth generalized convolutional layer in the full convolutional network of the coding part comprises: four types of expanded convolutional layers in parallel, wherein:
the first type of expanded convolutional layer sequentially comprises: a first scale expansion convolutional layer, an active layer, a random discard layer, a convolutional layer;
the second type of expanded convolutional layer sequentially comprises: a second scale expansion convolutional layer, an active layer, a random discard layer, a convolutional layer;
the third type of expansion convolution layer comprises in sequence: a third scale expansion convolutional layer, an active layer, a random discard layer, a convolutional layer;
the fourth type of expansion convolution layer comprises in sequence: a fourth scale expansion convolutional layer, an active layer, a random discard layer, a convolutional layer;
the full convolutional network of the decoding part is: a full convolutional network consisting of three cyclic convolutional layers and three upsampling layers; the full convolutional network of the decoding part is used for outputting a picture segmentation result consistent with the resolution of an input picture; wherein:
in a full convolutional network of a decoding part, each upsampling layer is cascaded with a corresponding cyclic convolutional layer, wherein:
a first upsampling layer is cascaded with a third cyclic convolution layer, and the first upsampling layer is used for performing double upsampling on the output of the previous layer; the third cyclic convolution layer is used for performing convolution processing on the output of the coding part convolution layer B33 and performing cyclic convolution operation on the output of the first up-sampling layer;
the second up-sampling layer is cascaded with the second circulating convolution layer, and the second up-sampling layer is used for performing double up-sampling on the output of the previous layer; the second cyclic convolution layer is used for performing convolution processing on the output of the coding part convolution layer B22 and performing cyclic convolution operation on the output of the second up-sampling layer;
a third upsampling layer is cascaded with the first cyclic convolution layer, and the third upsampling layer is used for performing double upsampling on the output of the previous layer; the first cyclic convolution layer is used to convolve the output of the encoded partial convolution layer B12 with the output of the third upsampling layer.
8. The unsupervised video segmentation method based on deep learning of claim 7, wherein the training of the interframe information segmentation streaming network by the video that has been subjected to interframe information segmentation comprises:
collecting the training video set VID for video object detection in ILSVRC2015, wherein all videos in the training video set VID have complete object detection marker frames (bounding boxes);
performing image segmentation on each frame of the video set VID by using a static image segmentation flow network obtained by training to obtain a foreground and background segmentation result;
calculating the optical flow field between adjacent video frames and storing the optical flow field information corresponding to each video frame as an RGB (red, green, blue) image;
screening out an image segmentation result with correct segmentation as an initial training image of the interframe information segmentation flow network according to a preset screening strategy and by combining a mark frame in the training video set VID; wherein the screening strategy satisfies the following conditions:
first: the image segmentation result of each video frame occupies 75% to 90% of the area of the object detection marker frame;
second: the average optical flow magnitude computed from the RGB image of the optical flow field is between 5 and 100;
expanding the initial training images to the order of 10^4 by random cropping, mirroring, flipping, scaling and affine transformation, and using the expanded images as training data;
fixing the decoding part, and training the coding part by using 60% of training data until the coding part converges;
training the interframe information segmentation flow network by using 100% of training data; the encoding part uses the weight value in convergence to carry out initialization, and the decoding part carries out random initialization.
9. The method of claim 1, wherein the fusion network comprises: a connection layer, a convolution layer, an activation layer, a convolution layer, and an activation layer; wherein:
the connection layer is used for connecting the static image segmentation stream network and the interframe information segmentation stream network, and fusing output results of the static image segmentation stream network and the interframe information segmentation stream network through a convolution layer, an activation layer, a convolution layer and an activation layer to obtain a final video segmentation result.
10. The unsupervised video segmentation method based on deep learning of any of claims 2-9, wherein the static image segmentation stream network and the interframe information segmentation stream network perform real-time updating of network parameters during training.
CN201711004135.5A 2017-10-24 2017-10-24 Unsupervised video segmentation method based on deep learning Active CN107808389B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711004135.5A CN107808389B (en) 2017-10-24 2017-10-24 Unsupervised video segmentation method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711004135.5A CN107808389B (en) 2017-10-24 2017-10-24 Unsupervised video segmentation method based on deep learning

Publications (2)

Publication Number Publication Date
CN107808389A CN107808389A (en) 2018-03-16
CN107808389B true CN107808389B (en) 2020-04-17

Family

ID=61585461

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711004135.5A Active CN107808389B (en) 2017-10-24 2017-10-24 Unsupervised video segmentation method based on deep learning

Country Status (1)

Country Link
CN (1) CN107808389B (en)

Families Citing this family (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108876792B (en) * 2018-04-13 2020-11-10 北京迈格威科技有限公司 Semantic segmentation method, device and system and storage medium
CN108712630A (en) * 2018-04-19 2018-10-26 安凯(广州)微电子技术有限公司 A kind of internet camera system and its implementation based on deep learning
CN108734211B (en) 2018-05-17 2019-12-24 腾讯科技(深圳)有限公司 Image processing method and device
CN108805898B (en) * 2018-05-31 2020-10-16 北京字节跳动网络技术有限公司 Video image processing method and device
CN110555805B (en) * 2018-05-31 2022-05-31 杭州海康威视数字技术股份有限公司 Image processing method, device, equipment and storage medium
CN109118490B (en) * 2018-06-28 2021-02-26 厦门美图之家科技有限公司 Image segmentation network generation method and image segmentation method
CN109034162B (en) * 2018-07-13 2022-07-26 南京邮电大学 Image semantic segmentation method
CN109086807B (en) * 2018-07-16 2022-03-18 哈尔滨工程大学 Semi-supervised optical flow learning method based on void convolution stacking network
CN109068174B (en) * 2018-09-12 2019-12-27 上海交通大学 Video frame rate up-conversion method and system based on cyclic convolution neural network
CN109785327A (en) * 2019-01-18 2019-05-21 中山大学 Video moving object segmentation method fusing appearance information and motion information
CN109961095B (en) * 2019-03-15 2023-04-28 深圳大学 Image labeling system and method based on unsupervised deep learning
CN110147763B (en) * 2019-05-20 2023-02-24 哈尔滨工业大学 Video semantic segmentation method based on convolutional neural network
CN110246142A (en) * 2019-06-14 2019-09-17 深圳前海达闼云端智能科技有限公司 Method, terminal and readable storage medium for detecting obstacles
US10762629B1 (en) 2019-11-14 2020-09-01 SegAI LLC Segmenting medical images
US11423544B1 (en) 2019-11-14 2022-08-23 Seg AI LLC Segmenting medical images
CN111260679B (en) * 2020-01-07 2022-02-01 广州虎牙科技有限公司 Image processing method, image segmentation model training method and related device
CN111275518B (en) * 2020-01-15 2023-04-21 中山大学 Video virtual fitting method and device based on mixed optical flow
CN112016406B (en) * 2020-08-07 2022-12-02 青岛科技大学 Video key frame extraction method based on full convolution network
CN112784750B (en) * 2021-01-22 2022-08-09 清华大学 Fast video object segmentation method and device based on pixel and region feature matching
CN113469146B (en) * 2021-09-02 2021-12-14 深圳市海清视讯科技有限公司 Target detection method and device
CN114358144B (en) * 2021-12-16 2023-09-26 西南交通大学 Image segmentation quality assessment method

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1532812A2 (en) * 2002-04-26 2005-05-25 The Trustees Of Columbia University In The City Of New York Method and system for optimal video transcoding based on utility function descriptors
CN106204597A (en) * 2016-07-13 2016-12-07 西北工业大学 A video segmentation (VS) method based on self-paced weakly supervised learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Learning to Segment Moving Objects in Videos; Katerina Fragkiadaki et al.; CVPR 2015; 2015-12-31; pp. 4083-4090 *

Also Published As

Publication number Publication date
CN107808389A (en) 2018-03-16

Similar Documents

Publication Publication Date Title
CN107808389B (en) Unsupervised video segmentation method based on deep learning
CN110097568B (en) Video object detection and segmentation method based on space-time dual-branch network
WO2023056889A1 (en) Model training and scene recognition method and apparatus, device, and medium
CN109299274B (en) Natural scene text detection method based on full convolution neural network
CN107679462B (en) Depth multi-feature fusion classification method based on wavelets
CN112884064A (en) Target detection and identification method based on neural network
CN107862376A A human body image action recognition method based on a two-stream neural network
CN112396607A (en) Streetscape image semantic segmentation method for deformable convolution fusion enhancement
CN111696110B (en) Scene segmentation method and system
Li et al. Learning face image super-resolution through facial semantic attribute transformation and self-attentive structure enhancement
CN113536972B (en) Self-supervision cross-domain crowd counting method based on target domain pseudo label
CN112101262B (en) Multi-feature fusion sign language recognition method and network model
Niu et al. Effective image restoration for semantic segmentation
CN115359370B (en) Remote sensing image cloud detection method and device, computer device and storage medium
Wang et al. Semantic segmentation of high-resolution images
Jiang et al. Mirror complementary transformer network for RGB‐thermal salient object detection
CN111462132A (en) Video object segmentation method and system based on deep learning
CN109670506A (en) Scene Segmentation and system based on Kronecker convolution
Ren et al. A lightweight object detection network in low-light conditions based on depthwise separable pyramid network and attention mechanism on embedded platforms
Xu et al. Multi-scale dehazing network via high-frequency feature fusion
WO2024040973A1 (en) Multi-scale fused dehazing method based on stacked hourglass network
Wang et al. Uneven image dehazing by heterogeneous twin network
CN103632357B (en) A kind of image super-resolution Enhancement Method separated based on illumination
CN115439669A (en) Feature point detection network based on deep learning and cross-resolution image matching method
CN114419729A (en) Behavior identification method based on light-weight double-flow network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant