CN107808389B - Unsupervised video segmentation method based on deep learning - Google Patents

Unsupervised video segmentation method based on deep learning

Info

Publication number
CN107808389B
CN107808389B (application CN201711004135.5A)
Authority
CN
China
Prior art keywords
layer
convolutional
network
segmentation
convolutional layer
Prior art date
Legal status
Active
Application number
CN201711004135.5A
Other languages
Chinese (zh)
Other versions
CN107808389A (en)
Inventor
宋利
许经纬
解蓉
张文军
Current Assignee
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority to CN201711004135.5A priority Critical patent/CN107808389B/en
Publication of CN107808389A publication Critical patent/CN107808389A/en
Application granted granted Critical
Publication of CN107808389B publication Critical patent/CN107808389B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/215Motion-based segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/088Non-supervised learning, e.g. competitive learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/194Segmentation; Edge detection involving foreground-background segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence

Abstract

The invention provides an unsupervised video segmentation method based on deep learning, which comprises the following steps: establishing an encoding-decoding deep neural network comprising a static image segmentation flow network, an interframe information segmentation flow network and a fusion network; the static image segmentation flow network performs foreground and background segmentation on the current video frame, and the interframe information segmentation flow network performs foreground and background segmentation of moving objects on the optical flow field information between the current video frame and the next video frame; the segmentation images output by the static image segmentation flow network and the interframe information segmentation flow network are then fused by the fusion network to obtain the video segmentation result. The static image segmentation flow network provides high-quality intra-frame segmentation, the interframe information segmentation flow network provides high-quality segmentation of optical flow field information, and the two outputs are combined in a final fusion operation, so that an improved segmentation result is obtained from the two effective outputs and the fusion operation.

Description

Unsupervised video segmentation method based on deep learning
Technical Field
The invention relates to the technical field of video processing, in particular to an unsupervised video segmentation method based on deep learning.
Background
Video segmentation refers to the process of separating the foreground object from the background in every frame of a video to obtain a binary mask. Its difficulty lies in ensuring both the accuracy of spatial-domain (intra-frame) segmentation and the continuity of temporal-domain (inter-frame) segmentation. High-quality video segmentation is the basis of video editing, video object recognition and video semantic analysis, and is therefore of great importance.
Existing video segmentation methods can be broadly classified into the following three categories according to their principles:
1) Traditional unsupervised video segmentation methods
These methods do not require manual annotation of key frames (such as the first frame); the general pipeline is image segmentation followed by inter-frame similar-region matching to segment a given video automatically. For example, in "Video Segmentation by Non-Local Consensus Voting", published at BMVC 2014 by A. Faktor and M. Irani, each frame is first processed to obtain a set of candidate segmentations that may contain objects (object proposals), inter-frame similarity detection is then performed on these candidates, and the candidates with the highest similarity are selected as the segmentation result. The advantage of such methods is that no manual intervention is needed, but a large number of intermediate representations such as superpixels must be computed, which consumes a great deal of time and storage space.
2) Traditional semi-supervised video segmentation methods
These methods generally require manual annotation of one or more key frames (such as the first frame or the first few frames), after which the annotated segmentation information is propagated to all subsequent frames. For example, in "Video Segmentation via Object Flow", published at CVPR 2016, Y.-H. Tsai, M.-H. Yang and M. J. Black propose a global-graph approach that places all frames into one graph whose edges represent inter-frame similarity; the annotation of the first frame is then propagated to the following frames by solving the graph. This is the most accurate of the traditional methods because the information of every frame is considered in the optimization, but the difficulty of solving the global graph greatly increases the segmentation time. This is also a common property of this class of methods: high segmentation accuracy comes with high computational complexity.
3) Deep-learning-based methods
With the development of deep learning, deep neural networks have achieved good results in image classification, segmentation and recognition, but in the video domain they are limited by the high temporal redundancy and have not yet fully shown their strength. In "One-Shot Video Object Segmentation", published at CVPR 2017 by S. Caelles, K.-K. Maninis, J. Pont-Tuset, L. Leal-Taixé, D. Cremers and L. Van Gool, it is proposed that video segmentation only requires single-frame segmentation of each video frame and does not depend on inter-frame information. The authors consider inter-frame information redundant, unnecessary and often inaccurate; their solution is to train a strong image segmentation network, accurately annotate the first frame (or first few frames) of a given video, fine-tune (finetune) the large network on these frames, and finally use this network to segment the remaining frames of the video. This method risks overfitting and cannot be applied to large-scale video segmentation scenarios.
Disclosure of Invention
Aiming at the defects in the prior art, the invention aims to provide an unsupervised video segmentation method based on deep learning.
The unsupervised video segmentation method based on deep learning provided by the invention comprises the following steps:
establishing a codec deep neural network, the codec deep neural network comprising: a static image segmentation flow network, an interframe information segmentation flow network and a fusion network; the static image segmentation flow network is used for performing foreground and background segmentation processing on a current video frame, and the inter-frame information segmentation flow network is used for performing foreground and background segmentation on optical flow field information between the current video frame and a next video frame;
and fusing the segmentation images output by the static image segmentation flow network and the interframe information segmentation flow network through the fusion network to obtain a video segmentation result.
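For illustration, the following is a minimal sketch (in PyTorch) of how the three sub-networks described above fit together; the module and argument names are assumptions made for this sketch and are not taken from the patent, and the internal structure of each sub-network is detailed later.

import torch.nn as nn

class DualStreamSegmentationNet(nn.Module):
    # Wrapper combining the two segmentation streams and the fusion network.
    def __init__(self, appearance_stream, motion_stream, fusion_net):
        super().__init__()
        self.appearance_stream = appearance_stream  # static image segmentation stream
        self.motion_stream = motion_stream          # inter-frame information segmentation stream
        self.fusion_net = fusion_net                # fusion network

    def forward(self, frame, flow_rgb):
        # frame: current video frame; flow_rgb: optical flow field between the
        # current frame and the next frame, stored as an RGB image.
        appearance_mask = self.appearance_stream(frame)  # intra-frame foreground/background mask
        motion_mask = self.motion_stream(flow_rgb)       # mask from optical flow field information
        return self.fusion_net(appearance_mask, motion_mask)  # fused video segmentation result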
Optionally, establishing the codec deep neural network includes:
establishing a static image segmentation flow network, and training the static image segmentation flow network through an image subjected to static image segmentation;
establishing an interframe information segmentation flow network, and training the interframe information segmentation flow network through a video subjected to interframe information segmentation;
and training the coding and decoding deep neural network by using the fully labeled video segmentation data.
Optionally, the static image segmentation flow network comprises: an encoding part and a decoding part which are formed by a full convolution network, wherein,
the full convolutional network of the encoding part comprises: five cascaded generalized convolutional layers and an expanded convolutional layer located at the sixth layer, wherein the expanded convolutional layer at the sixth layer comprises four types of expansion with different scales, each type forms an output path, and the average value of the output results of the four output paths is the output result of the encoding part;
the full convolutional network of the decoding part is: a full convolutional network consisting of three cyclic convolutional layers and three upsampling layers; and the full convolutional network of the decoding part is used for outputting a picture segmentation result consistent with the resolution of the input picture.
Optionally, the five generalized convolutional layers in the full convolutional network of the coding part comprise a first generalized convolutional layer, a second generalized convolutional layer, a third generalized convolutional layer, a fourth generalized convolutional layer, and a fifth generalized convolutional layer which are cascaded, wherein:
the first generalized convolutional layer sequentially comprises: convolutional layer A11, active layer, convolutional layer A12, active layer, pooling layer;
the second generalized convolutional layer sequentially includes: convolutional layer A21, active layer, convolutional layer A22, active layer, pooling layer;
the third generalized convolutional layer sequentially includes: convolutional layer A31, active layer, convolutional layer A32, active layer, convolutional layer A33, active layer, pooling layer;
the fourth generalized convolutional layer sequentially includes: convolutional layer A41, active layer, convolutional layer A42, active layer, convolutional layer A43, active layer, pooling layer;
the fifth generalized convolutional layer sequentially comprises: convolutional layer A51, active layer, convolutional layer A52, active layer, convolutional layer A53, active layer, pooling layer;
the expanded convolutional layer cascaded with the fifth generalized convolutional layer in the full convolutional network of the coding part comprises: four types of expanded convolutional layers in parallel, wherein:
the first type of expanded convolutional layer sequentially comprises: a first scale expansion convolutional layer, an active layer, a random discard layer, a convolutional layer;
the second type of expanded convolutional layer sequentially comprises: a second scale expansion convolutional layer, an active layer, a random discard layer, a convolutional layer;
the third type of expansion convolution layer comprises in sequence: a third scale expansion convolutional layer, an active layer, a random discard layer, a convolutional layer;
the fourth type of expansion convolution layer comprises in sequence: a fourth scale expansion convolutional layer, an active layer, a random discard layer, a convolutional layer.
Optionally, in the full convolutional network of the decoding part, each upsampling layer is cascaded with a corresponding cyclic convolutional layer, wherein:
a first upsampling layer is cascaded with a third cyclic convolution layer, and the first upsampling layer is used for performing double upsampling on the output of the previous layer; the third cyclic convolution layer is used for performing convolution processing on the output of the coding part convolution layer A33 and performing cyclic convolution operation on the output of the first up-sampling layer;
the second up-sampling layer is cascaded with the second circulating convolution layer, and the second up-sampling layer is used for performing double up-sampling on the output of the previous layer; the second cyclic convolution layer is used for performing convolution processing on the output of the coding part convolution layer A22 and performing cyclic convolution operation on the output of the second up-sampling layer;
a third upsampling layer is cascaded with the first cyclic convolution layer, and the third upsampling layer is used for performing double upsampling on the output of the previous layer; the first cyclic convolution layer is used for performing convolution processing on the output of the encoding-part convolutional layer A12 and performing a cyclic convolution operation with the output of the third upsampling layer.
Optionally, the training of the static image segmentation flow network by the image that has been subjected to static image segmentation includes:
selecting sample pictures in an ECSSD image segmentation data set, an MSRA 10K image segmentation data set and a PASCAL VOC 2012 image segmentation data set;
expanding the sample pictures to the order of 10^4 by random cropping, mirroring, flipping, scaling and affine transformation;
fixing the decoding part, and training the coding part by using 60% of data until the coding part converges;
training the static image segmentation flow network using 100% of the training data, wherein the encoding part is initialized with the weights obtained at convergence and the decoding part is randomly initialized. A sketch of this two-stage training schedule is given below.
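For illustration, a hedged sketch of this two-stage training schedule follows (in PyTorch). The existence of a decoder attribute on the stream model, the dataset interface, the optimizer and its hyper-parameters, and the way the 60% subset is drawn are all assumptions of this sketch, not details specified above.

import torch
from torch.utils.data import DataLoader, Subset

def train_stream(model, dataset, criterion, epochs_stage1=10, epochs_stage2=20):
    # Stage 1: fix (freeze) the decoding part and train only the encoding part
    # on 60% of the data (fixed epoch counts stand in for a convergence test here).
    for p in model.decoder.parameters():
        p.requires_grad = False
    subset = Subset(dataset, range(int(0.6 * len(dataset))))
    opt = torch.optim.SGD((p for p in model.parameters() if p.requires_grad),
                          lr=1e-3, momentum=0.9)
    for _ in range(epochs_stage1):
        for image, mask in DataLoader(subset, batch_size=8, shuffle=True):
            opt.zero_grad()
            criterion(model(image), mask).backward()
            opt.step()

    # Stage 2: train the whole network on 100% of the data; the encoder keeps
    # the weights it converged to, the decoder keeps its random initialization.
    for p in model.decoder.parameters():
        p.requires_grad = True
    opt = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)
    for _ in range(epochs_stage2):
        for image, mask in DataLoader(dataset, batch_size=8, shuffle=True):
            opt.zero_grad()
            criterion(model(image), mask).backward()
            opt.step()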
Optionally, the inter-frame information segmentation flow network includes:
the full convolutional network of the encoding part comprises: five cascaded generalized convolutional layers and an expanded convolutional layer located at the sixth layer, wherein the expanded convolutional layer at the sixth layer comprises four types of expansion with different scales, each type forms an output path, and the average value of the output results of the four output paths is the output result of the encoding part;
five generalized convolutional layers in the full convolutional network of the coding part comprise a first generalized convolutional layer, a second generalized convolutional layer, a third generalized convolutional layer, a fourth generalized convolutional layer and a fifth generalized convolutional layer which are cascaded, wherein:
the first generalized convolutional layer sequentially comprises: convolutional layer B11, active layer, convolutional layer B12, active layer, pooling layer;
the second generalized convolutional layer sequentially includes: convolutional layer B21, active layer, convolutional layer B22, active layer, pooling layer;
the third generalized convolutional layer sequentially includes: convolutional layer B31, active layer, convolutional layer B32, active layer, convolutional layer B33, active layer, pooling layer;
the fourth generalized convolutional layer sequentially includes: convolutional layer B41, active layer, convolutional layer B42, active layer, convolutional layer B43, active layer, pooling layer;
the fifth generalized convolutional layer sequentially comprises: convolutional layer B51, active layer, convolutional layer B52, active layer, convolutional layer B53, active layer, pooling layer;
the expanded convolutional layer cascaded with the fifth generalized convolutional layer in the full convolutional network of the coding part comprises: four types of expanded convolutional layers in parallel, wherein:
the first type of expanded convolutional layer sequentially comprises: a first scale expansion convolutional layer, an active layer, a random discard layer, a convolutional layer;
the second type of expanded convolutional layer sequentially comprises: a second scale expansion convolutional layer, an active layer, a random discard layer, a convolutional layer;
the third type of expansion convolution layer comprises in sequence: a third scale expansion convolutional layer, an active layer, a random discard layer, a convolutional layer;
the fourth type of expansion convolution layer comprises in sequence: a fourth scale expansion convolutional layer, an active layer, a random discard layer, a convolutional layer;
the full convolutional network of the decoding part is: a full convolutional network consisting of three cyclic convolutional layers and three upsampling layers; the full convolutional network of the decoding part is used for outputting a picture segmentation result consistent with the resolution of an input picture; wherein:
in a full convolutional network of a decoding part, each upsampling layer is cascaded with a corresponding cyclic convolutional layer, wherein:
a first upsampling layer is cascaded with a third cyclic convolution layer, and the first upsampling layer is used for performing double upsampling on the output of the previous layer; the third cyclic convolution layer is used for performing convolution processing on the output of the coding part convolution layer B33 and performing cyclic convolution operation on the output of the first up-sampling layer;
the second up-sampling layer is cascaded with the second circulating convolution layer, and the second up-sampling layer is used for performing double up-sampling on the output of the previous layer; the second cyclic convolution layer is used for performing convolution processing on the output of the coding part convolution layer B22 and performing cyclic convolution operation on the output of the second up-sampling layer;
a third upsampling layer is cascaded with the first cyclic convolution layer, and the third upsampling layer is used for performing double upsampling on the output of the previous layer; the first cyclic convolution layer is used to convolve the output of the encoded partial convolution layer B12 with the output of the third upsampling layer.
Optionally, training the interframe information segmentation stream network with video that has been subjected to interframe information segmentation includes:
collecting the training video set VID for video object detection in ILSVRC2015, wherein all videos in the training video set VID have complete object detection marker frames (bounding boxes);
performing image segmentation on each frame of the video set VID by using a static image segmentation flow network obtained by training to obtain a foreground and background segmentation result;
calculating the optical flow field between adjacent video frames and storing the optical flow field information corresponding to each video frame as an RGB (red, green, blue) image;
screening out an image segmentation result with correct segmentation as an initial training image of the interframe information segmentation flow network according to a preset screening strategy and by combining a mark frame in the training video set VID; wherein the screening strategy satisfies the following conditions:
first: the image segmentation result of each video frame occupies 75% to 90% of the area of the object detection marker frame;
second: the average optical flow magnitude computed from the RGB image of the optical flow field is between 5 and 100;
expanding the initial training images to the order of 10^4 by random cropping, mirroring, flipping, scaling and affine transformation;
fixing the decoding part, and training the coding part by using 60% of data until the coding part converges;
training the interframe information segmentation flow network using 100% of the training data, wherein the encoding part is initialized with the weights obtained at convergence and the decoding part is randomly initialized.
Optionally, the fusion network includes: a connection layer, a convolution layer, an activation layer, a convolution layer, and an activation layer; wherein:
the connection layer is used for connecting the static image segmentation stream network and the interframe information segmentation stream network, and fusing output results of the static image segmentation stream network and the interframe information segmentation stream network through a convolution layer, an activation layer, a convolution layer and an activation layer to obtain a final video segmentation result.
Optionally, the static image segmentation flow network and the interframe information segmentation flow network perform real-time update of network parameters in a training process.
Compared with the prior art, the invention has the following beneficial effects:
the unsupervised video segmentation method based on deep learning provided by the invention comprises the steps of constructing a double-flow video segmentation network comprising a static image segmentation flow network and an interframe information segmentation flow network, wherein the static image segmentation flow network is used for high-quality intraframe segmentation, the interframe information segmentation flow network is used for high-quality optical flow field information segmentation, and two paths of output obtain an improved segmentation result through final fusion operation; when the problems that the traditional methods such as shielding, slow movement and the like cannot completely solve exist, the method can still obtain a better segmentation result according to effective double-path output and fusion operation.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:
FIG. 1 is a schematic diagram of an unsupervised video segmentation method based on deep learning according to the present invention;
FIG. 2 is a schematic diagram of a cyclic convolution layer in a decoding network used in the present invention;
FIG. 3 is a schematic diagram illustrating the effect of the screening strategy for generating a data set required by the interframe information segmentation flow network training according to the present invention;
FIG. 4 is a graph comparing results of embodiments of the present invention with the currently best unsupervised and semi-supervised methods, where the Fast Object Segmentation in Unconstrained Video (FST) method and the Object Flow (OFL) video segmentation method are, respectively, the best unsupervised and semi-supervised methods at present.
Detailed Description
The present invention will be described in detail below with reference to specific examples. The following examples will help those skilled in the art to further understand the invention, but do not limit the invention in any way. It should be noted that a person skilled in the art can make various changes and modifications without departing from the spirit of the invention, all of which fall within the scope of the present invention.
As shown in FIG. 1, the present embodiment provides an unsupervised video segmentation method based on deep learning. The details of the embodiment are as follows; for parts not described in detail, reference may be made to the Summary of the Invention.
Firstly, two networks are built: a static image segmentation flow network and an interframe information segmentation flow network. The two networks have the same structure and are based on an encoding-decoding architecture. The encoding part is a full convolutional network comprising five generalized convolutional layers (the first three each contain convolutional layers, activation layers and a pooling layer, while the last two contain no pooling layer) and a final expanded (dilated) convolutional layer; this last layer is divided into four types of expansion with different scales, each type forming a path, and the output of the encoding part is the average of the four paths' outputs. The decoding part is also a full convolutional network; it is connected after the encoding part and comprises three cyclic convolutional layers and three upsampling layers. The final output of each stream has the same size as its input. The details of the encoding and decoding parts are as follows:
The specific structure of the encoding part is as follows (generalized convolutional layers 1 to 5 listed below are cascaded; the four types of layer 6 are parallel to one another, and layer 6 as a whole is cascaded with layer 5):
Generalized convolutional layer 1: convolutional layer 1-1 + activation layer + convolutional layer 1-2 + activation layer + pooling layer;
Generalized convolutional layer 2: convolutional layer 2-1 + activation layer + convolutional layer 2-2 + activation layer + pooling layer;
Generalized convolutional layer 3: convolutional layer 3-1 + activation layer + convolutional layer 3-2 + activation layer + convolutional layer 3-3 + activation layer + pooling layer;
Generalized convolutional layer 4: convolutional layer 4-1 + activation layer + convolutional layer 4-2 + activation layer + convolutional layer 4-3 + activation layer + pooling layer;
Generalized convolutional layer 5: convolutional layer 5-1 + activation layer + convolutional layer 5-2 + activation layer + convolutional layer 5-3 + activation layer + pooling layer;
"expanded" convolutional layer 6-1: "expand" convolutional layer (displacement ═ 6) + active layer + random discard layer (dropout) + convolutional layer;
"expanded" convolutional layer 6-2: "expand" convolutional layer (displacement ═ 12) + active layer + random discard layer (dropout) + convolutional layer;
"expanded" convolutional layer 6-3: "expand" convolutional layer (displacement ═ 18) + active layer + random discard layer (dropout) + convolutional layer;
"expanded" convolutional layer 6-4: "expand" convolutional layer (displacement ═ 24) + active layer + random discard layer (dropout) + convolutional layer;
the specific structure of the decoding part is as follows: (the upsampling layer + cyclic convolution layers 3 to 1 listed below are all cascade operations)
Upsampling layer + cyclic convolution layer 3: the upsampling layer performs 2× upsampling of the output of the previous layer; cyclic convolution layer 3 convolves the output of encoding-part convolutional layer 3-3 and performs a cyclic convolution operation with the output of the upsampling layer.
Upsampling layer + cyclic convolution layer 2: the upsampling layer performs 2× upsampling of the output of the previous layer; cyclic convolution layer 2 convolves the output of encoding-part convolutional layer 2-2 and performs a cyclic convolution operation with the output of the upsampling layer.
Upsampling layer + cyclic convolution layer 1: the upsampling layer performs 2× upsampling of the output of the previous layer; cyclic convolution layer 1 convolves the output of encoding-part convolutional layer 1-2 and performs a cyclic convolution operation with the output of the upsampling layer.
In this embodiment, "+" indicates a cascade relationship, subscripts 1-1 indicate the first-layer convolutional layer of the generalized convolutional layer 1, and subscripts 1-2 indicate the second-layer convolutional layer of the generalized convolutional layer 1; the subscript i-j represents the jth convolutional layer of the generalized convolutional layer i, wherein i is 1-5 and j is 1-3. Subscript 6-1 represents the first type of "expanded" convolutional layer of the "expanded" convolutional layer, subscript 6-2 represents the second type of "expanded" convolutional layer of the "expanded" convolutional layer, subscript 6-3 represents the third type of "expanded" convolutional layer of the "expanded" convolutional layer, and subscript 6-4 represents the fourth type of "expanded" convolutional layer of the "expanded" convolutional layer.
The details of the cyclic convolutional layer are shown in FIG. 2. It can be viewed as a convolutional layer with recurrent links added along the time dimension, which has the advantage that, as the number of recurrent steps increases, each convolutional layer enlarges its local receptive field on the input without adding parameters, so that local details can be captured and fused. As shown in FIG. 2, the number of recurrent steps is set to 3 in the present invention, which balances computational efficiency against the hardware cost of training. A minimal sketch of such a cyclic convolutional layer is given below.
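The sketch below (PyTorch) illustrates one plausible reading of this cyclic convolutional layer: the encoder skip feature and the upsampled decoder feature are first convolved and combined, and a shared recurrent convolution is then applied for three steps. The combination by addition and the channel sizes are assumptions of the sketch.

import torch.nn as nn

class CyclicConvLayer(nn.Module):
    def __init__(self, skip_ch, dec_ch, out_ch, steps=3):
        super().__init__()
        self.steps = steps                                          # 3 recurrent steps, as in FIG. 2
        self.skip_conv = nn.Conv2d(skip_ch, out_ch, 3, padding=1)   # convolves the encoder skip feature
        self.feed_conv = nn.Conv2d(dec_ch, out_ch, 3, padding=1)    # convolves the upsampled decoder feature
        self.recurrent = nn.Conv2d(out_ch, out_ch, 3, padding=1)    # shared across steps: no extra parameters
        self.relu = nn.ReLU(inplace=True)

    def forward(self, skip_feat, upsampled_feat):
        x = self.relu(self.skip_conv(skip_feat) + self.feed_conv(upsampled_feat))
        state = x
        for _ in range(self.steps):          # recurrent links enlarge the local receptive field
            state = self.relu(x + self.recurrent(state))
        return state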
for still image split stream networks: three authoritative image segmentation data sets (including ECSSD, MSRA 10K and PASCAL VOC 2012) which are disclosed currently are selected, collected to obtain 21582 pictures, and the data sets are expanded to 10 degrees through operations of random cutting, mirroring, turning, zooming in and out, affine transformation and the like4And the magnitude reduces the overfitting possibly occurring in the training process. When training the network, firstly fixing the decoding part, and using 60% of data to train the coding part; and after the coding part converges, training the whole network by using 100% of training data, wherein the coding part is initialized by using the weight value before convergence, and the decoding part is initialized randomly.
For an interframe information split stream network: there is currently no large-scale video segmentation dataset disclosed, so we have to make the training dataset manually. The training video sets VID for video object detection in the ILSVRC2015 are collected first, and all the video sets have complete object detection marker boxes to accurately represent the positions of the objects. And then carrying out image segmentation on each frame of the video set by using the static image segmentation flow network obtained by training to obtain a foreground and background segmentation result. And then calculating the optical flow field between each video frame and storing the optical flow field information corresponding to each video frame into an RGB (red, green and blue) graph. And finally, screening the frames meeting the conditions and the segmentation results thereof by using a set of screening strategies and combining the existing marking frames of the video detection, and summarizing the frames to be used as training data of the training interframe information segmentation flow network.
The screening strategy includes two points. 1) Reliable segmentation result: the image segmentation result of each video frame must occupy between 75% and 90% of the object detection marker frame. 2) Reliable optical flow field information: the computed RGB image of the optical flow field must have an average optical flow magnitude between 5 and 100, since very slow or very fast motion yields very inaccurate optical flow field information. A sketch of these two screening conditions is given below.
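For illustration, the two conditions can be expressed as in the following sketch; how the segmentation mask is compared against the detection box and how the flow magnitude is obtained from the stored flow are assumptions (here the mask is a binary array and the flow is given directly as a two-channel displacement field).

import numpy as np

def reliable_segmentation(mask, box):
    # mask: H x W binary array; box: (x1, y1, x2, y2) object detection box in pixels.
    x1, y1, x2, y2 = box
    box_area = max((x2 - x1) * (y2 - y1), 1)
    coverage = mask[y1:y2, x1:x2].sum() / box_area
    return 0.75 <= coverage <= 0.90          # condition 1: mask covers 75%-90% of the box

def reliable_flow(flow):
    # flow: H x W x 2 array of per-pixel (dx, dy) displacements.
    magnitude = np.linalg.norm(flow, axis=2)
    return 5 <= magnitude.mean() <= 100      # condition 2: average flow magnitude in [5, 100]

def keep_frame(mask, box, flow):
    return reliable_segmentation(mask, box) and reliable_flow(flow)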
Screening yields 24960 training samples (FIG. 3 shows some of the cases and processing that occur during screening), and the data set is expanded to the order of 10^4 through random cropping, mirroring, flipping, scaling, affine transformation and similar operations, which reduces the overfitting that may occur during training. When training the network, the decoding part is first fixed and the encoding part is trained with 60% of the data; after the encoding part converges, the whole network is trained with 100% of the training data, where the encoding part is initialized with the weights from its earlier convergence and the decoding part is randomly initialized.
After the two stream networks are trained, the final part, the fusion network, is built. This network includes a connection layer and two generalized convolutional layers (each consisting of a convolutional layer and an activation layer); its specific structure is: connection layer, convolutional layer, activation layer, convolutional layer, activation layer. The connection layer directly connects the static image segmentation flow network and the interframe information segmentation flow network, and the fused two-stream output is used as the final segmentation result. The three networks together form a complete video segmentation network. A sketch of the fusion network is given below.
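A minimal sketch of this fusion network follows (PyTorch); the hidden channel count and kernel sizes are illustrative assumptions.

import torch
import torch.nn as nn

class FusionNetwork(nn.Module):
    # connection (concatenation) layer + convolution + activation + convolution + activation
    def __init__(self, hidden_ch=16):
        super().__init__()
        self.conv1 = nn.Conv2d(2, hidden_ch, 3, padding=1)  # input: two concatenated single-channel masks
        self.conv2 = nn.Conv2d(hidden_ch, 1, 3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, appearance_mask, motion_mask):
        x = torch.cat([appearance_mask, motion_mask], dim=1)  # connection layer
        x = self.relu(self.conv1(x))
        return self.relu(self.conv2(x))                       # final segmentation result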
Finally, the fusion network is trained with a fully annotated video segmentation data set. Training is performed on the whole formed by the trained static image segmentation flow network, the trained interframe information segmentation flow network and the fusion network to be trained. During training, the parameters of the static image segmentation flow network and the interframe information segmentation flow network are fixed and not updated; part of the training set of the fully annotated video segmentation data set DAVIS is selected to update the parameters of the fusion network until training converges.
At this point, the deep neural network required by the proposed unsupervised video segmentation method is ready. At test time the network can be used directly, without any post-processing. The test flow is as follows: first, the optical flow field between video frames is computed and processed into the optical flow field RGB image corresponding to each frame; then, each video frame and its corresponding optical flow field RGB image are fed synchronously into the trained static image segmentation flow network and the trained interframe information segmentation flow network; finally, the output of the fusion network is the final segmentation result. A sketch of this test-time pipeline is given below.
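For illustration, a hedged sketch of this test-time pipeline follows. OpenCV's Farnebäck optical flow and the HSV flow-to-RGB conversion are stand-ins chosen for the sketch (the description above does not name a specific flow algorithm), and the model is assumed to take a frame and a flow image and return a probability map.

import cv2
import numpy as np
import torch

def flow_to_rgb(flow):
    # Standard HSV encoding of a flow field (hue = direction, value = magnitude).
    mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    hsv = np.zeros((flow.shape[0], flow.shape[1], 3), dtype=np.uint8)
    hsv[..., 0] = ang * 180 / np.pi / 2
    hsv[..., 1] = 255
    hsv[..., 2] = cv2.normalize(mag, None, 0, 255, cv2.NORM_MINMAX)
    return cv2.cvtColor(hsv, cv2.COLOR_HSV2RGB)

def segment_video(frames, model):
    # frames: list of H x W x 3 uint8 RGB frames; model: trained dual-stream network.
    to_tensor = lambda a: torch.from_numpy(a).permute(2, 0, 1).float().unsqueeze(0) / 255
    masks = []
    for t in range(len(frames) - 1):
        prev_gray = cv2.cvtColor(frames[t], cv2.COLOR_RGB2GRAY)
        next_gray = cv2.cvtColor(frames[t + 1], cv2.COLOR_RGB2GRAY)
        flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        flow_rgb = flow_to_rgb(flow)
        with torch.no_grad():
            prob = model(to_tensor(frames[t]), to_tensor(flow_rgb))  # fused output, assumed in [0, 1]
        masks.append((prob.squeeze().numpy() > 0.5).astype(np.uint8))
    return masks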
To demonstrate the advancement of the present invention, the method of the invention is compared with currently representative unsupervised and semi-supervised methods. Most video segmentation methods are currently evaluated with the Intersection over Union (IoU) metric, defined as follows:
IoU = 100 × |S ∩ G| / |S ∪ G|
wherein: s is a segmentation result obtained by each algorithm, and G is a corresponding standard segmentation result. IoU being larger indicates better segmentation results.
TABLE 1
(Table 1 appears as an image in the original publication; it lists the IoU results of the method of the invention and the comparison methods on the DAVIS and SegTrack v2 data sets.)
Table 1 compares the IoU results of the present method and other methods on the DAVIS and SegTrack v2 data sets. The DAVIS data set is currently the most authoritative data set; its pictures are 480p and 1080p, it covers many object categories, and its annotations are clean. The objects in the SegTrack v2 data set are small and the video resolution is low. As seen from the table, on the DAVIS data set the method of the invention outperforms all unsupervised and semi-supervised methods, improving on the best unsupervised method FST by 14% and on the best semi-supervised method by nearly two points. It should be noted that semi-supervised methods generally require a long processing time because they need accurate annotation of the first frame or the first few frames; for example, the OFL method needs approximately 2 minutes to process a 480p picture, while the present method needs only 0.2 seconds. On the SegTrack v2 data set the method of the invention is slightly worse than the OFL method, for the following possible reasons: (1) the video resolution is low and the objects are small, which is unfavourable for a deep learning method trying to capture detailed information; (2) OFL is a parameterized method that is optimized for every video in the experiment to obtain its best result, whereas the method of the invention is not optimized for a specific domain and uses the same pre-trained network on all experimental videos. FIG. 4 is a visual comparison of the segmentation results of the method of the invention, the FST method and the OFL method; it can be seen that the method of the invention preserves details best and has the highest segmentation accuracy.
The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes or modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention. The embodiments and features of the embodiments of the present application may be combined with each other arbitrarily without conflict.

Claims (10)

1. An unsupervised video segmentation method based on deep learning is characterized by comprising the following steps:
establishing a codec deep neural network, the codec deep neural network comprising: a static image segmentation flow network, an interframe information segmentation flow network and a fusion network; the static image segmentation flow network is used for performing foreground and background segmentation processing on a current video frame, and the inter-frame information segmentation flow network is used for performing foreground and background segmentation on optical flow field information between the current video frame and a next video frame;
and fusing the segmentation images output by the static image segmentation flow network and the interframe information segmentation flow network through the fusion network to obtain a video segmentation result.
2. The method of claim 1, wherein the building a codec deep neural network comprises:
establishing a static image segmentation flow network, and training the static image segmentation flow network through an image subjected to static image segmentation;
establishing an interframe information segmentation flow network, and training the interframe information segmentation flow network through a video subjected to interframe information segmentation;
and training the coding and decoding deep neural network by using the fully labeled video segmentation data.
3. The unsupervised video segmentation method based on deep learning of claim 2 wherein the static image segmentation stream network comprises: an encoding part and a decoding part which are formed by a full convolution network, wherein,
the full convolutional network of the encoding part comprises: five cascaded generalized convolutional layers and an expanded convolutional layer located at the sixth layer, wherein the expanded convolutional layer at the sixth layer comprises four types of expansion with different scales, each type forms an output path, and the average value of the output results of the four output paths is the output result of the encoding part;
the full convolutional network of the decoding part is: a full convolutional network consisting of three cyclic convolutional layers and three upsampling layers; and the full convolutional network of the decoding part is used for outputting a picture segmentation result consistent with the resolution of the input picture.
4. The unsupervised video segmentation method based on deep learning of claim 3 wherein five generalized convolutional layers in the full convolutional network of the coding portion comprise a first generalized convolutional layer, a second generalized convolutional layer, a third generalized convolutional layer, a fourth generalized convolutional layer, a fifth generalized convolutional layer which are cascaded, wherein:
the first generalized convolutional layer sequentially comprises: convolutional layer A11, active layer, convolutional layer A12, active layer, pooling layer;
the second generalized convolutional layer sequentially includes: convolutional layer A21, active layer, convolutional layer A22, active layer, pooling layer;
the third generalized convolutional layer sequentially includes: convolutional layer A31, active layer, convolutional layer A32, active layer, convolutional layer A33, active layer, pooling layer;
the fourth generalized convolutional layer sequentially includes: convolutional layer A41, active layer, convolutional layer A42, active layer, convolutional layer A43, active layer, pooling layer;
the fifth generalized convolutional layer sequentially comprises: convolutional layer A51, active layer, convolutional layer A52, active layer, convolutional layer A53, active layer, pooling layer;
the expanded convolutional layer cascaded with the fifth generalized convolutional layer in the full convolutional network of the coding part comprises: four types of expanded convolutional layers in parallel, wherein:
the first type of expanded convolutional layer sequentially comprises: a first scale expansion convolutional layer, an active layer, a random discard layer, a convolutional layer;
the second type of expanded convolutional layer sequentially comprises: a second scale expansion convolutional layer, an active layer, a random discard layer, a convolutional layer;
the third type of expansion convolution layer comprises in sequence: a third scale expansion convolutional layer, an active layer, a random discard layer, a convolutional layer;
the fourth type of expansion convolution layer comprises in sequence: a fourth scale expansion convolutional layer, an active layer, a random discard layer, a convolutional layer.
5. The method of claim 4, wherein each upsampled layer is cascaded with a corresponding cyclic convolutional layer in a full convolutional network of the decoding part, wherein:
a first upsampling layer is cascaded with a third cyclic convolution layer, and the first upsampling layer is used for performing double upsampling on the output of the previous layer; the third cyclic convolution layer is used for performing convolution processing on the output of the coding part convolution layer A33 and performing cyclic convolution operation on the output of the first up-sampling layer;
the second up-sampling layer is cascaded with the second circulating convolution layer, and the second up-sampling layer is used for performing double up-sampling on the output of the previous layer; the second cyclic convolution layer is used for performing convolution processing on the output of the coding part convolution layer A22 and performing cyclic convolution operation on the output of the second up-sampling layer;
a third upsampling layer is cascaded with the first cyclic convolution layer, and the third upsampling layer is used for performing double upsampling on the output of the previous layer; the first cyclic convolution layer is used for performing convolution processing on the output of the encoding-part convolutional layer A12 and performing a cyclic convolution operation with the output of the third upsampling layer.
6. The unsupervised video segmentation method based on deep learning of claim 3 wherein the training of the still image segmentation flow network by images that have been still image segmented comprises:
selecting sample pictures in an ECSSD image segmentation data set, an MSRA 10K image segmentation data set and a PASCAL VOC 2012 image segmentation data set;
expanding the sample pictures to the order of 10^4 by random cropping, mirroring, flipping, scaling and affine transformation, and using the expanded pictures as training data;
fixing the decoding part, and training the coding part by using 60% of training data until the coding part converges;
training the static image segmentation flow network by using 100% of data; the encoding part uses the weight value in convergence to carry out initialization, and the decoding part carries out random initialization.
7. The method of claim 2, wherein the inter-frame information segmentation stream network comprises: the encoding part and the decoding part are cascaded with each other and formed by a full convolution network; wherein:
the full convolutional network of the encoding part comprises: five cascaded generalized convolutional layers and an expanded convolutional layer located at the sixth layer, wherein the expanded convolutional layer at the sixth layer comprises four types of expansion with different scales, each type forms an output path, and the average value of the output results of the four output paths is the output result of the encoding part;
five generalized convolutional layers in the full convolutional network of the coding part comprise a first generalized convolutional layer, a second generalized convolutional layer, a third generalized convolutional layer, a fourth generalized convolutional layer and a fifth generalized convolutional layer which are cascaded, wherein:
the first generalized convolutional layer sequentially comprises: convolutional layer B11, active layer, convolutional layer B12, active layer, pooling layer;
the second generalized convolutional layer sequentially includes: convolutional layer B21, active layer, convolutional layer B22, active layer, pooling layer;
the third generalized convolutional layer sequentially includes: convolutional layer B31, active layer, convolutional layer B32, active layer, convolutional layer B33, active layer, pooling layer;
the fourth generalized convolutional layer sequentially includes: convolutional layer B41, active layer, convolutional layer B42, active layer, convolutional layer B43, active layer, pooling layer;
the fifth generalized convolutional layer sequentially comprises: convolutional layer B51, active layer, convolutional layer B52, active layer, convolutional layer B53, active layer, pooling layer;
the expanded convolutional layer cascaded with the fifth generalized convolutional layer in the full convolutional network of the coding part comprises: four types of expanded convolutional layers in parallel, wherein:
the first type of expanded convolutional layer sequentially comprises: a first scale expansion convolutional layer, an active layer, a random discard layer, a convolutional layer;
the second type of expanded convolutional layer sequentially comprises: a second scale expansion convolutional layer, an active layer, a random discard layer, a convolutional layer;
the third type of expansion convolution layer comprises in sequence: a third scale expansion convolutional layer, an active layer, a random discard layer, a convolutional layer;
the fourth type of expansion convolution layer comprises in sequence: a fourth scale expansion convolutional layer, an active layer, a random discard layer, a convolutional layer;
the full convolutional network of the decoding part is: a full convolutional network consisting of three cyclic convolutional layers and three upsampling layers; the full convolutional network of the decoding part is used for outputting a picture segmentation result consistent with the resolution of an input picture; wherein:
in a full convolutional network of a decoding part, each upsampling layer is cascaded with a corresponding cyclic convolutional layer, wherein:
a first upsampling layer is cascaded with a third cyclic convolution layer, and the first upsampling layer is used for performing double upsampling on the output of the previous layer; the third cyclic convolution layer is used for performing convolution processing on the output of the coding part convolution layer B33 and performing cyclic convolution operation on the output of the first up-sampling layer;
the second up-sampling layer is cascaded with the second circulating convolution layer, and the second up-sampling layer is used for performing double up-sampling on the output of the previous layer; the second cyclic convolution layer is used for performing convolution processing on the output of the coding part convolution layer B22 and performing cyclic convolution operation on the output of the second up-sampling layer;
a third upsampling layer is cascaded with the first cyclic convolution layer, and the third upsampling layer is used for performing double upsampling on the output of the previous layer; the first cyclic convolution layer is used to convolve the output of the encoded partial convolution layer B12 with the output of the third upsampling layer.
8. The unsupervised video segmentation method based on deep learning of claim 7, wherein the training of the interframe information segmentation streaming network by the video that has been subjected to interframe information segmentation comprises:
collecting the training video set VID for video object detection in ILSVRC2015, wherein all videos in the training video set VID have complete object detection marker frames (bounding boxes);
performing image segmentation on each frame of the video set VID by using a static image segmentation flow network obtained by training to obtain a foreground and background segmentation result;
calculating the optical flow field between adjacent video frames and storing the optical flow field information corresponding to each video frame as an RGB (red, green, blue) image;
screening out an image segmentation result with correct segmentation as an initial training image of the interframe information segmentation flow network according to a preset screening strategy and by combining a mark frame in the training video set VID; wherein the screening strategy satisfies the following conditions:
first: the image segmentation result of each video frame occupies 75% to 90% of the area of the object detection marker frame;
second: the average optical flow magnitude computed from the RGB image of the optical flow field is between 5 and 100;
expanding the initial training images to the order of 10^4 by random cropping, mirroring, flipping, scaling and affine transformation, and using the expanded images as training data;
fixing the decoding part, and training the coding part by using 60% of training data until the coding part converges;
training the interframe information segmentation flow network by using 100% of training data; the encoding part uses the weight value in convergence to carry out initialization, and the decoding part carries out random initialization.
9. The method of claim 1, wherein the fusion network comprises: a connection layer, a convolution layer, an activation layer, a convolution layer, and an activation layer; wherein:
the connection layer is used for connecting the static image segmentation stream network and the interframe information segmentation stream network, and fusing output results of the static image segmentation stream network and the interframe information segmentation stream network through a convolution layer, an activation layer, a convolution layer and an activation layer to obtain a final video segmentation result.
10. The unsupervised video segmentation method based on deep learning of any of claims 2-9, wherein the static image segmentation stream network and the interframe information segmentation stream network perform real-time updating of network parameters during training.
CN201711004135.5A 2017-10-24 2017-10-24 Unsupervised video segmentation method based on deep learning Active CN107808389B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711004135.5A CN107808389B (en) 2017-10-24 2017-10-24 Unsupervised video segmentation method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711004135.5A CN107808389B (en) 2017-10-24 2017-10-24 Unsupervised video segmentation method based on deep learning

Publications (2)

Publication Number Publication Date
CN107808389A CN107808389A (en) 2018-03-16
CN107808389B true CN107808389B (en) 2020-04-17

Family

ID=61585461

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711004135.5A Active CN107808389B (en) 2017-10-24 2017-10-24 Unsupervised video segmentation method based on deep learning

Country Status (1)

Country Link
CN (1) CN107808389B (en)

Families Citing this family (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108876792B (en) * 2018-04-13 2020-11-10 北京迈格威科技有限公司 Semantic segmentation method, device and system and storage medium
CN108712630A (en) * 2018-04-19 2018-10-26 安凯(广州)微电子技术有限公司 A kind of internet camera system and its implementation based on deep learning
CN108734211B (en) 2018-05-17 2019-12-24 腾讯科技(深圳)有限公司 Image processing method and device
CN108805898B (en) * 2018-05-31 2020-10-16 北京字节跳动网络技术有限公司 Video image processing method and device
CN110555805B (en) * 2018-05-31 2022-05-31 杭州海康威视数字技术股份有限公司 Image processing method, device, equipment and storage medium
CN109118490B (en) * 2018-06-28 2021-02-26 厦门美图之家科技有限公司 Image segmentation network generation method and image segmentation method
CN109034162B (en) * 2018-07-13 2022-07-26 南京邮电大学 Image semantic segmentation method
CN109086807B (en) * 2018-07-16 2022-03-18 哈尔滨工程大学 Semi-supervised optical flow learning method based on void convolution stacking network
CN109068174B (en) * 2018-09-12 2019-12-27 上海交通大学 Video frame rate up-conversion method and system based on cyclic convolution neural network
CN109785327A (en) * 2019-01-18 2019-05-21 中山大学 Video moving object segmentation method fusing appearance information and motion information
CN109961095B (en) * 2019-03-15 2023-04-28 深圳大学 Image labeling system and method based on unsupervised deep learning
CN110147763B (en) * 2019-05-20 2023-02-24 哈尔滨工业大学 Video semantic segmentation method based on convolutional neural network
CN110246142A (en) * 2019-06-14 2019-09-17 深圳前海达闼云端智能科技有限公司 Method, terminal and readable storage medium for detecting obstacles
US10762629B1 (en) 2019-11-14 2020-09-01 SegAI LLC Segmenting medical images
US11423544B1 (en) 2019-11-14 2022-08-23 Seg AI LLC Segmenting medical images
CN111260679B (en) * 2020-01-07 2022-02-01 广州虎牙科技有限公司 Image processing method, image segmentation model training method and related device
CN111275518B (en) * 2020-01-15 2023-04-21 中山大学 Video virtual fitting method and device based on mixed optical flow
CN112016406B (en) * 2020-08-07 2022-12-02 青岛科技大学 Video key frame extraction method based on full convolution network
CN112784750B (en) * 2021-01-22 2022-08-09 清华大学 Fast video object segmentation method and device based on pixel and region feature matching
CN113469146B (en) * 2021-09-02 2021-12-14 深圳市海清视讯科技有限公司 Target detection method and device
CN114358144B (en) * 2021-12-16 2023-09-26 西南交通大学 Image segmentation quality assessment method

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1532812A2 (en) * 2002-04-26 2005-05-25 The Trustees Of Columbia University In The City Of New York Method and system for optimal video transcoding based on utility function descriptors
CN106204597A (en) * 2016-07-13 2016-12-07 西北工业大学 A video segmentation (VS) method based on self-paced weakly supervised learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Learning to Segment Moving Objects in Videos; Katerina Fragkiadaki et al.; CVPR 2015; 2015-12-31; pp. 4083-4090 *

Also Published As

Publication number Publication date
CN107808389A (en) 2018-03-16

Similar Documents

Publication Publication Date Title
CN107808389B (en) Unsupervised video segmentation method based on deep learning
CN110097568B (en) Video object detection and segmentation method based on space-time dual-branch network
WO2023056889A1 (en) Model training and scene recognition method and apparatus, device, and medium
CN109299274B (en) Natural scene text detection method based on full convolution neural network
CN107679462B (en) Depth multi-feature fusion classification method based on wavelets
CN112884064A (en) Target detection and identification method based on neural network
CN107862376A A human body image action recognition method based on a two-stream neural network
CN112396607A (en) Streetscape image semantic segmentation method for deformable convolution fusion enhancement
CN111696110B (en) Scene segmentation method and system
Li et al. Learning face image super-resolution through facial semantic attribute transformation and self-attentive structure enhancement
CN113536972B (en) Self-supervision cross-domain crowd counting method based on target domain pseudo label
CN112101262B (en) Multi-feature fusion sign language recognition method and network model
Niu et al. Effective image restoration for semantic segmentation
CN115359370B (en) Remote sensing image cloud detection method and device, computer device and storage medium
Wang et al. Semantic segmentation of high-resolution images
Jiang et al. Mirror complementary transformer network for RGB‐thermal salient object detection
CN111462132A (en) Video object segmentation method and system based on deep learning
CN109670506A (en) Scene Segmentation and system based on Kronecker convolution
Ren et al. A lightweight object detection network in low-light conditions based on depthwise separable pyramid network and attention mechanism on embedded platforms
Xu et al. Multi-scale dehazing network via high-frequency feature fusion
WO2024040973A1 (en) Multi-scale fused dehazing method based on stacked hourglass network
Wang et al. Uneven image dehazing by heterogeneous twin network
CN103632357B (en) A kind of image super-resolution Enhancement Method separated based on illumination
CN115439669A (en) Feature point detection network based on deep learning and cross-resolution image matching method
CN114419729A (en) Behavior identification method based on light-weight double-flow network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant