CN113065459A - Video instance segmentation method and system based on dynamic condition convolution - Google Patents

Video instance segmentation method and system based on dynamic conditional convolution

Info

Publication number
CN113065459A
CN113065459A
Authority
CN
China
Prior art keywords
video
convolution
network
label
instance segmentation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110347704.6A
Other languages
Chinese (zh)
Other versions
CN113065459B (en)
Inventor
郑元杰
隋晓丹
姜岩芸
刘弘
牛屹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Normal University
Original Assignee
Shandong Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Normal University filed Critical Shandong Normal University
Priority to CN202110347704.6A priority Critical patent/CN113065459B/en
Publication of CN113065459A publication Critical patent/CN113065459A/en
Application granted granted Critical
Publication of CN113065459B publication Critical patent/CN113065459B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/49 Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection

Abstract

The invention provides a video instance segmentation method and system based on dynamic conditional convolution. The method includes: inputting video data into a video instance segmentation model for training, wherein the model comprises a feature pyramid network, a dynamic conditional convolution network and a fully connected label generation network connected in sequence; the feature pyramid network extracts multi-scale features from the 0th frame image, which carries an instance label, and from the subsequent frame images of the video; the dynamic conditional convolution network generates dynamic conditional convolution kernels; the fully connected label generation network convolves the features of the subsequent frames of the video with these kernels to obtain the instance labels in those frames; constraining the video instance segmentation model with a loss function and outputting the trained model; and acquiring the instance label of the first frame of a video to be tested, inputting it together with the video into the trained video instance segmentation model, and outputting the corresponding instances in the subsequent frames of the video.

Description

Video instance segmentation method and system based on dynamic conditional convolution
Technical Field
The invention belongs to the technical field of video image processing, and relates to a video instance segmentation method and system based on dynamic conditional convolution.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
Video instance segmentation is a classic task in computer vision: a segmentation label of an object is given in the 0th frame, and the object is then continually localized in the subsequent frames based on that label.
Automatic video segmentation originated from motion segmentation, which analyzes the moving regions in a video. Under geometric constraints, point trajectories in particular are emphasized, and motion information is analyzed with clustering algorithms to obtain the final result. However, such methods simply treat the moving parts as the targets to be segmented, easily fragment the segmentation result, and cannot fully express object-level information. With the rise of deep learning, unsupervised and semi-supervised video object segmentation have received increasing attention; deep learning methods are used to segment foreground target objects, which may be static or moving and are the most salient and important objects in the video.
Video instance segmentation not only requires accurate pixel-level segmentation results but also prediction of the corresponding semantic category, so that the same object is assigned the same ID across different frames.
An early common approach, given the label of frame 0, exploits the correlation between frames, for example via optical flow: the label of frame 0 is propagated to the next frame, that frame's label to the following frame, and so on until the last frame. However, because the target object moves and may be partially occluded in the video, methods that rely purely on sequential propagation do not achieve ideal results.
Online learning methods divide video object segmentation into a training stage and a testing stage: (1) in the training stage, a video segmentation model is trained on a training set; (2) during testing, given a test video, data augmentation is performed on its 0th frame, and the model trained on the training set is further optimized on the augmented 0th-frame samples. This type of method is very time-consuming, since online learning on the 0th frame must be repeated for every test video.
In summary, the prior art does not yet provide a solution to the video instance segmentation problem that is both highly accurate and efficient.
Disclosure of Invention
In order to solve the technical problems in the background art, the invention provides a video instance segmentation method and system based on dynamic conditional convolution, which automatically predict the corresponding instances in subsequent frames from the instance label of the first frame of a video, fully automatically and without any interactive operation.
To achieve this purpose, the invention adopts the following technical scheme:
a first aspect of the invention provides a method for video instance segmentation based on dynamic conditional convolution.
A video instance segmentation method based on dynamic conditional convolution comprises the following steps:
inputting video data into a video instance segmentation model for training;
wherein the video instance segmentation model comprises a feature pyramid network, a dynamic conditional convolution network and a fully connected label generation network connected in sequence; the feature pyramid network extracts multi-scale features from the 0th frame image, which carries an instance label, and from the subsequent frame images of the video; the dynamic conditional convolution network extracts target instance features from the multi-scale features of the subsequent frame images and combines them with the labeled 0th frame image to generate dynamic conditional convolution kernels; the fully connected label generation network convolves the features of the subsequent frames of the video with the dynamic conditional convolution kernels to obtain the instance labels in those frames;
constraining the video instance segmentation model based on a loss function, and outputting the trained video instance segmentation model;
and acquiring the instance label of the first frame of the video to be tested, inputting it together with the video into the trained video instance segmentation model, and outputting the corresponding instances in the subsequent frames of the video.
A second aspect of the invention provides a system for video instance segmentation based on dynamic conditional convolution.
A system for video instance segmentation based on dynamic conditional convolution, comprising:
a model training module configured to: input video data into a video instance segmentation model for training;
wherein the video instance segmentation model comprises a feature pyramid network, a dynamic conditional convolution network and a fully connected label generation network connected in sequence; the feature pyramid network extracts multi-scale features from the 0th frame image, which carries an instance label, and from the subsequent frame images of the video; the dynamic conditional convolution network extracts target instance features from the multi-scale features of the subsequent frame images and combines them with the labeled 0th frame image to generate dynamic conditional convolution kernels; the fully connected label generation network convolves the features of the subsequent frames of the video with the dynamic conditional convolution kernels to obtain the instance labels in those frames;
a model constraint module configured to: constrain the video instance segmentation model based on a loss function and output the trained video instance segmentation model;
a model testing module configured to: acquire the instance label of the first frame of the video to be tested, input it together with the video into the trained video instance segmentation model, and output the corresponding instances in the subsequent frames of the video.
A third aspect of the invention provides a computer-readable storage medium.
A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method for segmenting a video instance based on dynamic conditional convolution as defined above in relation to the first aspect.
A fourth aspect of the invention provides a computer apparatus.
A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps in the method for video instance segmentation based on dynamic conditional convolution as described in the first aspect above when executing the program.
Compared with the prior art, the invention has the beneficial effects that:
1. In terms of segmentation quality, the invention is the first to propose a video instance segmentation model based on dynamic conditional convolution. Multi-scale features are extracted from the labeled 0th video frame by the pyramid network and combined with the corresponding label of the image; the generated dynamic conditional convolution kernels implicitly encode the features of the instance in the image and can flexibly represent its irregular shape, which improves model accuracy compared with ROI-based methods.
2. In terms of practicality and extensibility, the method builds on the simple and flexible FCOS object detector combined with dynamic conditional convolution. In one embodiment, taking video segmentation as an example, automatic segmentation of the instances in a video is realized. In the video instance segmentation method based on dynamic conditional convolution kernels, the pyramid network of the model extracts the multi-scale features of each frame of the video, which are then used to generate the dynamic conditional convolution kernels and to segment the instances in the video.
3. In terms of computational efficiency, the convolution kernels of the fully connected label generation network are generated dynamically from the instance-region features of the 0th frame and the subsequent frames; only one instance label in the video is computed at a time, which reduces the amount of information the conditional convolution kernels must carry and therefore improves computational efficiency.
4. In terms of running speed, because dynamic conditional convolution kernels are used, the model does not need to be optimized on augmented 0th-frame images at test time, which increases inference speed.
Advantages of additional aspects of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, are included to provide a further understanding of the invention; they illustrate exemplary embodiments of the invention and together with the description serve to explain the invention, without limiting it.
FIG. 1 is a flow chart of a video instance segmentation method based on dynamic conditional convolution kernel according to the present invention;
FIG. 2 is a model diagram of a video example segmentation method based on a dynamic conditional convolution kernel according to the present invention;
FIG. 3 is a schematic diagram of a feature pyramid network in an embodiment of the invention;
FIG. 4 is a schematic diagram of a dynamic conditional convolution network in an embodiment of the present invention;
FIG. 5 is a schematic diagram of a fully-connected label-generating network in an embodiment of the invention;
FIG. 6 is a diagram illustrating the result of data expansion according to an embodiment of the present invention;
FIG. 7 is an exemplary illustration of an example division in an embodiment of the present invention;
FIG. 8 is an exemplary graph of a set of video example segmentation results based on dynamic convolution kernels in an embodiment of the present invention.
Detailed Description
The invention is further described with reference to the following figures and examples.
It is to be understood that the following detailed description is exemplary and is intended to provide further explanation of the invention as claimed. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the invention. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
Example one
As shown in fig. 1-2, the present embodiment provides a video instance segmentation method based on dynamic conditional convolution. The embodiment is illustrated by applying the method to a server; it is understood that the method may also be applied to a terminal, or to a system including a terminal and a server, implemented through interaction between the terminal and the server. The server may be an independent physical server, a server cluster or distributed system formed by multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms. The terminal may be, but is not limited to, a smart phone, a tablet computer, a laptop computer, a desktop computer, a smart speaker, a smart watch, and the like. The terminal and the server may be directly or indirectly connected through wired or wireless communication, which is not limited in this application. In this embodiment, the method includes the following steps:
s101: inputting video data into a video example segmentation model for training;
wherein the video instance segmentation model comprises: the method comprises the following steps of sequentially connecting a characteristic pyramid network, a dynamic conditional convolution network and a full connection label generation network; extracting multi-scale features of a 0 th frame image with example label images and subsequent frame images in the video by the feature pyramid network; the dynamic conditional convolution network extracts target example features from the multi-scale features of the subsequent frame images, and combines the target example features with the 0 th frame image with the example label to generate a dynamic conditional convolution kernel; the full-connection label generation network uses a dynamic condition convolution core to carry out feature convolution operation on subsequent frames of the video to obtain example labels in the subsequent frames;
Specifically, the pyramid network extracts the multi-scale features of the 0th frame and of the subsequent frames respectively. The fused multi-scale image features are input into the dynamic conditional convolution network and combined with the 0th-frame label to compute the basis filters. The basis filters are combined into a group of dynamic convolution kernels that can be used by the fully connected label generation network, which then convolves the multi-scale features of the frame to be segmented to obtain the instance segmentation result. The three networks form a unified framework and can be trained end to end.
During training, 4 consecutive images are randomly extracted from the video sequence to form a new video segment. The feature pyramid network extracts the multi-scale features of the 0th frame and of the subsequent frames, from which convolution kernels kernel1 and kernel2 are generated; these serve as the convolution kernels of the fully connected label generation network and produce the final video segmentation result.
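A minimal sketch of this clip sampling (function and variable names are assumptions, not taken from the patent):

```python
import random

def sample_clip(video_frames, clip_len=4):
    """Randomly extract `clip_len` consecutive frames from a video sequence.

    `video_frames` is assumed to be a list of frame tensors or file paths;
    the first frame of the returned clip plays the role of "frame 0", i.e.
    the reference frame whose instance label conditions the dynamic kernels.
    """
    assert len(video_frames) >= clip_len, "video shorter than clip length"
    start = random.randint(0, len(video_frames) - clip_len)
    return video_frames[start:start + clip_len]
```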
As shown in fig. 3, the feature pyramid network (FPN) consists of ResNet blocks, pooling layers and convolutions with a stride of 2; the structure is named for its resemblance to a pyramid. In the bottom-up path the image passes through convolution and downsampling operations, the feature maps becoming progressively smaller, and the last feature map of each scale serves as the high-level semantic output of the bottom-up path. The high-level semantic features are then convolved and upsampled along a top-down path; the two paths are linked by skip connections and their features are fused, and the fused multi-scale features are output for subsequent detection or segmentation. Passing a video frame through the feature pyramid network yields deep image features at different scales; after upsampling and feature fusion, the multi-scale features can serve as the input of the dynamic conditional convolution network and of the fully connected label generation network.
It should be noted that detecting and segmenting objects with large size differences is a difficult problem in computer vision. Using multi-scale features enables the network to capture information at different scales, which greatly helps to improve network performance.
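For illustration only, the bottom-up/top-down fusion just described could be sketched in PyTorch roughly as follows; the class name, channel sizes and the choice of nearest-neighbour upsampling are assumptions, not details taken from the patent:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleFPN(nn.Module):
    """Minimal feature-pyramid sketch: lateral 1x1 convolutions plus a
    top-down upsampling path joined by skip connections, producing the
    fused multi-scale maps {P3, P4, P5}. Channel sizes assume a
    ResNet-50-style backbone."""

    def __init__(self, in_channels=(512, 1024, 2048), out_channels=256):
        super().__init__()
        self.lateral = nn.ModuleList(
            nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels)
        self.smooth = nn.ModuleList(
            nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)
            for _ in in_channels)

    def forward(self, c3, c4, c5):
        # top-down path: upsample the coarser map and add the lateral feature
        p5 = self.lateral[2](c5)
        p4 = self.lateral[1](c4) + F.interpolate(p5, size=c4.shape[-2:], mode="nearest")
        p3 = self.lateral[0](c3) + F.interpolate(p4, size=c3.shape[-2:], mode="nearest")
        return self.smooth[0](p3), self.smooth[1](p4), self.smooth[2](p5)
```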
In one or more embodiments, as shown in FIG. 4, the dynamic conditional convolution network (DCCN) is composed of dynamic convolution layers. The fused multi-scale features {P3, P4, P5} output by the feature pyramid network each pass through three trainable convolution operations to obtain a feature map, which is combined with the instance label of the 0th frame to generate a group of basis filters; the basis filters are combined into 3 x 3 convolution kernels whose values are assigned to the fully connected label generation network.
It should be noted that the dynamic conditional convolution network is a fully convolutional neural network; by replacing the ROI operations used in other deep-learning-based detection/segmentation methods (e.g., Faster R-CNN, Mask R-CNN), it can run faster than those methods. Taking the original instance as the condition, the dynamic conditional convolution network dynamically generates the convolution kernels for the picture to be analyzed, which enables conditional learning from less data and reduces the amount of data required to train the model. In addition, the network does not need to be re-optimized on the 0th frame at test time, which effectively speeds up inference.
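A hedged sketch of how such a kernel-generating branch might look in PyTorch is given below. The layer sizes, the way the frame-0 label is concatenated, and the spatial pooling into a single parameter vector are all assumptions for illustration; the patent itself only specifies three trainable convolutions combined with the frame-0 label to produce basis filters:

```python
import torch
import torch.nn as nn

class KernelGenerator(nn.Module):
    """Sketch of the dynamic conditional convolution network: turns frame-0
    features plus the frame-0 instance label into a flat parameter vector
    that is later reshaped into the weights/biases of a two-layer
    label-generation head. All sizes are illustrative assumptions."""

    def __init__(self, feat_channels=256, mid_channels=8):
        super().__init__()
        # dynamic parameters: conv1 (feat->mid, 3x3) + conv2 (mid->1, 3x3), both with biases
        self.num_params = (feat_channels * mid_channels * 9 + mid_channels) \
                          + (mid_channels * 1 * 9 + 1)
        self.tower = nn.Sequential(  # the "three trainable convolution operations"
            nn.Conv2d(feat_channels + 1, feat_channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat_channels, feat_channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat_channels, feat_channels, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.controller = nn.Conv2d(feat_channels, self.num_params, 3, padding=1)

    def forward(self, frame0_feat, frame0_label):
        # condition the features on the frame-0 instance label (resized to the feature size)
        label = nn.functional.interpolate(frame0_label, size=frame0_feat.shape[-2:], mode="nearest")
        x = self.tower(torch.cat([frame0_feat, label], dim=1))
        params = self.controller(x)              # (B, num_params, H, W)
        # pool over space: one parameter vector per reference image (a simplification)
        return params.flatten(2).mean(dim=2)     # (B, num_params)
```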
In one or more embodiments, as shown in fig. 5, the fully connected label generation network (FCMN) is composed of two convolution layers whose kernel parameters come from the prediction of the dynamic conditional convolution network; the instance segmentation prediction for the subsequent frame image is obtained through these two full-convolution operations.
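Continuing the sketch above, applying the dynamically generated parameters as a two-layer convolutional head might look as follows; the shapes and channel sizes mirror the assumed KernelGenerator sketch, not the patent:

```python
import torch
import torch.nn.functional as F

def apply_dynamic_mask_head(feat, params, feat_channels=256, mid_channels=8):
    """Sketch of the fully connected label generation network: two convolutions
    whose weights and biases are sliced out of the dynamically generated
    parameter vector `params` (1-D, for one instance) and applied to `feat`,
    the features of the frame to be segmented, of shape (1, feat_channels, H, W)."""
    n1 = feat_channels * mid_channels * 9
    w1 = params[:n1].view(mid_channels, feat_channels, 3, 3)
    b1 = params[n1:n1 + mid_channels]
    n2 = n1 + mid_channels
    w2 = params[n2:n2 + mid_channels * 9].view(1, mid_channels, 3, 3)
    b2 = params[n2 + mid_channels * 9:]

    x = F.relu(F.conv2d(feat, w1, b1, padding=1))   # first dynamic convolution
    logits = F.conv2d(x, w2, b2, padding=1)         # second dynamic convolution -> instance mask logits
    return torch.sigmoid(logits)
```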
S102: constraining the video instance segmentation model based on a loss function, and outputting the trained video instance segmentation model;
as one or more embodiments, the total loss function of the model is expressed as:
L_Total = L_fcos + λ·L_mask
where L_fcos is the original FCOS loss, L_mask is the instance segmentation loss, and λ balances the two terms.
L_mask is defined as:
L_mask = (1/N_pos) · Σ_{x,y} 1{c*_{x,y} > 0} · L_dice( Mask(F̃_{x,y}; θ_{x,y}), M*_{x,y} )
where c*_{x,y} is the classification label of location (x, y) and denotes the class of the instance at that location; if there is no instance at the location, it belongs to the background region and the classification result is 0. N_pos is Σ_{x,y} 1{c*_{x,y} > 0}, i.e. the number of positive (instance) locations. 1{c*_{x,y} > 0} is the indicator function, equal to 1 if c*_{x,y} > 0 and 0 otherwise. θ_{x,y} are the filter parameters for location (x, y). F̃_{x,y} is the combination of the mask feature map F_mask with the relative-coordinate map O_{x,y}, where O_{x,y} contains the coordinates of all locations on the feature map relative to (x, y). Mask(F̃_{x,y}; θ_{x,y}) denotes the result of convolving this feature map with the dynamic convolution parameters θ_{x,y}, and M*_{x,y} is the ground-truth mask at location (x, y).
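As a hedged illustration of the per-instance mask term, a Dice-style loss could be implemented as follows; the smoothing constant and tensor layout are assumptions:

```python
import torch

def dice_loss(pred, target, eps=1.0):
    """Dice-style loss between a predicted mask (after sigmoid) and the
    ground-truth binary mask; `eps` avoids division by zero.
    Both tensors are assumed to have shape (N, H, W) or (N, 1, H, W)."""
    pred = pred.flatten(1)
    target = target.flatten(1).float()
    inter = (pred * target).sum(dim=1)
    union = (pred * pred).sum(dim=1) + (target * target).sum(dim=1)
    return 1.0 - (2.0 * inter + eps) / (union + eps)   # one loss value per mask
```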
S103: acquiring the instance label of the first frame of the video to be tested, inputting it together with the video into the trained video instance segmentation model, and outputting the corresponding instances in the subsequent frames of the video.
The instance label of the first frame of the video to be tested means that only the first frame image of the video data to be tested carries instance segmentation label data.
In one or more embodiments, before the video data is input into the video instance segmentation model for training, the method includes: acquiring video data, performing video data expansion on the basis of the original video data with a data augmentation method, and using the generated video data training set to train the video instance segmentation model, wherein all frames of all videos in the training set carry instance segmentation label data.
In one or more embodiments, the video instance segmentation model comprises two feature pyramid networks of the same structure, each comprising ResNet blocks, pooling layers and convolutions with a stride of 2.
The input of the dynamic conditional convolution network is the multi-scale features of the subsequent frame images extracted by the feature pyramid network; these features each pass through three trainable convolution operations to obtain feature vectors, which are then combined with the 0th frame image carrying the instance label to generate a group of basis filters; the basis filters are combined into convolution kernels, whose values are assigned to the fully connected label generation network.
In one or more embodiments, the fully connected label generation network comprises two convolution layers; the convolution kernel parameters are predicted by the dynamic conditional convolution network, and the instance segmentation prediction for the subsequent frame image is obtained through the two full-convolution operations.
Example two
The embodiment provides a video instance segmentation system based on dynamic conditional convolution.
A system for video instance segmentation based on dynamic conditional convolution, comprising:
a model training module configured to: input video data into a video instance segmentation model for training;
wherein the video instance segmentation model comprises a feature pyramid network, a dynamic conditional convolution network and a fully connected label generation network connected in sequence; the feature pyramid network extracts multi-scale features from the 0th frame image, which carries an instance label, and from the subsequent frame images of the video; the dynamic conditional convolution network extracts target instance features from the multi-scale features of the subsequent frame images and combines them with the labeled 0th frame image to generate dynamic conditional convolution kernels; the fully connected label generation network convolves the features of the subsequent frames of the video with the dynamic conditional convolution kernels to obtain the instance labels in those frames;
illustratively, the video instance segmentation model is set to use a data path, a pre-training model, a model storage path and the like, and training parameters such as initialization, deviation, regularization, an initial learning rate, a learning rate reduction mode, an optimization algorithm, iteration times, a data enhancement mode and the like to realize training of the video instance segmentation model.
Optionally, the parameters of the feature pyramid network are migration learned using a model pre-trained with the ImageNet dataset, rather than using an initialized model. No change occurs in subsequent training. Except the parameters in the characteristic pyramid network, other variable parameters in the whole model are subjected to back propagation optimization in the training process, and the trained model is used for video instance segmentation.
It should be noted that knowledge or experience learned by the model from previously learned tasks, when applied to new tasks, is trained faster or works better. The method uses the ImageNet data set pre-trained model, and can ensure that the feature pyramid network can still extract the key information of the image for subsequent detection and segmentation tasks without massive data training.
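One possible way to realize this freezing in PyTorch is sketched below; `model.fpn`, the checkpoint path and the optimizer settings are assumed names and values, not taken from the patent:

```python
import torch

def freeze_fpn(model, ckpt_path="imagenet_pretrained_backbone.pth"):
    """Load ImageNet-pretrained weights into the feature pyramid backbone and
    freeze them, so back-propagation only updates the remaining modules."""
    state = torch.load(ckpt_path, map_location="cpu")
    model.fpn.load_state_dict(state, strict=False)
    for p in model.fpn.parameters():
        p.requires_grad = False          # FPN parameters stay fixed during training

    # the optimizer only sees parameters that still require gradients
    return torch.optim.SGD(
        (p for p in model.parameters() if p.requires_grad),
        lr=0.01, momentum=0.9, weight_decay=1e-4)
```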
During training, 4 consecutive images are randomly extracted from the whole video sequence to form a new video segment. The feature pyramid network extracts the multi-scale features of the 0th frame and of the subsequent frames of the new segment, and the dynamic conditional convolution network generates convolution kernels kernel1 and kernel2, which serve as the convolution kernel parameters of the fully connected label generation network; convolving the multi-scale features of the subsequent frames to be segmented produces the final video instance segmentation result.
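Tying the pieces together, a high-level and heavily simplified forward pass over one sampled clip might look like this; all module names refer to the earlier sketches and are assumptions:

```python
def forward_clip(backbone, fpn, kernel_gen, mask_head, clip, frame0_label):
    """Sketch of one pass over a sampled clip. `backbone` is assumed to return
    the (C3, C4, C5) stages consumed by SimpleFPN; `kernel_gen` and `mask_head`
    refer to the KernelGenerator / apply_dynamic_mask_head sketches above.
    clip[0] is frame 0 (with its instance label frame0_label); clip[1:] are the
    frames whose instance labels are to be predicted."""
    feats = [fpn(*backbone(frame)) for frame in clip]      # (P3, P4, P5) per frame
    params = kernel_gen(feats[0][0], frame0_label)         # dynamic kernels conditioned on frame 0
    # segment every subsequent frame with the same dynamically generated kernels
    return [mask_head(p3, params[0]) for (p3, _p4, _p5) in feats[1:]]
```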
A model constraint module configured to: constraining the video instance segmentation model based on a loss function, and outputting the trained video instance segmentation model;
a model testing module configured to: and acquiring an example label of a first frame of the video to be detected, inputting the example label of the first frame of the video into the trained video example segmentation model, and outputting a corresponding example in a subsequent frame of the video.
Illustratively, the model testing process is similar to the training process in how the input images are set up, which model is used, and so forth. In one embodiment, the inputs include the test folder path, the test model, the number of test images and the test result output path. Video frames are extracted from the whole video, the dynamic conditional convolution kernels are generated, the multi-scale features are convolved, and the instance segmentation results for the video frames are produced. Finally, the test results are displayed, showing the instance segmentation results generated by the model.
In one embodiment, the model testing module includes a file store and a visual display. The generated file is stored in a memory of the computer device, and the generated video segmentation result is visually displayed.
As can be seen from fig. 8, the technical solution of the present invention ensures the accuracy of example segmentation in the video.
In one or more embodiments, a video data acquisition module and a video data expansion module are further included before the model training module.
Wherein the video data acquisition module is configured to: video data is acquired.
Illustratively, this embodiment uses publicly available datasets for the experiments. The video data come from two public benchmarks: DAVIS 2016 (Densely Annotated VIdeo Segmentation) and YouTube-VOS (a Large-Scale Benchmark for Video Object Segmentation), 2020.
A video data augmentation module configured to: and performing video data expansion on the basis of the original video data by adopting a data expansion method, wherein the generated video data training set is used for training a video instance segmentation model. Wherein all frames of all video data in the video data training set carry instance segmented label data.
Data expansion is performed on the training-set and test-set videos, generating a new dataset on the basis of the original data and thereby enlarging the dataset. Specifically, the augmentation modes include simple augmentation methods (e.g., flipping, rotation, saturation, grayscale, brightness, center cropping, contrast, color flipping, affine transformation) and random erasing or modification of partial regions of the image (e.g., Cutout and CutMix operations). In one embodiment, FIG. 6 shows the result of data augmentation on one frame of a video sequence.
It should be noted that, when the video data is subjected to the data expansion process, the same augmentation operation needs to be performed on a sequence of images to maintain the consistency of the video data.
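A minimal sketch of applying one shared set of augmentation parameters to every frame of a clip is shown below; only flip and rotation are included, and the function names are assumptions:

```python
import random
import torchvision.transforms.functional as TF

def augment_sequence(frames, labels):
    """Draw one random set of augmentation parameters and apply it to every
    frame (and its label) of a clip, so the clip stays temporally consistent.
    Other transforms (brightness, cropping, etc.) would follow the same pattern."""
    do_flip = random.random() < 0.5
    angle = random.uniform(-15, 15)
    out_frames, out_labels = [], []
    for img, lab in zip(frames, labels):
        if do_flip:
            img, lab = TF.hflip(img), TF.hflip(lab)
        img = TF.rotate(img, angle)
        lab = TF.rotate(lab, angle)   # default nearest interpolation keeps label values discrete
        out_frames.append(img)
        out_labels.append(lab)
    return out_frames, out_labels
```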
When performing the expansion processing on the video data, the implementation language is not limited; for example, the data expansion can be implemented in Matlab, Python or any other programming language.
Example three
The present embodiment provides a computer-readable storage medium, on which a computer program is stored, which when executed by a processor implements the steps in the dynamic conditional convolution-based video instance segmentation method as described in the first embodiment above.
Example four
The present embodiment provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the processor executes the program, the processor implements the steps in the video instance segmentation method based on dynamic conditional convolution as described in the first embodiment.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A video instance segmentation method based on dynamic conditional convolution is characterized by comprising the following steps:
inputting video data into a video instance segmentation model for training;
wherein the video instance segmentation model comprises a feature pyramid network, a dynamic conditional convolution network and a fully connected label generation network connected in sequence; the feature pyramid network extracts multi-scale features from the 0th frame image, which carries an instance label, and from the subsequent frame images of the video; the dynamic conditional convolution network extracts target instance features from the multi-scale features of the subsequent frame images and combines them with the labeled 0th frame image to generate dynamic conditional convolution kernels; the fully connected label generation network convolves the features of the subsequent frames of the video with the dynamic conditional convolution kernels to obtain the instance labels in those frames;
constraining the video instance segmentation model based on a loss function, and outputting the trained video instance segmentation model;
and acquiring the instance label of the first frame of the video to be tested, inputting it together with the video into the trained video instance segmentation model, and outputting the corresponding instances in the subsequent frames of the video.
2. The method according to claim 1, wherein, before the video data is input into the video instance segmentation model for training, the method comprises:
acquiring video data, and performing video data expansion on the basis of the original video data with a data augmentation method, the generated video data training set being used to train the video instance segmentation model.
3. The method according to claim 2, wherein all frames of all video data in the training set of video data carry label data for instance segmentation.
4. The method according to claim 1, wherein the instance label of the first frame of the video to be tested means that only the first frame image of the video data to be tested carries instance segmentation label data.
5. The method of claim 1, wherein the video instance segmentation model comprises two feature pyramid networks of the same structure, each feature pyramid network comprising a ResNet block, a pooling layer, and a convolution kernel with a step size of 2.
6. The method for video instance segmentation based on dynamic conditional convolution according to claim 1, wherein the input of the dynamic conditional convolution network is the multi-scale features of the subsequent frame images extracted by the feature pyramid network; the multi-scale features of the subsequent frame images each pass through three trainable convolution operations to obtain feature vectors, the feature vectors are then combined with the 0th frame image carrying the instance label to generate a group of basis filters, the basis filters are combined into convolution kernels, and the convolution kernels are assigned to the fully connected label generation network.
7. The method according to claim 1, wherein the fully connected label generation network comprises two convolution layers, the convolution kernel parameters of which are predicted by the dynamic conditional convolution network, and the instance segmentation prediction for the subsequent frame image is obtained through the two full-convolution operations.
8. A system for video instance segmentation based on dynamic conditional convolution, comprising:
a model training module configured to: input video data into a video instance segmentation model for training;
wherein the video instance segmentation model comprises a feature pyramid network, a dynamic conditional convolution network and a fully connected label generation network connected in sequence; the feature pyramid network extracts multi-scale features from the 0th frame image, which carries an instance label, and from the subsequent frame images of the video; the dynamic conditional convolution network extracts target instance features from the multi-scale features of the subsequent frame images and combines them with the labeled 0th frame image to generate dynamic conditional convolution kernels; the fully connected label generation network convolves the features of the subsequent frames of the video with the dynamic conditional convolution kernels to obtain the instance labels in those frames;
a model constraint module configured to: constrain the video instance segmentation model based on a loss function and output the trained video instance segmentation model;
a model testing module configured to: acquire the instance label of the first frame of the video to be tested, input it together with the video into the trained video instance segmentation model, and output the corresponding instances in the subsequent frames of the video.
9. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method for video instance segmentation based on dynamic conditional convolution according to any one of claims 1 to 7.
10. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps in the method for video instance segmentation based on dynamic conditional convolution according to any one of claims 1 to 7 when executing the program.
CN202110347704.6A 2021-03-31 2021-03-31 Video instance segmentation method and system based on dynamic condition convolution Active CN113065459B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110347704.6A CN113065459B (en) 2021-03-31 2021-03-31 Video instance segmentation method and system based on dynamic condition convolution

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110347704.6A CN113065459B (en) 2021-03-31 2021-03-31 Video instance segmentation method and system based on dynamic condition convolution

Publications (2)

Publication Number Publication Date
CN113065459A true CN113065459A (en) 2021-07-02
CN113065459B CN113065459B (en) 2022-05-17

Family

ID=76564832

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110347704.6A Active CN113065459B (en) 2021-03-31 2021-03-31 Video instance segmentation method and system based on dynamic condition convolution

Country Status (1)

Country Link
CN (1) CN113065459B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113591868A (en) * 2021-07-30 2021-11-02 南开大学 Video target segmentation method and system based on full-duplex strategy

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111275518A (en) * 2020-01-15 2020-06-12 中山大学 Video virtual fitting method and device based on mixed optical flow
CN111626284A (en) * 2020-05-26 2020-09-04 广东小天才科技有限公司 Method and device for removing handwritten fonts, electronic equipment and storage medium
CN111652216A (en) * 2020-06-03 2020-09-11 北京工商大学 Multi-scale target detection model method based on metric learning
CN111860386A (en) * 2020-07-27 2020-10-30 山东大学 Video semantic segmentation method based on ConvLSTM convolutional neural network

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111275518A (en) * 2020-01-15 2020-06-12 中山大学 Video virtual fitting method and device based on mixed optical flow
CN111626284A (en) * 2020-05-26 2020-09-04 广东小天才科技有限公司 Method and device for removing handwritten fonts, electronic equipment and storage medium
CN111652216A (en) * 2020-06-03 2020-09-11 北京工商大学 Multi-scale target detection model method based on metric learning
CN111860386A (en) * 2020-07-27 2020-10-30 山东大学 Video semantic segmentation method based on ConvLSTM convolutional neural network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ZHI TIAN ET AL.: "Instance and Panoptic Segmentation Using Conditional Convolutions", arXiv *
ZHI TIAN, CHUNHUA SHEN, HAO CHEN: "Conditional Convolutions for Instance Segmentation", Springer *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113591868A (en) * 2021-07-30 2021-11-02 南开大学 Video target segmentation method and system based on full-duplex strategy
CN113591868B (en) * 2021-07-30 2023-09-01 南开大学 Video target segmentation method and system based on full duplex strategy

Also Published As

Publication number Publication date
CN113065459B (en) 2022-05-17

Similar Documents

Publication Publication Date Title
US11367271B2 (en) Similarity propagation for one-shot and few-shot image segmentation
AU2019451948B2 (en) Real-time video ultra resolution
CN111145188A (en) Image segmentation method based on ResNet and UNet models
KR102311796B1 (en) Method and Apparatus for Deblurring of Human Motion using Localized Body Prior
CN111696110A (en) Scene segmentation method and system
Chung et al. Traffic sign recognition in harsh environment using attention based convolutional pooling neural network
Bhalerao et al. Optimization of loss function on human faces using generative adversarial networks
Szemenyei et al. Real-time scene understanding using deep neural networks for RoboCup SPL
Golts et al. Deep energy: Task driven training of deep neural networks
CN113065459B (en) Video instance segmentation method and system based on dynamic condition convolution
Li et al. 2D amodal instance segmentation guided by 3D shape prior
Terziyan et al. Causality-aware convolutional neural networks for advanced image classification and generation
Abbas et al. Improving deep learning-based image super-resolution with residual learning and perceptual loss using SRGAN model
Lv et al. Memory‐augmented neural networks based dynamic complex image segmentation in digital twins for self‐driving vehicle
Anitha et al. Convolution Neural Network and Auto-encoder Hybrid Scheme for Automatic Colorization of Grayscale Images
Zaman et al. A novel driver emotion recognition system based on deep ensemble classification
Lima et al. Automatic design of deep neural networks applied to image segmentation problems
Kar Mastering Computer Vision with TensorFlow 2. x: Build advanced computer vision applications using machine learning and deep learning techniques
Wang et al. A multi-scale attentive recurrent network for image dehazing
Capece et al. Converting night-time images to day-time images through a deep learning approach
Yi et al. Convergence of multiple deep neural networks for classification with fewer labeled data
Golts et al. Deep-energy: Unsupervised training of deep neural networks
Sun et al. A Metaverse text recognition model based on character-level contrastive learning
Sidiropoulos Application of deep neural networks for bicycle detection and classification
US20240161403A1 (en) High resolution text-to-3d content creation

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant