CN115761574A - Weakly supervised video object segmentation method and device based on bounding box annotation - Google Patents

Weakly supervised video object segmentation method and device based on bounding box annotation

Info

Publication number
CN115761574A
Authority
CN
China
Prior art keywords
frame, model, training, pseudo, label
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211322815.2A
Other languages
Chinese (zh)
Inventor
胡建芳 (Hu Jianfang)
林子杭 (Lin Zihang)
谭超镭 (Tan Chaolei)
郑伟诗 (Zheng Weishi)
王军 (Wang Jun)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Lab
Sun Yat-sen University
Original Assignee
Zhejiang Lab
Sun Yat-sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Lab and Sun Yat-sen University
Priority to CN202211322815.2A
Publication of CN115761574A
Legal status: Pending

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00: Road transport of goods or passengers
    • Y02T 10/10: Internal combustion engine [ICE] based vehicles
    • Y02T 10/40: Engine management systems

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a weakly supervised video object segmentation method and device based on bounding box annotation. The method comprises the following steps: training a pseudo-label generation model based on the PReMVOS model on an image segmentation dataset; generating corresponding pseudo mask labels frame by frame from the video data and its bounding box annotations with the pseudo-label generation model; and training a video object segmentation model on the generated pseudo mask labels with a co-teaching algorithm. By training the video object segmentation model with low-cost bounding box annotations, the method reduces the cost of annotating new data when the model is migrated to a practical application scenario, lowering the barrier to its deployment. The co-teaching training algorithm also allows the large number of existing video object tracking datasets to be exploited more fully for training, strengthening the performance and generalization ability of the video object segmentation model.

Description

Weakly supervised video object segmentation method and device based on bounding box annotation
Technical Field
The invention relates to the technical field of computer vision, and in particular to a weakly supervised video object segmentation method and device based on bounding box annotation.
Background
Video object segmentation aims to segment specific objects, or objects of interest to humans, in every frame of a given video. As a fundamental task in the field of computer vision, video object segmentation is important for video understanding and analysis, and it underpins practical applications such as video editing, human-computer interaction, and autonomous driving.
Existing state-of-the-art algorithms and techniques in this area are mainly based on deep learning. With the release of large-scale, finely annotated datasets such as DAVIS and YouTube-VOS, the available labeled data can basically support the training of deep learning models, and some existing methods achieve excellent results. Current video object segmentation algorithms are mainly matching-based, object-proposal-based, mask-propagation-based, or tracking-based, but the training of these models depends heavily on large-scale, high-quality annotations in existing datasets. Existing video object segmentation techniques require fine, frame-by-frame, pixel-level mask annotation at the training stage; this annotation is very expensive, which limits the scale of the datasets and makes the annotation cost of migrating existing methods to a specific application scenario hard to accept. The methods that achieve better performance all rely on large amounts of high-quality, finely annotated data; that is, most of them are data-driven, and dataset size limits the development of such data-driven deep learning algorithms to a certain extent.
Disclosure of Invention
The invention mainly aims to overcome the defects of the prior art by providing a weakly supervised video object segmentation method and device based on bounding box annotation. The method trains a video object segmentation model with low-cost bounding box annotations, reducing the cost of annotating new data when the model is migrated to a practical application scenario and lowering the barrier to its deployment. By using a co-teaching training algorithm, the large number of existing video object tracking datasets can be exploited more fully to train the video object segmentation model, strengthening its performance and generalization ability.
In order to achieve the above purpose, the invention adopts the following technical scheme:
In one aspect, the invention provides a weakly supervised video object segmentation method based on bounding box annotation, comprising the following steps:
training a pseudo-label generation model based on the PReMVOS model on an image segmentation dataset, wherein the input of the pseudo-label generation model is an original video and its corresponding bounding box annotations, and the output is pseudo mask labels;
generating corresponding pseudo mask labels frame by frame from the video data and the bounding box annotations with the pseudo-label generation model;
training a video object segmentation model on the generated pseudo mask labels with a co-teaching algorithm, and performing object segmentation on video data with the trained video object segmentation model to obtain object segmentation results; in the co-teaching algorithm, two networks with the same structure but different parameters each screen cleaner data for the other in every iteration of the training stage, providing it to the other for training and thereby alleviating the influence of noisy labels.
As a preferred technical solution, training the pseudo-label generation model based on the PReMVOS model on an image segmentation dataset means designing the pseudo-label generation model around the refinement module of the PReMVOS model and training it on the Mapillary image segmentation dataset, specifically:
the input of the pseudo-label generation model is an original image and its corresponding bounding box annotation, and the original image is concatenated with the binary map corresponding to the bounding box to obtain a four-channel raw input; on the Mapillary dataset, the bounding box annotations are derived from the original fine mask labels, i.e., each mask takes the tightest box that completely contains it as its bounding box annotation; pixels inside the box have value 1 in the binary map and all other pixels are 0;
the annotated box is appropriately enlarged to obtain a cropping region, and the raw input is cropped accordingly to obtain a cropped image;
the cropped image is fed into a segmentation network, which outputs the segmentation mask corresponding to the target object.
As a preferred technical solution, the annotated box is appropriately enlarged as follows:
the box is extended by n pixels in each of the four directions: up, down, left, and right.
As a preferred technical solution, the segmentation network used by the pseudo-label generation model has a DeepLab-v3+ structure, and a pixel-wise cross entropy function is used as the loss function for model training.
As a preferred technical solution, generating corresponding pseudo mask labels frame by frame from the video data and the bounding box annotations with the pseudo-label generation model specifically comprises:
converting the video data in the YouTube-VOS dataset into image frames with a video-to-image conversion tool;
feeding each image frame and its corresponding bounding box annotations into the pseudo-label generation model to obtain the pseudo mask labels of each frame;
when a frame contains multiple target objects, processing the objects one by one when generating the pseudo mask labels to obtain the pseudo mask label of each object; if two pseudo mask labels overlap, the overlapping region is assigned to the pseudo mask label with the smaller area.
As a preferred technical solution, the ffmpeg tool is used to convert the video data into image frames.
As a preferred technical solution, training the video object segmentation model on the generated pseudo mask labels with the co-teaching algorithm specifically comprises:
randomly initializing two video object segmentation models with the same structure but different parameters, denoted model A and model B, and pre-training models A and B on the YouTube-VOS dataset with the generated pseudo mask labels using a frame-by-frame, pixel-wise cross entropy loss function, where the training procedure ensures that models A and B use different training samples in each training iteration;
after the preliminary training is finished, entering the co-teaching training stage: in each training iteration, sampling a batch of training samples, segmenting them with models A and B respectively, and computing the frame-by-frame, pixel-wise cross entropy loss values between each segmentation result and the generated pseudo mask labels; for the output of model A, sorting the cross entropy loss values of the pixels inside the annotated boxes, taking the R(T)% of pixels with the smallest loss values, and training model B with the pseudo mask labels corresponding to those pixels, where

R(T) = τ · min(T / T_k, 1)

in which T is the current training iteration step number, T_k is a parameter controlling the rate of increase of R(T), and τ is a parameter controlling the maximum value of R(T);
for the output of model B, sorting the cross entropy loss values of the pixels inside the annotated boxes, taking the R(T)% of pixels with the smallest loss values, and training model A with the pseudo mask labels corresponding to those pixels;
for pixels outside the annotated boxes, training models A and B with those pixels treated as the background category;
when the predictions of model A and model B gradually converge, finishing training and taking model A as the final video object segmentation model.
As a preferred technical solution, the parameter R(T)% grows gradually as the number of training iterations increases; that is, more and more pseudo mask labels are selected to participate in training as training proceeds.
In another aspect, the invention provides a weakly supervised video object segmentation system based on bounding box annotation, which applies the above weakly supervised video object segmentation method and comprises a pseudo-label generation model training module, a pseudo mask label generation module, and a segmentation model training module;
the pseudo-label generation model training module is used to train the pseudo-label generation model based on the PReMVOS model on the image segmentation dataset;
the pseudo mask label generation module is used to generate corresponding pseudo mask labels frame by frame from the video data and the bounding box annotations with the pseudo-label generation model;
and the segmentation model training module is used to train the video object segmentation model on the generated pseudo mask labels with the co-teaching algorithm.
In a further aspect, the invention provides a computer-readable storage medium containing a program which, when executed, implements the bounding-box-annotation-based weakly supervised video object segmentation method.
Compared with the prior art, the invention has the following advantages and beneficial effects:
aiming at the problems that training data labeling cost is high and sufficient large-scale data is difficult to collect in the field of video target segmentation, the invention provides a frame labeling-based weak surveillance video target segmentation method and device.
The method can be used for marking and training the video target segmentation model by utilizing the frame with low marking cost, so that the marking cost during the training of the video target segmentation model is greatly reduced, the cost of the video target segmentation model when the video target segmentation model is transferred to an actual application scene is reduced, and the landing difficulty of the video target segmentation model is reduced.
The reason why the method can successfully train the video target segmentation model by using frame labeling mainly includes two reasons: the method comprises the steps that firstly, pseudo labels with relatively high quality can be generated by a pseudo label generation model in the method, and the pseudo labels contain abundant structural information beneficial to training of a video target segmentation model; and secondly, the used training algorithm based on 'cooperative teaching' can greatly relieve the influence of noise in pseudo-labeling on training, so that the model is prevented from being excessively interfered by wrong pseudo-labeling.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings used in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present application, and other drawings can be obtained from them by those skilled in the art without creative effort.
FIG. 1 is a flowchart of a weakly supervised video object segmentation method based on bounding box annotation according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the pseudo-label generation model according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the co-teaching training algorithm according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a weakly supervised video object segmentation system based on bounding box annotation according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a storage medium according to an embodiment of the present invention.
Detailed Description
In order to make the technical solution of the present invention better understood by those skilled in the art, it is described clearly and completely below in conjunction with the embodiments and the accompanying drawings of the present application. It should be understood that the drawings are for illustration only and are not to be construed as limiting this patent. The described embodiments are only a part of the embodiments of the present application, not all of them. All other embodiments obtained by a person skilled in the art from the given embodiments without creative effort shall fall within the protection scope of the present application.
Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the specification. The appearances of the phrase in various places in the specification do not necessarily all refer to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is understood explicitly and implicitly by those skilled in the art that the embodiments described herein can be combined with other embodiments.
Examples
As shown in FIG. 1, this embodiment provides a weakly supervised video object segmentation method based on bounding box annotation, comprising the following steps:
the method comprises the following steps of training a pseudo-label generation model based on a PReMVS model in an image segmentation data set, specifically:
the method designs a pseudo-label generation model by using an optimization (refinement) module in a PReMVS model, and trains the model on an image segmentation data set Mapilary data set. The pseudo-label generation model takes an original image and a corresponding frame label as input, outputs a corresponding mask label, and is shown in a specific schematic diagram of fig. 2.
The main idea of the model is as follows: the bounding box annotation serves as an extra input channel of the deep neural network, so that the model obtains positional information about the target object while segmenting the input image; meanwhile, regions of the image irrelevant to the target object can be cropped away according to the bounding box annotation, so that the model focuses on segmenting the target object.
The details of the model are as follows: first, the original image is concatenated with the binary map corresponding to the bounding box annotation (pixels inside the box are 1, the rest are 0) to obtain a four-channel raw input; the box is then appropriately enlarged, extending 50 pixels in each of the four directions (up, down, left, and right), to obtain a cropping region; the raw input is cropped according to this region, and the cropped image serves as the input of the segmentation network, which outputs the segmentation mask corresponding to the target object.
the segmentation network structure used by the model is a Deeplab-v3+ structure, and the pixel-by-pixel cross entropy function is used as a loss function to carry out model training to obtain a pseudo-label generation model.
Step 2: generating corresponding pseudo mask labels frame by frame from the video data and the bounding box annotations with the pseudo-label generation model, specifically:
First, the video data in the YouTube-VOS dataset is converted into image frames with a video-to-image conversion tool (such as ffmpeg), and then each image frame together with its corresponding bounding box annotations is fed into the pseudo-label generation model to obtain the pseudo mask labels of each frame. Since a frame may contain several target objects to be segmented and their boxes may overlap, each object is processed one by one when generating the pseudo mask labels to obtain the pseudo mask label of each object; if two pseudo mask labels overlap, the overlapping region is assigned to the pseudo mask label with the smaller area.
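The per-object generation and the smaller-area overlap rule can be sketched as follows. The `generator` callable stands in for the trained pseudo-label generation model; its interface is an assumption for illustration. Frames can be extracted beforehand with a standard ffmpeg invocation such as `ffmpeg -i video.mp4 frames/%05d.png`.

```python
import numpy as np

def pseudo_masks_for_frame(frame, boxes, generator):
    """Generate pseudo mask labels for one frame with several objects.

    frame: H x W x 3 image; boxes: list of (x1, y1, x2, y2) annotations;
    generator(frame, box) -> H x W binary mask (assumed interface).
    Overlapping pixels are assigned to the pseudo mask with the smaller
    area. Returns an H x W label map: 0 = background, i = i-th object.
    """
    h, w = frame.shape[:2]
    label_map = np.zeros((h, w), dtype=np.int32)
    owner_area = np.full((h, w), np.inf)  # area of each pixel's owner

    for obj_id, box in enumerate(boxes, start=1):
        mask = generator(frame, box).astype(bool)
        area = mask.sum()
        # Claim pixels that are unassigned or owned by a larger mask.
        claim = mask & (area < owner_area)
        label_map[claim] = obj_id
        owner_area[claim] = area

    return label_map
```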
Step 3: training the video object segmentation model on the generated pseudo mask labels with the co-teaching algorithm, specifically:
The method appropriately adapts the co-teaching algorithm from the field of learning with noisy labels to the training of a video object segmentation model, so as to alleviate the negative influence, during learning, of the pseudo mask labels, which carry a certain amount of noise.
The co-teaching algorithm and its modification in the present invention are described below:
The co-teaching algorithm is illustrated in FIG. 3. In co-teaching, two networks with the same structure but different parameters each screen cleaner data for the other in every iteration of the training stage, providing it to the other for training and thereby alleviating the influence of noisy labels. The principle of the algorithm is that a deep neural network tends to memorize easy samples first and hard samples later during training, and noisily labeled samples are generally hard to learn; therefore, noisy samples can be distinguished from clean samples to some extent by their loss function values during training, and a certain proportion of the cleaner samples can be selected for model training.
in the invention, the modified 'cooperation teaching' algorithm flow is specifically as follows:
firstly, randomly initializing two video target segmentation models with the same structure and different parameters, marking as a model A and a model B, marking on a Youtube-VOS data set by using a generated pseudo mask, and performing preliminary training of a small number of iteration steps on the models A and B by using a frame-by-frame pixel-by-pixel cross entropy loss function, wherein the training process ensures that training samples used by the models A and B in each training iteration are different so as to ensure that the models A and B have certain difference;
after finishing the preliminary training, then entering a 'cooperation teaching' training stage, sampling a batch of training samples in each training iteration, respectively carrying out primary segmentation on the training samples by using models A and B, calculating frame-by-frame pixel-by-pixel cross entropy loss function values between a segmentation result and a generated pseudo mask label, sequencing the cross entropy loss function values of pixels in a frame labeling area for an output result of the model A, taking R (T)% pixels with smaller loss function values, and training the model B by using the pseudo labels corresponding to the pixels; wherein
Figure BDA0003911258270000061
Where T is the current training iteration step number, T k τ is a parameter for controlling the maximum value of R (T) for controlling the rate of increase of R (T);
for the output result of the model B, adopting the same operation to screen out R (T)% pixels with smaller loss function values in the frame marking area, and marking the training model A by using a pseudo mask corresponding to the pixels; wherein, the parameter R (T)% is gradually improved along with the increase of the training iteration steps, namely, more and more false labels are screened out to participate in the training model in the training process;
for pixels outside the frame label, all the pixels are used as background categories to train the models A and B; at the later stage of the whole training process, the prediction results of the model A and the model B gradually converge, so that after the training is finished, the model A is taken as a final video target segmentation model, and the model trained in the mode has better capability of segmenting target objects in videos.
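A PyTorch-style sketch of one modified co-teaching iteration is given below. It is an illustration under assumptions, not the patent's code: `model_a` and `model_b` are assumed to return (N, num_classes, H, W) logits, and the optimizers, the pseudo-label batch, and the boolean `in_box` map marking pixels inside the annotated boxes are all supplied by the caller. The R(T) schedule follows the formula above.

```python
import torch
import torch.nn.functional as F

def co_teaching_step(model_a, model_b, opt_a, opt_b,
                     frames, pseudo, in_box, t, t_k, tau):
    """One modified co-teaching iteration (illustrative sketch).

    frames: (N, C, H, W) batch; pseudo: (N, H, W) pseudo mask labels;
    in_box: (N, H, W) bool map, True inside annotated boxes; t: iteration.
    """
    r = tau * min(t / t_k, 1.0)  # R(T): fraction of in-box pixels kept

    loss_a = F.cross_entropy(model_a(frames), pseudo, reduction="none")
    loss_b = F.cross_entropy(model_b(frames), pseudo, reduction="none")

    def keep_mask(loss_map):
        # Keep the r fraction of in-box pixels with the smallest loss;
        # pixels outside the boxes always train as background.
        box_loss = loss_map[in_box]
        k = max(1, int(r * box_loss.numel()))
        thresh = box_loss.topk(k, largest=False).values.max()
        return (~in_box) | (loss_map <= thresh)

    keep_for_b = keep_mask(loss_a.detach())  # A screens pixels for B
    keep_for_a = keep_mask(loss_b.detach())  # B screens pixels for A

    opt_a.zero_grad()
    (loss_a * keep_for_a).sum().div(keep_for_a.sum()).backward()
    opt_a.step()

    opt_b.zero_grad()
    (loss_b * keep_for_b).sum().div(keep_for_b.sum()).backward()
    opt_b.step()
```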
In this embodiment, the co-teaching algorithm is applied to the AGAME video object segmentation model, but any other model that can use pixel-wise cross entropy as its loss function is suitable for the method. In the experiments, using only the low-cost bounding box annotations on the YouTube-VOS dataset, with the pseudo-label generation model trained as described above and the AGAME video object segmentation model trained with the co-teaching algorithm, a performance of 59.5% mIoU (mean intersection-over-union) is obtained, reaching 90.6% of the performance of the same model trained with fine mask labels (65.7% mIoU), which verifies the effectiveness and practicality of the proposed weakly supervised video object segmentation method based on bounding box annotation.
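For reference, the mIoU figure quoted above is a mean intersection-over-union; a generic computation (not code from the patent) looks like this:

```python
import numpy as np

def mean_iou(pred_masks, gt_masks):
    """Mean intersection-over-union over pairs of binary masks.

    pred_masks, gt_masks: iterables of H x W boolean arrays, one pair
    per annotated object (or per object-frame pair).
    """
    ious = []
    for pred, gt in zip(pred_masks, gt_masks):
        inter = np.logical_and(pred, gt).sum()
        union = np.logical_or(pred, gt).sum()
        ious.append(inter / union if union > 0 else 1.0)
    return float(np.mean(ious))
```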
It should be noted that, for simplicity, the foregoing method embodiment is described as a series of acts, but those skilled in the art will appreciate that the present invention is not limited by the described order of acts, as some steps may be performed in other orders or simultaneously according to the present invention.
Based on the same idea as the bounding-box-annotation-based weakly supervised video object segmentation method of the above embodiment, the invention further provides a weakly supervised video object segmentation system based on bounding box annotation, which can be used to execute the above method. For convenience of explanation, the structural schematic diagram of this system shows only the parts related to the embodiment of the present invention; those skilled in the art will understand that the illustrated structure does not limit the apparatus, which may include more or fewer components than illustrated, combine some components, or arrange the components differently.
As shown in FIG. 4, in another embodiment of the present application, a weakly supervised video object segmentation system 100 based on bounding box annotation is provided, comprising a pseudo-label generation model training module 101, a pseudo mask label generation module 102, and a segmentation model training module 103;
the pseudo-label generation model training module 101 is configured to train the pseudo-label generation model based on the PReMVOS model on the image segmentation dataset;
the pseudo mask label generation module 102 is configured to generate corresponding pseudo mask labels frame by frame from the video data and the bounding box annotations with the pseudo-label generation model;
the segmentation model training module 103 is configured to train the video object segmentation model on the generated pseudo mask labels with the co-teaching algorithm.
It should be noted that the weakly supervised video object segmentation system of the present invention corresponds one-to-one with the weakly supervised video object segmentation method of the present invention, and the technical features and beneficial effects described for the above method all apply to the system; for details, refer to the description in the method embodiment, which is not repeated here.
In addition, the logical division into program modules in the above implementation is only an example; in practical applications, the above functions may be distributed among different program modules as needed, for example due to the configuration requirements of the corresponding hardware or the convenience of software implementation; that is, the internal structure of the weakly supervised video object segmentation system may be divided into different program modules to perform all or part of the functions described above.
As shown in FIG. 5, in another embodiment, a computer-readable storage medium 200 is further provided, in which a program 201 is stored in a memory; when the program 201 is executed by a processor 202, it implements the bounding-box-annotation-based weakly supervised video object segmentation method, specifically:
training a pseudo-label generation model based on the PReMVOS model on an image segmentation dataset;
generating corresponding pseudo mask labels frame by frame from the video data and the bounding box annotations with the pseudo-label generation model;
and training a video object segmentation model on the generated pseudo mask labels with the co-teaching algorithm.
Those skilled in the art will understand that all or part of the processes of the above method embodiments may be implemented by a computer program, which may be stored in a non-volatile computer-readable storage medium and which, when executed, may include the processes of the above method embodiments. Any reference to memory, storage, database, or other media used in the embodiments provided herein may include non-volatile and/or volatile memory.
The technical features of the above embodiments can be combined arbitrarily; for brevity, not all possible combinations of these technical features are described, but any such combination should be considered within the scope of this specification as long as it contains no contradiction.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to them; any change, modification, substitution, combination, or simplification made without departing from the spirit and principle of the present invention shall be regarded as an equivalent replacement and is intended to be included in the protection scope of the present invention.

Claims (10)

1. A weakly supervised video object segmentation method based on bounding box annotation, characterized by comprising the following steps:
training a pseudo-label generation model based on the PReMVOS model on an image segmentation dataset, wherein the input of the pseudo-label generation model is an original video and its corresponding bounding box annotations, and the output is pseudo mask labels;
generating corresponding pseudo mask labels frame by frame from the video data and the bounding box annotations with the pseudo-label generation model;
training a video object segmentation model on the generated pseudo mask labels with a co-teaching algorithm, and performing object segmentation on video data with the trained video object segmentation model to obtain object segmentation results; in the co-teaching algorithm, two networks with the same structure but different parameters each screen cleaner data for the other in every iteration of the training stage, providing it to the other for training and thereby alleviating the influence of noisy labels.
2. The weakly supervised video object segmentation method based on bounding box annotation according to claim 1, wherein training the pseudo-label generation model based on the PReMVOS model on an image segmentation dataset means designing the pseudo-label generation model around the refinement module of the PReMVOS model and training it on the Mapillary image segmentation dataset, specifically:
the input of the pseudo-label generation model is an original image and its corresponding bounding box annotation, and the original image is concatenated with the binary map corresponding to the bounding box to obtain a four-channel raw input; on the Mapillary dataset, the bounding box annotations are derived from the original fine mask labels, i.e., each mask takes the tightest box that completely contains it as its bounding box annotation; pixels inside the box have value 1 in the binary map and all other pixels are 0;
the annotated box is appropriately enlarged to obtain a cropping region, and the raw input is cropped accordingly to obtain a cropped image;
the cropped image is fed into a segmentation network, which outputs the segmentation mask corresponding to the target object.
3. The weakly supervised video object segmentation method based on bounding box annotation according to claim 2, wherein the annotated box is appropriately enlarged as follows:
the box is extended by n pixels in each of the four directions: up, down, left, and right.
4. The weakly supervised video object segmentation method based on bounding box annotation according to claim 1, wherein the segmentation network used by the pseudo-label generation model has a DeepLab-v3+ structure, and a pixel-wise cross entropy function is used as the loss function for model training.
5. The weakly supervised video object segmentation method based on bounding box annotation according to claim 1, wherein generating corresponding pseudo mask labels frame by frame from the video data and the bounding box annotations with the pseudo-label generation model specifically comprises:
converting the video data in the YouTube-VOS dataset into image frames with a video-to-image conversion tool;
feeding each image frame and its corresponding bounding box annotations into the pseudo-label generation model to obtain the pseudo mask labels of each frame;
when a frame contains multiple target objects, processing the objects one by one when generating the pseudo mask labels to obtain the pseudo mask label of each object; if two pseudo mask labels overlap, the overlapping region is assigned to the pseudo mask label with the smaller area.
6. The weakly supervised video object segmentation method based on bounding box annotation according to claim 5, wherein the ffmpeg tool is used to convert the video data into image frames.
7. The weakly supervised video object segmentation method based on bounding box annotation according to claim 5, wherein training the video object segmentation model on the generated pseudo mask labels with the co-teaching algorithm specifically comprises:
randomly initializing two video object segmentation models with the same structure but different parameters, denoted model A and model B, and pre-training models A and B on the YouTube-VOS dataset with the generated pseudo mask labels using a frame-by-frame, pixel-wise cross entropy loss function, where the training procedure ensures that models A and B use different training samples in each training iteration;
after the preliminary training is finished, entering the co-teaching training stage: in each training iteration, sampling a batch of training samples, segmenting them with models A and B respectively, and computing the frame-by-frame, pixel-wise cross entropy loss values between each segmentation result and the generated pseudo mask labels; for the output of model A, sorting the cross entropy loss values of the pixels inside the annotated boxes, taking the R(T)% of pixels with the smallest loss values, and training model B with the pseudo mask labels corresponding to those pixels, where

R(T) = τ · min(T / T_k, 1)

in which T is the current training iteration step number, T_k is a parameter controlling the rate of increase of R(T), and τ is a parameter controlling the maximum value of R(T);
for the output of model B, sorting the cross entropy loss values of the pixels inside the annotated boxes, taking the R(T)% of pixels with the smallest loss values, and training model A with the pseudo mask labels corresponding to those pixels;
for pixels outside the annotated boxes, training models A and B with those pixels treated as the background category;
when the predictions of model A and model B gradually converge, finishing training and taking model A as the final video object segmentation model.
8. The method according to claim 7, wherein the parameter R(T)% grows gradually as the number of training iterations increases; that is, more and more pseudo mask labels are selected to participate in training as training proceeds.
9. A weakly supervised video object segmentation system based on bounding box annotation, applied to the weakly supervised video object segmentation method based on bounding box annotation of any one of claims 1 to 8, characterized by comprising a pseudo-label generation model training module, a pseudo mask label generation module, and a segmentation model training module;
the pseudo-label generation model training module is used to train the pseudo-label generation model based on the PReMVOS model on the image segmentation dataset;
the pseudo mask label generation module is used to generate corresponding pseudo mask labels frame by frame from the video data and the bounding box annotations with the pseudo-label generation model;
and the segmentation model training module is used to train the video object segmentation model on the generated pseudo mask labels with the co-teaching algorithm.
10. A computer-readable storage medium containing a program, wherein the program, when executed, implements the bounding-box-annotation-based weakly supervised video object segmentation method of any one of claims 1 to 8.
CN202211322815.2A 2022-10-27 2022-10-27 Weakly supervised video object segmentation method and device based on bounding box annotation Pending CN115761574A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211322815.2A 2022-10-27 2022-10-27 Weakly supervised video object segmentation method and device based on bounding box annotation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211322815.2A 2022-10-27 2022-10-27 Weakly supervised video object segmentation method and device based on bounding box annotation

Publications (1)

Publication Number Publication Date
CN115761574A 2023-03-07

Family

Family ID: 85353528

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211322815.2A Pending 2022-10-27 2022-10-27 Weakly supervised video object segmentation method and device based on bounding box annotation

Country Status (1)

Country Link
CN (1) CN115761574A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116524297A (en) * 2023-04-28 2023-08-01 迈杰转化医学研究(苏州)有限公司 Weak supervision learning training method based on expert feedback
CN116524297B (en) * 2023-04-28 2024-02-13 迈杰转化医学研究(苏州)有限公司 Weak supervision learning training method based on expert feedback
CN116385466A (en) * 2023-05-05 2023-07-04 北京信息科技大学 Method and system for dividing targets in image based on boundary box weak annotation
CN116385466B (en) * 2023-05-05 2024-06-21 北京信息科技大学 Method and system for dividing targets in image based on boundary box weak annotation


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination