CN108898618B - Weakly supervised video object segmentation method and device - Google Patents


Info

Publication number
CN108898618B
CN108898618B (application CN201810573374.0A)
Authority
CN
China
Prior art keywords
video
test
frame
video object
object segmentation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810573374.0A
Other languages
Chinese (zh)
Other versions
CN108898618A (en)
Inventor
张宗璞
马汝辉
华扬
宋涛
管海兵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority to CN201810573374.0A priority Critical patent/CN108898618B/en
Publication of CN108898618A publication Critical patent/CN108898618A/en
Application granted granted Critical
Publication of CN108898618B publication Critical patent/CN108898618B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T7/251Analysis of motion using feature-based methods, e.g. the tracking of corners or segments involving models
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/11Region-based segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20092Interactive image processing based on input by user
    • G06T2207/20104Interactive definition of region of interest [ROI]

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a weakly supervised video object segmentation method, which comprises: constructing a video object segmentation model, inputting the first frame of a test video together with a test object bounding box, and pre-training the video object segmentation model based on an iterative algorithm; tracking the test object bounding box in each frame after the first frame of the test video; performing pixel-level prediction on each frame after the first frame of the test video to generate an image mask containing the foreground and background information of the current frame; and optimizing the image mask based on the tracking result to obtain the final object segmentation result for the current frame. The invention also discloses a weakly supervised video object segmentation apparatus, which comprises a weakly supervised video object segmentation pre-training module, a video object tracking module, a video object segmentation testing module and a video object segmentation optimization module. The method reduces the labor cost required for video object segmentation and improves video object segmentation performance.

Description

Weakly supervised video object segmentation method and device
Technical Field
The invention relates to the field of video processing in computer vision, and in particular to a weakly supervised video object segmentation method and apparatus.
Background
Video Object Segmentation (VOS) is one of the active research problems in the field of computer video processing. Video object segmentation generates, for each frame in a video, an image mask containing object foreground and background information. The masks generated by a video object segmentation framework have a wide range of applications, such as video editing, automatic driving, video monitoring, and content-based video coding. Currently, the mainstream methods in this field include supervised (Supervised VOS), unsupervised (Unsupervised VOS), and semi-supervised (Semi-Supervised VOS) video object segmentation frameworks.
A supervised video object segmentation framework assumes that all video frames in the current test sequence have manually calibrated prior knowledge, and cooperatively generates and refines the image mask for video object segmentation through interaction with a user. Such methods are not suitable for automated video object segmentation tasks, as they typically perform poorly in the absence of human intervention. An unsupervised video object segmentation framework performs object segmentation directly on each video frame, assuming no prior knowledge of the current video sequence. Lacking contextual information about the current sequence, such frameworks often misinterpret irrelevant objects and therefore tend to perform poorly.
A semi-supervised video object segmentation framework assumes that the manual calibration information for the first frame of the current test sequence is already given, and improves its testing accuracy on the current video sequence by learning the information in the first frame. However, similar to the supervised framework, the semi-supervised framework requires a manually calibrated image mask for the first frame. Since this calibration work is time-consuming and labor-intensive, the practical application of such frameworks is limited. Meanwhile, because a semi-supervised framework can hardly avoid Over-Fitting while learning the first frame, the object masks it produces for frames after the first one are often incomplete, which significantly degrades video object segmentation performance.
In addition to video object segmentation, Video Object Tracking (VOT) is another active research problem in the field of computer video processing. Video object tracking generates a bounding box surrounding the test object for each frame in the video, and continuously updates the position and size of the bounding box according to the relation between consecutive frames of the video.
Therefore, those skilled in the art are dedicated to developing a weakly supervised video object segmentation method and apparatus, in which an object bounding box is fused into the video object segmentation apparatus and video object tracking is used to assist video object segmentation, thereby reducing the manual calibration cost, avoiding the introduction of irrelevant objects, and improving video object segmentation performance.
Disclosure of Invention
In view of the above defects in the prior art, the technical problems to be solved by the present invention are how to reduce the manual calibration cost, how to avoid the introduction of irrelevant objects, and how to improve the video object segmentation performance.
In order to achieve the above object, the present invention provides a weakly supervised video object segmentation method, comprising the following steps:
S01, constructing a video object segmentation model, inputting the first frame of a test video and a test object bounding box in the first frame, and pre-training the video object segmentation model based on an iterative algorithm;
S02, tracking the test object bounding box in each frame after the first frame of the test video, and updating the test object bounding box;
S03, based on the test object bounding box output in step S02, performing pixel-level prediction on each frame after the first frame of the test video, and generating an image mask containing foreground and background information of the current frame;
S04, based on the result output in step S02, optimizing the image mask output in step S03 to obtain the final object segmentation result for the current frame.
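The four steps S01 to S04 can be read as a per-frame processing loop. The following is a minimal runnable sketch in which the model, tracker, and refinement step are toy stand-ins (all names and behaviors here are illustrative assumptions, not the patent's components); only the control flow follows the method:

```python
# Minimal runnable sketch of the S01-S04 loop. All components are toy
# stand-ins (assumptions), not the patent's actual tracker or network.

class ToyModel:
    def predict(self, frame, box):
        # Pretend segmentation: mark everything inside the box as foreground.
        x, y, w, h = box
        mask = [[0] * len(frame[0]) for _ in frame]
        for r in range(y, y + h):
            for c in range(x, x + w):
                mask[r][c] = 1
        return mask

def pretrain(first_frame, first_box):                # S01 (stub)
    return ToyModel()

def track(frame, box):                               # S02 (stub): box unchanged
    return box

def refine(mask, box):                               # S04 (stub): zero outside box
    x, y, w, h = box
    return [[v if x <= c < x + w and y <= r < y + h else 0
             for c, v in enumerate(row)] for r, row in enumerate(mask)]

def segment_video(frames, first_box):
    model = pretrain(frames[0], first_box)           # S01: pre-train on frame 1
    box, masks = first_box, []
    for frame in frames[1:]:
        box = track(frame, box)                      # S02: update bounding box
        mask = model.predict(frame, box)             # S03: pixel-level prediction
        masks.append(refine(mask, box))              # S04: box-guided optimization
    return masks
```

The real components would be a tracking network and a segmentation network; the loop structure is the point of the sketch.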
Further, in step S01, the test object bounding box in the first frame is obtained by manual calibration.
Further, in step S01, pre-training the video object segmentation model based on the iterative algorithm comprises:
s11, generating an image mask for a first frame of the test video by using the current video object segmentation model;
s12, optimizing the image mask of the first frame of the test video based on the test object bounding box in the first frame;
s13, training the current video object segmentation model by using the optimized image mask;
s14 repeats steps S11 to S13, and ends after the number of iterations is reached.
Further, in step S12, the image mask of the first frame of the test video is optimized by deleting irrelevant objects and completing the missing parts of the test object.
Further, in step S02, the test object bounding box includes position information and size information of the test object.
Further, in step S03, the predicted range is a sub-region near the test object, which is given by the test object bounding box.
Further, in step S04, optimizing the image mask output in step S03 comprises:
removing irrelevant objects;
optimizing the edges of the test object according to the test object bounding box;
and smoothing the defects of the test object according to the test object bounding box.
The invention also discloses a weakly supervised video object segmentation apparatus, which comprises a weakly supervised video object segmentation pre-training module, a video object tracking module, a video object segmentation testing module and a video object segmentation optimization module:
the weakly supervised video object segmentation pre-training module pre-trains the video object segmentation model using an iterative algorithm, taking the test object bounding box and the first frame of the test video as input;
the video object tracking module tracks the test object bounding box in each frame after the first frame of the test video, so as to accurately predict the position and size of the object;
the video object segmentation testing module performs pixel-level prediction on each frame after the first frame of the test video to generate an image mask distinguishing foreground from background;
the video object segmentation optimization module optimizes the image mask generated by the video object segmentation testing module using the test object bounding box generated by the video object tracking module.
Further, the workflow of the weakly supervised video object segmentation apparatus comprises the following steps:
Step 1, the weakly supervised video object segmentation apparatus starts to operate; the pre-training module generates, based on an iterative algorithm, a first-frame image mask used for pre-training the video segmentation model from the test object bounding box and the first frame of the test video, and executes the pre-training;
Step 2, the video object tracking module tracks the test object bounding box in each frame after the first frame of the test video and transmits the tracking result to the video object segmentation testing module;
Step 3, based on the test object bounding box given by the video object tracking module, the video object segmentation testing module performs pixel-level prediction on the sub-region near the test object in each frame after the first frame, generates an image mask containing foreground and background information, and transmits the image mask to the video object segmentation optimization module;
Step 4, the video object segmentation optimization module optimizes the image mask generated by the testing module using the test object bounding box output by the tracking module, generating the final object segmentation result for the current frame.
Further, if the test video still has frames to be tested, the weakly supervised video object segmentation apparatus repeats steps 2 to 4.
The weakly supervised video object segmentation method and apparatus disclosed by the invention have the following technical effects:
(1) The invention only requires a manually calibrated object bounding box for the first frame of the current test sequence, instead of a complete image mask, which significantly reduces the manual calibration cost required to run the framework and promotes the application of related work in real life.
(2) The invention can learn the context information of the current sequence from the first frame of the video sequence, avoiding the defects caused by irrelevant objects and improving video object segmentation test performance.
(3) The video object segmentation optimization module assisted by video object tracking can use the object position and size information from the object bounding box to avoid incomplete object masks. Meanwhile, based on the object position, the possibility of introducing irrelevant objects into the object mask is reduced, further improving video object segmentation performance.
The conception, the specific structure and the technical effects of the present invention will be further described with reference to the accompanying drawings to fully understand the objects, the features and the effects of the present invention.
Drawings
FIG. 1 is a flow chart of a video object segmentation method according to a preferred embodiment of the present invention;
FIG. 2 is a flow chart of an iterative algorithm according to a preferred embodiment of the present invention;
FIG. 3 is a block diagram of a video object segmentation apparatus according to a preferred embodiment of the present invention;
Fig. 4 is a schematic flowchart of the video object segmentation apparatus according to a preferred embodiment of the present invention.
Detailed Description
The technical contents of the preferred embodiments of the present invention will be more clearly and easily understood by referring to the drawings attached to the specification. The present invention may be embodied in many different forms of embodiments and the scope of the invention is not limited to the embodiments set forth herein.
Fig. 1 is a schematic flowchart of a weakly supervised video object segmentation method according to a preferred embodiment of the present invention, comprising the following steps:
S01, constructing a video object segmentation model, inputting the first frame of a test video and a test object bounding box in the first frame, and pre-training the video object segmentation model based on an iterative algorithm;
In this embodiment, the test object bounding box is given by manual calibration instead of a complete image mask, which significantly reduces the required manual calibration cost and benefits the application of related work in real life.
S02, tracking the test object bounding box in each frame after the first frame of the test video, and updating the test object bounding box;
video object tracking is a hot research problem in the field of computer video processing, a bounding box surrounding a test object is generated for each frame in a test video, and the position and the size of the bounding box of the test object are continuously updated according to the relation between the front frame and the rear frame of the test video.
S03, based on the test object bounding box output in step S02, performing pixel-level prediction on each frame after the first frame of the test video, and generating an image mask containing foreground and background information of the current frame;
The sub-region near the test object is given by the object bounding box maintained by the video object tracking module; predicting only this sub-region reduces, at the source, the possibility of introducing irrelevant objects into the current frame's image mask.
S04, based on the result output in step S02, optimizing the image mask output in step S03 to obtain the final object segmentation result for the current frame;
The optimization mainly comprises: removing irrelevant objects, optimizing the edges of the test object in the image mask according to the test object bounding box, and smoothing the defects of the test object in the image mask according to the test object bounding box. The optimized image mask serves as the final object segmentation result for the current frame.
Fig. 2 is a schematic flowchart of the iterative algorithm according to a preferred embodiment of the present invention; the algorithm pre-trains the video object segmentation model and comprises the following steps:
s11, generating an image mask for a first frame of the test video by using the current video object segmentation model;
s12, optimizing the image mask of the first frame of the test video based on the test object bounding box in the first frame;
s13, training the current video object segmentation model by using the optimized image mask;
s14 repeats steps S11 to S13, and ends after the number of iterations is reached.
In this embodiment, the iterative algorithm first generates an image mask for the first frame of the current test video using the original object segmentation model. The image mask of the first frame is then optimized using the test object bounding box, including deleting irrelevant objects and completing the missing parts of the test object. The optimized first-frame image mask is used as training calibration data to train the original object segmentation model, after which the workflow of generating the first-frame image mask, optimizing it, and training the current video object segmentation model is repeated. After N iterations, the final optimized mask is used as the prediction for the first frame of the current test video; more iterations mean a longer pre-training time but better performance of the pre-trained video object segmentation model on subsequent video frames. In this embodiment, the pseudo code of the main routine of the iterative algorithm is as follows:
(The pseudo code appears only as an image, labeled BDA0001686416090000051, in the original publication and is not reproduced here.)
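The loop described above can be sketched as follows. The toy model and the box-based optimization below are stand-ins (assumptions); only the S11 to S14 loop structure follows the description:

```python
# Minimal runnable sketch of the iterative pre-training loop (S11-S14).
# The "model" is a toy stand-in (an assumption): a single weight stands
# in for network weights. Only the loop structure follows the description.

class ToySegModel:
    def __init__(self):
        self.weight = 0.0                       # stands in for network weights

    def generate_mask(self, frame, box):
        # S11: toy prediction inside the box, scaled by the current weight.
        x, y, w, h = box
        return [[self.weight if x <= c < x + w and y <= r < y + h else 0.0
                 for c in range(len(frame[0]))] for r in range(len(frame))]

    def train_on(self, frame, target):
        # S13: one toy "training" step pulling the weight toward the target.
        flat = [v for row in target for v in row if v > 0]
        goal = max(flat) if flat else 0.0
        self.weight += 0.5 * (goal - self.weight)

def optimize_with_box(mask, box):
    # S12: zero pixels outside the box (irrelevant objects) and set pixels
    # inside to full foreground (completing missing parts); a toy reading
    # of the optimization described above.
    x, y, w, h = box
    return [[1.0 if x <= c < x + w and y <= r < y + h else 0.0
             for c in range(len(mask[0]))] for r in range(len(mask))]

def iterative_pretrain(model, first_frame, first_box, n_iters):
    for _ in range(n_iters):                             # S14: repeat N times
        mask = model.generate_mask(first_frame, first_box)   # S11
        mask = optimize_with_box(mask, first_box)            # S12
        model.train_on(first_frame, mask)                    # S13
    return model
```

Each pass trains on the box-optimized mask, so the toy weight converges toward the target, mirroring how more iterations improve the pre-trained model at the cost of longer pre-training.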
Fig. 3 is a system diagram of the video object segmentation apparatus according to a preferred embodiment of the present invention, which comprises a weakly supervised video object segmentation pre-training module, a video object tracking module, a video object segmentation testing module and a video object segmentation optimization module:
the weakly supervised video object segmentation pre-training module pre-trains the video object segmentation model using an iterative algorithm, taking the test object bounding box and the first frame of the test video as input;
the video object tracking module tracks the test object bounding box in each frame after the first frame of the test video, so as to accurately predict the position and size of the object;
the video object segmentation testing module performs pixel-level prediction on each frame after the first frame of the test video to generate an image mask distinguishing foreground from background;
the video object segmentation optimization module optimizes the image mask generated by the video object segmentation testing module using the test object bounding box generated by the video object tracking module.
Fig. 4 is a schematic flowchart of the operation of the video object segmentation apparatus according to a preferred embodiment of the present invention, including:
Step 1, the weakly supervised video object segmentation apparatus starts to operate; the pre-training module generates, based on an iterative algorithm, a first-frame image mask used for pre-training the video segmentation model from the test object bounding box and the first frame of the test video, and executes the pre-training;
Step 2, the video object tracking module tracks the test object bounding box in each frame after the first frame of the test video and transmits the tracking result to the video object segmentation testing module;
Step 3, based on the test object bounding box given by the video object tracking module, the video object segmentation testing module performs pixel-level prediction on the sub-region near the test object in each frame after the first frame, generates an image mask containing foreground and background information, and transmits the image mask to the video object segmentation optimization module;
Step 4, the video object segmentation optimization module optimizes the image mask generated by the testing module using the test object bounding box output by the tracking module, generating the final object segmentation result for the current frame; if the test video still has frames to be tested, steps 2 to 4 are repeated.
In this embodiment, the weakly supervised video object segmentation pre-training module only requires manual calibration of the test object bounding box, rather than the image mask of the first frame of the test video, which significantly reduces the manual calibration cost; meanwhile, the module can learn the context information of the current sequence from the first frame of the test video, avoiding the introduction of irrelevant objects and improving the performance of the video object segmentation apparatus.
In this embodiment, a video object segmentation optimization module assisted by video object tracking is introduced; by using the position and size information of the test object bounding box, incomplete current-frame image masks can be avoided.
The foregoing detailed description of the preferred embodiments of the invention has been presented. It should be understood that numerous modifications and variations could be devised by those skilled in the art in light of the present teachings without departing from the inventive concepts. Therefore, the technical solutions available to those skilled in the art through logic analysis, reasoning and limited experiments based on the prior art according to the concept of the present invention should be within the scope of protection defined by the claims.

Claims (10)

1. A weakly supervised video object segmentation method, characterized by comprising the following steps:
S01, constructing a video object segmentation model, inputting the first frame of a test video and a test object bounding box in the first frame, and pre-training the video object segmentation model based on an iterative algorithm;
S02, tracking the test object bounding box in each frame after the first frame of the test video, and updating the test object bounding box;
S03, based on the test object bounding box output in step S02, performing pixel-level prediction on each frame after the first frame of the test video, and generating an image mask containing foreground and background information of the current frame;
S04, based on the result output in step S02, optimizing the image mask output in step S03 to obtain the final object segmentation result for the current frame.
2. The weakly supervised video object segmentation method of claim 1, wherein in step S01 the test object bounding box in the first frame is obtained by manual calibration.
3. The weakly supervised video object segmentation method of claim 1, wherein in step S01 pre-training the video object segmentation model based on the iterative algorithm comprises:
s11, generating an image mask for a first frame of the test video by using the current video object segmentation model;
s12, optimizing the image mask of the first frame of the test video based on the test object bounding box in the first frame;
s13, training the current video object segmentation model by using the optimized image mask;
s14 repeats steps S11 to S13, and ends after the number of iterations is reached.
4. The weakly supervised video object segmentation method of claim 3, wherein in step S12 the image mask of the first frame of the test video is optimized by removing irrelevant objects and completing the missing parts of the test object.
5. The weakly supervised video object segmentation method of claim 1, wherein in step S02 the test object bounding box includes position information and size information of the test object.
6. The weakly supervised video object segmentation method of claim 1, wherein in step S03, the predicted range is a sub-region in the vicinity of the test object, the sub-region being given by the test object bounding box.
7. The weakly supervised video object segmentation method of claim 1, wherein in step S04 optimizing the image mask output in step S03 comprises:
removing irrelevant objects;
optimizing the edges of the test object according to the test object bounding box;
and smoothing the defects of the test object according to the test object bounding box.
8. A weakly supervised video object segmentation apparatus, characterized by comprising a weakly supervised video object segmentation pre-training module, a video object tracking module, a video object segmentation testing module and a video object segmentation optimization module, wherein:
the weakly supervised video object segmentation pre-training module pre-trains the video object segmentation model using an iterative algorithm, taking the test object bounding box and the first frame of the test video as input;
the video object tracking module tracks the test object bounding box in each frame after the first frame of the test video, so as to accurately predict the position and size of the object;
the video object segmentation testing module performs pixel-level prediction on each frame after the first frame of the test video to generate an image mask distinguishing foreground from background;
the video object segmentation optimization module optimizes the image mask generated by the video object segmentation testing module using the test object bounding box generated by the video object tracking module.
9. The weakly supervised video object segmentation apparatus of claim 8, wherein the workflow comprises:
Step 1, the weakly supervised video object segmentation apparatus starts to operate; the pre-training module generates, based on an iterative algorithm, a first-frame image mask used for pre-training the video segmentation model from the test object bounding box and the first frame of the test video, and executes the pre-training;
Step 2, the video object tracking module tracks the test object bounding box in each frame after the first frame of the test video and transmits the tracking result to the video object segmentation testing module;
Step 3, based on the test object bounding box given by the video object tracking module, the video object segmentation testing module performs pixel-level prediction on the sub-region near the test object in each frame after the first frame, generates an image mask containing foreground and background information, and transmits the image mask to the video object segmentation optimization module;
Step 4, the video object segmentation optimization module optimizes the image mask generated by the testing module using the test object bounding box output by the tracking module, generating the final object segmentation result for the current frame.
10. The apparatus of claim 9, wherein if there are frames to be tested remaining in the test video, steps 2 to 4 are repeated.
CN201810573374.0A 2018-06-06 2018-06-06 Weakly supervised video object segmentation method and device Active CN108898618B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810573374.0A CN108898618B (en) 2018-06-06 2018-06-06 Weakly supervised video object segmentation method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810573374.0A CN108898618B (en) 2018-06-06 2018-06-06 Weakly supervised video object segmentation method and device

Publications (2)

Publication Number Publication Date
CN108898618A CN108898618A (en) 2018-11-27
CN108898618B true CN108898618B (en) 2021-09-24

Family

ID=64343932

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810573374.0A Active CN108898618B (en) 2018-06-06 2018-06-06 Weakly supervised video object segmentation method and device

Country Status (1)

Country Link
CN (1) CN108898618B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109872333B (en) * 2019-02-20 2021-07-06 腾讯科技(深圳)有限公司 Medical image segmentation method, medical image segmentation device, computer equipment and storage medium
CN110413838B (en) * 2019-07-15 2021-06-22 上海交通大学 Unsupervised video abstract model and establishing method thereof
CN111415358B (en) * 2020-03-20 2024-03-12 Oppo广东移动通信有限公司 Image segmentation method, device, electronic equipment and storage medium
CN112288755A (en) * 2020-11-26 2021-01-29 深源恒际科技有限公司 Video-based vehicle appearance component deep learning segmentation method and system

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105931253A (en) * 2016-05-16 2016-09-07 陕西师范大学 Image segmentation method combined with semi-supervised learning
CN106204597A (en) * 2016-07-13 2016-12-07 西北工业大学 Video segmentation method based on self-paced weakly supervised learning
CN106997597A (en) * 2017-03-22 2017-08-01 南京大学 Target tracking method based on supervised saliency detection
CN107301400A (en) * 2017-06-23 2017-10-27 深圳市唯特视科技有限公司 Semantically guided semi-supervised video image segmentation method
WO2017198909A1 (en) * 2016-05-20 2017-11-23 Curious Ai Oy Segmentation of data
CN107644429A (en) * 2017-09-30 2018-01-30 华中科技大学 Video segmentation method based on strong target constrained saliency

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105931253A (en) * 2016-05-16 2016-09-07 陕西师范大学 Image segmentation method combined with semi-supervised learning
WO2017198909A1 (en) * 2016-05-20 2017-11-23 Curious Ai Oy Segmentation of data
CN106204597A (en) * 2016-07-13 2016-12-07 西北工业大学 Video segmentation method based on self-paced weakly supervised learning
CN106997597A (en) * 2017-03-22 2017-08-01 南京大学 Target tracking method based on supervised saliency detection
CN107301400A (en) * 2017-06-23 2017-10-27 深圳市唯特视科技有限公司 Semantically guided semi-supervised video image segmentation method
CN107644429A (en) * 2017-09-30 2018-01-30 华中科技大学 Video segmentation method based on strong target constrained saliency

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Super-Trajectory for Video Segmentation; Wenguan Wang, et al; 2017 IEEE International Conference on Computer Vision; 2017-12-25; pp. 1680-1688 *
Research on Video Object Detection and Tracking under Dynamic Background; Li Hengnian; China Master's Theses Full-text Database, Information Science and Technology; 2017-02-15; I138-2837 *

Also Published As

Publication number Publication date
CN108898618A (en) 2018-11-27

Similar Documents

Publication Publication Date Title
CN108898618B (en) Weakly supervised video object segmentation method and device
CN111488789B (en) Pedestrian detection method and device for monitoring based on image analysis
US11200696B2 (en) Method and apparatus for training 6D pose estimation network based on deep learning iterative matching
CN111160335B (en) Image watermark processing method and device based on artificial intelligence and electronic equipment
CN110678877B (en) System and method for visual localization in test images
CN111160569A (en) Application development method and device based on machine learning model and electronic equipment
US11314989B2 (en) Training a generative model and a discriminative model
CN111950453A (en) Optional-shape text recognition method based on selective attention mechanism
CN111369572A (en) Weak supervision semantic segmentation method and device based on image restoration technology
US20240257423A1 (en) Image processing method and apparatus, and computer readable storage medium
US12061991B2 (en) Transfer learning with machine learning systems
CN112861785B (en) Instance segmentation and image restoration-based pedestrian re-identification method with shielding function
CN113569852A (en) Training method and device of semantic segmentation model, electronic equipment and storage medium
US20230281791A1 (en) Adaptive system and method for inspection of imaged items
CN114596252A (en) Hierarchical image decomposition for defect detection
CN116385466B (en) Method and system for dividing targets in image based on boundary box weak annotation
CN115223043A (en) Strawberry defect detection method and device, computer equipment and storage medium
CN115205586A (en) Knowledge distillation-based multi-self-supervision task fusion method and device and storage medium
CN114724140A (en) Strawberry maturity detection method and device based on YOLO V3
CN114186090A (en) Intelligent quality inspection method and system for image annotation data
CN117437647A (en) Oracle character detection method based on deep learning and computer vision
CN117274355A (en) Drainage pipeline flow intelligent measurement method based on acceleration guidance area convolutional neural network and parallel multi-scale unified network
US11881016B2 (en) Method and system for processing an image and performing instance segmentation using affinity graphs
CN116958526A (en) Target detection model training method, device, equipment and target detection method
WO2024026990A1 (en) Automatic iterative training method, system and device for recognition model, and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant