CN111091157B - Video self-supervision learning method based on shape-completion gap-filling task - Google Patents

Video self-supervision learning method based on shape-completion gap-filling task

Info

Publication number
CN111091157B
Authority
CN
China
Prior art keywords
video
segments
spatial
video segment
classifier
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911348018.XA
Other languages
Chinese (zh)
Other versions
CN111091157A (en)
Inventor
王伟平
罗德昭
刘畅
周宇
杨东宝
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Information Engineering of CAS
Original Assignee
Institute of Information Engineering of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Information Engineering of CAS
Priority to CN201911348018.XA
Publication of CN111091157A
Application granted
Publication of CN111091157B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Abstract

The invention provides a video self-supervision learning method based on a shape-completion gap-filling (cloze) task, which belongs to the field of digital video self-supervision.

Description

Video self-supervision learning method based on shape-completion gap-filling task
Technical Field
The invention belongs to the field of digital video self-supervision, and particularly relates to a video self-supervision learning method based on a shape-completion gap-filling task, i.e., a cloze-style fill-in-the-blank task.
Background
Convolutional neural networks have driven rapid progress in computer vision over the last few years. When solving vision tasks, models are usually initialized with neural networks pre-trained on large-scale datasets such as ImageNet and Kinetics. Such networks have rich feature representation capability, but obtaining them requires a large number of manual annotations; it is therefore desirable to learn rich feature representations in a self-supervised manner, without data annotation. At present, a relatively mature approach in the field of video self-supervision is to shuffle the order of video frames, use the original order as supervision information, and train a network to judge that order, thereby encouraging the network to learn the spatiotemporal characteristics of the video.
Existing training methods have the following drawbacks:
1. The fully supervised pre-training process requires a large number of manual labels. As the amount of data and the task complexity grow, annotating data consumes enormous manpower, making it impractical to annotate larger-scale, more complex datasets.
2. In training methods that use video frame ordering as supervision, some actions are periodic (e.g., running, rope skipping); ordering frames drawn from different periods as if they came from a single motion period can produce incorrect labels, which disturbs the network's understanding of the video content.
3. Some self-supervised algorithms are designed around a specific pretext task, so the learned features lack generalization ability.
4. The quality of self-supervised learning is usually judged only by using the learned model to initialize downstream tasks (such as action recognition and video retrieval); there is no intuitive way to evaluate how well the model has understood the video features through self-supervised learning.
Disclosure of Invention
The invention aims to provide a video self-supervision learning method based on a shape-completion gap-filling (cloze) task.
In order to achieve the purpose, the invention adopts the following technical scheme:
a video self-supervision learning method based on a complete shape filling task comprises the following steps:
randomly extracting a video sample clip with a fixed number of frames from an original video, and dividing the clip equally into a plurality of video segments;
selecting one video segment at random with equal probability, leaving its position blank, and taking the remaining unselected video segments as a question stem;
performing a temporal and/or spatial transformation operation on the selected video segment, and filling the transformed video segment back into the blank position;
inputting each video segment into a feature extraction network and extracting the deep network features of all the video segments;
inputting the deep network features into a classifier, judging which transformation operation was applied to the selected video segment, and adjusting the parameters of the feature extraction network and the classifier according to the judgment result, thereby completing the self-supervised learning of the video. An illustrative sketch of the data-preparation part of these steps follows.
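By way of non-limiting illustration, the following Python sketch shows one possible realization of the clip extraction, equal division, blank selection and fill-back steps described above. The 48-frame clip length, the (C, T, H, W) tensor layout and the function names sample_clip and make_cloze_sample are illustrative assumptions and are not prescribed by the method.

```python
# Illustrative sketch only: the 48-frame clip length, the (C, T, H, W) tensor
# layout and the three-segment split are assumptions, not requirements of the
# described method.
import random
import torch

def sample_clip(video: torch.Tensor, clip_len: int = 48) -> torch.Tensor:
    """Randomly extract a clip with a fixed number of frames from a video."""
    start = random.randrange(video.shape[1] - clip_len + 1)
    return video[:, start:start + clip_len]

def make_cloze_sample(clip: torch.Tensor, transform, num_segments: int = 3):
    """Split the clip, blank one segment at random, transform it, fill it back."""
    seg_len = clip.shape[1] // num_segments
    segments = [clip[:, i * seg_len:(i + 1) * seg_len] for i in range(num_segments)]
    blank_idx = random.randrange(num_segments)             # equal-probability selection
    segments[blank_idx] = transform(segments[blank_idx])   # fill the blank with the option
    return segments, blank_idx
```

The supervision signal is the identity of the transformation applied to the blanked segment (see the classifier sketch further below), not the blank position itself.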
Preferably, the spatial transformation operations include spatial rotation and spatial arrangement: spatial rotation rotates the video frames by a certain angle, and spatial arrangement cuts each video frame into a plurality of tiles and rearranges them.
Preferably, the spatial rotation rotates the video frames by 90 degrees, 180 degrees or 270 degrees, and the spatial arrangement slices each video frame into four tiles.
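A minimal sketch of these two spatial operations is given below; it assumes square frames with even side length so that rotation and tiling preserve the segment shape, and the function names are illustrative only.

```python
# Sketch of the two spatial operations; assumes square frames with even side
# length so that rotation and 2 x 2 tiling keep the segment shape unchanged.
import random
import torch

def spatial_rotation(segment: torch.Tensor) -> torch.Tensor:
    """Rotate every frame of the (C, T, H, W) segment by 90, 180 or 270 degrees."""
    k = random.choice([1, 2, 3])                  # number of quarter turns
    return torch.rot90(segment, k, dims=(2, 3))

def spatial_arrangement(segment: torch.Tensor) -> torch.Tensor:
    """Cut each frame into four tiles (2 x 2) and swap two randomly chosen tiles."""
    c, t, h, w = segment.shape
    hh, hw = h // 2, w // 2
    tiles = [segment[:, :, :hh, :hw], segment[:, :, :hh, hw:],
             segment[:, :, hh:, :hw], segment[:, :, hh:, hw:]]
    i, j = random.sample(range(4), 2)             # permute exactly two tiles
    tiles[i], tiles[j] = tiles[j], tiles[i]
    top = torch.cat(tiles[:2], dim=3)             # reassemble the top row
    bottom = torch.cat(tiles[2:], dim=3)          # reassemble the bottom row
    return torch.cat([top, bottom], dim=2)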
Preferably, the temporal transformation operations include temporal remote shuffling, which replaces the selected video segment with a video segment located a time interval before or after it, and temporal adjacent shuffling, which divides the selected video segment into a plurality of sub-segments and randomly swaps at least two of them once.
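The temporal operations can be sketched in the same spirit. Note that remote shuffling needs access to the surrounding video rather than only the selected segment, so its signature differs; the 16-frame minimum gap is an arbitrary illustrative value, and the division into four sub-segments follows the example given later in the description.

```python
# Sketch of the two temporal operations; the 16-frame minimum gap is an
# arbitrary illustrative value, and the four sub-segments follow the example
# given later in the description.
import random
import torch

def temporal_remote_shuffling(video: torch.Tensor, start: int, seg_len: int,
                              min_gap: int = 16) -> torch.Tensor:
    """Replace the segment starting at `start` with one lying at least
    `min_gap` frames before or after it in the same (C, T, H, W) video."""
    t = video.shape[1]
    candidates = [s for s in range(t - seg_len + 1) if abs(s - start) >= min_gap]
    far_start = random.choice(candidates)         # assumes the video is long enough
    return video[:, far_start:far_start + seg_len]

def temporal_adjacent_shuffling(segment: torch.Tensor, num_sub: int = 4) -> torch.Tensor:
    """Divide the segment into sub-segments and randomly swap two of them once."""
    sub_len = segment.shape[1] // num_sub         # assumes divisibility
    subs = [segment[:, i * sub_len:(i + 1) * sub_len] for i in range(num_sub)]
    i, j = random.sample(range(num_sub), 2)
    subs[i], subs[j] = subs[j], subs[i]
    return torch.cat(subs, dim=1)
```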
Preferably, the extracted deep network features of the video segments are concatenated, and the concatenated deep network features are input into the classifier.
Preferably, the deep network features are input into a classifier, the classifier judges the probability values (in the range 0 to 1) of the deep network features belonging to different transformation operations, the transformation operation with the maximum probability value is regarded as the transformation operation performed on the selected video segment, and the parameters of the feature extraction network and the classifier are adjusted according to the probability values.
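To make the probability-based judgment and the parameter adjustment concrete, the sketch below pairs a toy 3D-convolutional stand-in for the feature extraction network with a linear classifier over the concatenated segment features and a cross-entropy update; the five-class label space, the 512-dimensional feature size and the SGD settings are assumptions made only for demonstration.

```python
# Illustrative classification and parameter-adjustment step; the toy backbone,
# the 512-dimensional feature size, the five-way label space and the SGD
# settings are assumptions made only for demonstration.
import torch
import torch.nn as nn
import torch.nn.functional as F

FEAT_DIM = 512
NUM_OPS = 5   # e.g. original, rotation, arrangement, remote shuffling, adjacent shuffling

backbone = nn.Sequential(                                  # stand-in feature extraction network
    nn.Conv3d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(16, FEAT_DIM))
classifier = nn.Linear(3 * FEAT_DIM, NUM_OPS)              # three segments, concatenated features
optimizer = torch.optim.SGD(
    list(backbone.parameters()) + list(classifier.parameters()), lr=0.01, momentum=0.9)

def training_step(segments, label):
    """`segments`: list of three (N, 3, T, H, W) tensors; `label`: (N,) operation ids."""
    feats = [backbone(s) for s in segments]                # shared-weight feature extraction
    logits = classifier(torch.cat(feats, dim=1))           # concatenate, then classify
    probs = F.softmax(logits, dim=1)                       # probability (0-1) per operation
    predicted_op = probs.argmax(dim=1)                     # operation with the largest probability
    loss = F.cross_entropy(logits, label)                  # compare against the true operation
    optimizer.zero_grad()
    loss.backward()                                        # adjust backbone and classifier parameters
    optimizer.step()
    return loss.item(), predicted_op
```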
A video self-supervised learning system based on a shape-completion gap-filling task, comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the program comprising instructions for performing the steps of the above method.
A computer readable storage medium storing a computer program comprising instructions which, when executed by a processor of a system, cause the system to perform the steps of the above method.
Drawings
Fig. 1 is a flow chart of the video self-supervision learning method based on a shape-completion gap-filling task.
FIGS. 2A-2E are graphs verifying the video spatiotemporal feature learning effect of the self-supervised method.
Fig. 3 is a visual comparison of video retrieval results between the video self-supervised learning method and other self-supervised methods.
Detailed Description
In order to make the technical scheme of the invention clearer and easier to understand, specific embodiments are described in detail below with reference to the accompanying drawings.
The embodiment provides a video self-supervision learning method based on a shape-completion gap-filling task, as shown in fig. 1, comprising the following steps:
1. randomly extracting a video sample clip with a fixed number of frames from an original video, and dividing it equally into three segments;
2. selecting one segment at random with equal probability, removing it to leave a blank, and taking the other two segments as the question stem;
3. performing a temporal or spatial transformation operation on the selected segment, and filling the transformed segment back into the blank position;
4. inputting the three segments into the same feature extraction network, extracting their respective deep network features, and concatenating the features;
5. inputting the concatenated features into a classifier, which judges the transformation operation applied to the selected segment.
In this way, the method simulates the blank-leaving step of a cloze test by withholding a video segment; it then creates answer options by applying spatio-temporal transformations to that segment; finally, the generated options are filled into the blank in turn, and video features are learned by predicting which preprocessing operation was applied to the video segment.
In order to train the model to select the correct answer from all options and thereby learn rich video features, the method creates options with spatial and temporal transformation operations, each option being designed to confuse the network along the corresponding feature dimension. Under this premise, the method employs four operations: spatial rotation, spatial arrangement, temporal remote shuffling and temporal adjacent shuffling.
To provide options that focus on spatial representation learning, the method employs spatial rotation and spatial arrangement. With spatial rotation, the video segment is rotated by 90, 180 or 270 degrees, forcing the model to learn orientation-related features. With spatial arrangement, each frame of the video segment is divided into four tiles, and two of the tiles are swapped to generate a new option. Swapping only two tiles yields an option that partially preserves the spatial structure, which prevents the model from recognizing the spatial disorder merely through low-level statistics.
To provide options that focus on temporal features, the method further employs two temporal operations. The first is temporal remote shuffling, in which the selected video segment is replaced by a segment at a larger temporal distance, forward or backward; since the backgrounds of frames within a reasonable temporal distance tend to be similar while the foregrounds differ, this option drives the model to learn more temporal information about the foreground. The second is temporal adjacent shuffling, in which the selected video segment is divided into four sub-segments and two of them are randomly swapped once. Unlike previous methods, not all sub-segments are shuffled; the difficulty is reduced by training the model to determine whether the segment has been shuffled rather than to predict the exact order.
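Putting the four operations together, a hypothetical option generator could map each operation, together with the untransformed original segment, to a class label that supervises the classifier. The interface below (a dictionary of transformation callables) is an illustrative assumption, and treating the original segment as its own class is an assumption consistent with the evaluation reported below, in which the model also recognizes the original video.

```python
# Hypothetical option generator: each operation (plus the untransformed
# original) is mapped to a class label that supervises the classifier.
import random

OPERATIONS = ["original", "spatial rotation", "spatial arrangement",
              "temporal remote shuffling", "temporal adjacent shuffling"]

def make_option(segment, transforms):
    """`transforms` maps an operation name to a callable, e.g. the sketches
    above; remote shuffling would be a closure over the full clip."""
    label = random.randrange(len(OPERATIONS))
    name = OPERATIONS[label]
    option = segment if name == "original" else transforms[name](segment)
    return option, label
```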
Verification experiments for the method of the invention (hereinafter referred to as VCP):
Extensive experiments were performed to evaluate the effectiveness of VCP and its application to downstream tasks. First, the representation learning of VCP is evaluated under different option configurations and data strategies; then VCP is further used for model-evaluation experiments; finally, the performance of VCP applied to downstream tasks (i.e., action recognition and video retrieval) is evaluated and compared with the latest methods. The experiments were performed on the UCF101 and HMDB51 datasets. UCF101 contains 13,320 videos in 101 action categories and presents challenging problems, including large intra-class variation, complex camera motion and cluttered backgrounds. HMDB51 contains 6,849 videos in 51 action categories, collected mainly from movies and websites.
Table 1 shows the extensibility of the model and the gains brought to the action recognition task by expanding the options.
TABLE 1 extensibility verification
Method Accuracy (%)
Random initialization 62.0
Spatial rotation 64.3
Spatial arrangement 63.4
Spatial rotation + arrangement 66.0
Temporal remote shuffling 67.8
Temporal adjacent shuffling 65.0
Temporal remote + adjacent shuffling 68.0
All operations 69.7
As can be seen from Table 1, self-supervised learning with the spatial rotation or spatial arrangement operation alone improves action recognition accuracy by 2.3% or 1.4% over random initialization. When the two spatial operations (rotation + arrangement) are used together, performance further improves to 66.0%. Self-supervised learning with temporal remote shuffling or temporal adjacent shuffling alone improves performance by 5.8% or 3.0%, and using both temporal operations together further improves it to 68.0%. Combining the spatial and temporal operations finally raises performance to 69.7%, a clear 7.7% improvement over random initialization. The experiments show that these options can be used flexibly, either alone or in combination, and that VCP learns more representative features as richer and complementary options are added.
FIGS. 2A-2E show how well models trained with different self-supervised tasks solve each spatio-temporal classification task; in each graph the abscissa is the number of training iterations and the ordinate is the accuracy. As can be seen, VCP predicts spatial rotation, spatial arrangement and temporal adjacent shuffling with high accuracy (around 90%), much better than the existing methods, and VCP also distinguishes the original video from the temporally remote-shuffled video better than the existing methods. It can also be seen that the VCP accuracy curves for the original video and for temporal remote shuffling are negatively correlated, indicating that these two classes are easily confused and difficult to separate. In contrast, the accuracy curves of the existing methods VCOP and 3D Cubic Puzzle diverge, meaning that neither method can distinguish the original video from the remote-shuffled one. The prior methods ST-Puzzle and S-Puzzle are superior to T-Puzzle and VCOP in classifying the spatial operations, and inferior to them in classifying the temporal operations. These results indicate that spatial representation learning is not consistent with temporal representation learning. VCP can classify both the spatial and the temporal transformation operations, and its recognition results are better than those of the existing methods.
Table 2 compares the action recognition performance of VCP with other models.
TABLE 2 Action recognition performance of self-supervised methods
Method UCF101 accuracy (%) HMDB51 accuracy (%)
Random initialization 61.8 24.7
VCOP 65.6 28.4
VCP 68.5 32.5
As can be seen from Table 2, VCP outperforms random initialization by 6.7% and 7.8% on the UCF101 and HMDB51 datasets respectively, and outperforms the best current method, VCOP, by 2.9% and 4.1%. This performance verifies that VCP learns richer and more discriminative features than the other methods.
Table 3 compares the video retrieval performance of VCP with other models.
TABLE 3 Video retrieval performance of self-supervised methods
Method Top1(%) Top5(%) Top10(%) Top20(%) Top50(%)
Random initialization 16.7 27.5 33.7 41.4 53.0
VCOP 12.5 29.0 39.0 50.6 66.9
VCP 17.3 31.5 42.0 52.6 67.7
As can be seen from Table 3, compared with random initialization and VCOP, VCP performs best on all evaluation metrics, namely the top-1, top-5, top-10, top-20 and top-50 retrieval accuracies, which demonstrates the superiority of VCP.
Fig. 3 shows a visualization of the video retrieval results. For the same query video, the dataset is searched using the features obtained by VCOP and by VCP respectively; a retrieval is counted as successful if the category of the retrieved video is consistent with that of the query video, and the more successful retrievals, the better the performance. The retrieval results of VCP are clearly better than those of VCOP.
The above embodiments are only intended to illustrate the technical solution of the present invention, but not to limit it, and a person skilled in the art can modify the technical solution of the present invention or substitute it with an equivalent, and the protection scope of the present invention is subject to the claims.

Claims (7)

1. A video self-supervision learning method based on a shape-completion gap-filling task is characterized by comprising the following steps:
randomly extracting a video sample clip with a fixed number of frames from an original video, and dividing the clip equally into a plurality of video segments;
selecting one video segment at random with equal probability, leaving its position blank, and taking the remaining unselected video segments as a question stem;
performing a temporal and/or spatial transformation operation on the selected video segment, and filling the transformed video segment back into the blank position;
inputting each video segment into a feature extraction network and extracting the deep network features of all the video segments;
inputting the deep network features into a classifier, judging through the classifier the probability values of the deep network features belonging to different transformation operations, regarding the transformation operation with the maximum probability value as the transformation operation performed on the selected video segment, adjusting the parameters of the feature extraction network and the classifier according to the probability values, and completing the self-supervised learning of the video.
2. The method of claim 1, wherein the spatial transformation operations include spatial rotation and spatial arrangement, the spatial rotation rotating the video frames by a certain angle, and the spatial arrangement cutting each video frame into a plurality of tiles and rearranging the tiles.
3. The method of claim 2, wherein the spatial rotation is a rotation of the video image by 90 degrees, 180 degrees, or 270 degrees.
4. The method of claim 1, wherein the temporal transformation operations comprise temporal remote shuffling, which replaces the selected video segment with a video segment located a time interval before or after it, and temporal adjacent shuffling, which divides the selected video segment into a plurality of sub-segments, at least two of which are randomly swapped once.
5. The method of claim 1, wherein the extracted deep network features of the video segments are concatenated and the concatenated deep network features are input into the classifier.
6. A video self-supervised learning system based on a shape-completion gap-filling task, comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the program comprising instructions for carrying out the steps of the method of any one of claims 1 to 5.
7. A computer-readable storage medium storing a computer program, characterized in that the computer program comprises instructions which, when executed by a processor of a system, cause the system to perform the steps of the method of any of claims 1 to 5.
CN201911348018.XA 2019-12-24 2019-12-24 Video self-supervision learning method based on shape-completion gap-filling task Active CN111091157B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911348018.XA CN111091157B (en) 2019-12-24 2019-12-24 Video self-supervision learning method based on shape-completion gap-filling task

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911348018.XA CN111091157B (en) 2019-12-24 2019-12-24 Video self-supervision learning method based on shape-completion gap-filling task

Publications (2)

Publication Number Publication Date
CN111091157A CN111091157A (en) 2020-05-01
CN111091157B (en) 2023-03-10

Family

ID=70396722

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911348018.XA Active CN111091157B (en) 2019-12-24 2019-12-24 Video self-supervision learning method based on shape-completion gap-filling task

Country Status (1)

Country Link
CN (1) CN111091157B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110070023A (en) * 2019-04-16 2019-07-30 上海极链网络科技有限公司 A kind of self-supervisory learning method and device based on sequence of motion recurrence

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10460765B2 (en) * 2015-08-26 2019-10-29 JBF Interlude 2009 LTD Systems and methods for adaptive and responsive video

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110070023A (en) * 2019-04-16 2019-07-30 上海极链网络科技有限公司 A kind of self-supervisory learning method and device based on sequence of motion recurrence

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Automatic classification method of video clips based on rough sets; Zeng Xiaoning et al.; Journal of Hebei Normal University of Science & Technology; 2009-03-15 (No. 01); full text *

Also Published As

Publication number Publication date
CN111091157A (en) 2020-05-01

Similar Documents

Publication Publication Date Title
CN110781843B (en) Classroom behavior detection method and electronic equipment
Zhu et al. Fine-grained video categorization with redundancy reduction attention
Boenisch et al. Tracking all members of a honey bee colony over their lifetime using learned models of correspondence
CN105590099B (en) A kind of more people's Activity recognition methods based on improvement convolutional neural networks
CN109978893A (en) Training method, device, equipment and the storage medium of image, semantic segmentation network
CN108537119B (en) Small sample video identification method
CN110363131B (en) Abnormal behavior detection method, system and medium based on human skeleton
Mascaro et al. Learning abnormal vessel behaviour from AIS data with Bayesian networks at two time scales
US11640714B2 (en) Video panoptic segmentation
CN110674790B (en) Abnormal scene processing method and system in video monitoring
CN104025117A (en) Temporal face sequences
CN111914778A (en) Video behavior positioning method based on weak supervised learning
CN113269103B (en) Abnormal behavior detection method, system, storage medium and equipment based on space map convolutional network
EP4085374A1 (en) System and method for group activity recognition in images and videos with self-attention mechanisms
US9830533B2 (en) Analyzing and exploring images posted on social media
CN115410119A (en) Violent movement detection method and system based on adaptive generation of training samples
Arinaldi et al. Cheating video description based on sequences of gestures
CN113111716A (en) Remote sensing image semi-automatic labeling method and device based on deep learning
Salem et al. Semantic image inpainting using self-learning encoder-decoder and adversarial loss
CN108416797A (en) A kind of method, equipment and the storage medium of detection Behavioral change
CN111091157B (en) Video self-supervision learning method based on shape-completion gap-filling task
Yin et al. Self-supervised patch localization for cross-domain facial action unit detection
CN109063732B (en) Image ranking method and system based on feature interaction and multi-task learning
CN115240141A (en) Method and system for identifying abnormal behavior of passenger in urban rail station
Pan et al. A deep learning based framework for UAV trajectory pattern recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant