CN111091157B - Video self-supervision learning method based on shape-completion gap-filling task - Google Patents

Video self-supervision learning method based on shape-completion gap-filling task

Info

Publication number
CN111091157B
Authority
CN
China
Prior art keywords
video
segments
spatial
video segment
classifier
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911348018.XA
Other languages
Chinese (zh)
Other versions
CN111091157A (en)
Inventor
王伟平
罗德昭
刘畅
周宇
杨东宝
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Information Engineering of CAS
Original Assignee
Institute of Information Engineering of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Information Engineering of CAS
Priority to CN201911348018.XA
Publication of CN111091157A
Application granted
Publication of CN111091157B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Abstract

The invention provides a video self-supervision learning method based on a shape-completion gap-filling (cloze) task, which belongs to the field of digital video self-supervision.

Description

Video self-supervision learning method based on shape-completion gap-filling task
Technical Field
The invention belongs to the field of digital video self-supervision, and particularly relates to a video self-supervision learning method based on a shape-completion gap-filling task, i.e., a cloze-style fill-in-the-blank task.
Background
Convolutional neural networks have driven rapid progress in computer vision over the last few years. When solving vision tasks, models are usually initialized with neural networks pre-trained on large-scale datasets such as ImageNet and Kinetics. Such networks have rich feature representation capability, but obtaining them requires a large number of manual annotations; it is therefore desirable to learn rich feature representations in a self-supervised manner, without data annotation. At present, a relatively mature approach in the field of video self-supervision is to shuffle the order of video frames, use the original order as supervision information, and train a network to judge that order, thereby encouraging the network to learn the spatiotemporal characteristics of the video.
Existing training methods have the following drawbacks:
1. The fully supervised pre-training process requires a large number of manual labels. As the amount of data and the task complexity grow, annotating data consumes enormous manpower, making it impractical to annotate larger-scale, more complex datasets.
2. In training methods that use video frame ordering as supervision, some actions are periodic (e.g., running, rope skipping); ordering frames drawn from different periods as if they came from a single motion period can produce incorrect labels, which disturbs the network's understanding of the video content.
3. Some self-supervised algorithms are designed around a specific pretext task, so the learned features lack generalization ability.
4. The quality of self-supervised learning is usually judged only by using the learned model to initialize downstream tasks (such as action recognition and video retrieval); there is no intuitive way to evaluate how well the model has understood the video features through self-supervised learning.
Disclosure of Invention
The invention aims to provide a video self-supervision learning method based on a shape-completion gap-filling (cloze) task.
In order to achieve the purpose, the invention adopts the following technical scheme:
a video self-supervision learning method based on a complete shape filling task comprises the following steps:
randomly extracting a video sample clip with a fixed number of frames from an original video, and dividing the clip equally into a plurality of video segments;
selecting one video segment at random with equal probability, leaving its position blank, and taking the remaining unselected video segments as a question stem;
performing a temporal and/or spatial transformation operation on the selected video segment, and filling the transformed video segment back into the blank position;
inputting each video segment into a feature extraction network and extracting the deep network features of all the video segments;
inputting the deep network features into a classifier, judging which transformation operation was applied to the selected video segment, and adjusting the parameters of the feature extraction network and the classifier according to the judgment result, thereby completing the self-supervised learning of the video. An illustrative sketch of the data-preparation part of these steps follows.
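By way of non-limiting illustration, the following Python sketch shows one possible realization of the clip extraction, equal division, blank selection and fill-back steps described above. The 48-frame clip length, the (C, T, H, W) tensor layout and the function names sample_clip and make_cloze_sample are illustrative assumptions and are not prescribed by the method.

```python
# Illustrative sketch only: the 48-frame clip length, the (C, T, H, W) tensor
# layout and the three-segment split are assumptions, not requirements of the
# described method.
import random
import torch

def sample_clip(video: torch.Tensor, clip_len: int = 48) -> torch.Tensor:
    """Randomly extract a clip with a fixed number of frames from a video."""
    start = random.randrange(video.shape[1] - clip_len + 1)
    return video[:, start:start + clip_len]

def make_cloze_sample(clip: torch.Tensor, transform, num_segments: int = 3):
    """Split the clip, blank one segment at random, transform it, fill it back."""
    seg_len = clip.shape[1] // num_segments
    segments = [clip[:, i * seg_len:(i + 1) * seg_len] for i in range(num_segments)]
    blank_idx = random.randrange(num_segments)             # equal-probability selection
    segments[blank_idx] = transform(segments[blank_idx])   # fill the blank with the option
    return segments, blank_idx
```

The supervision signal is the identity of the transformation applied to the blanked segment (see the classifier sketch further below), not the blank position itself.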
Preferably, the spatial transformation operations include spatial rotation and spatial arrangement: spatial rotation rotates the video frames by a certain angle, and spatial arrangement cuts each video frame into a plurality of tiles and rearranges them.
Preferably, the spatial rotation rotates the video frames by 90 degrees, 180 degrees or 270 degrees, and the spatial arrangement slices each video frame into four tiles.
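A minimal sketch of these two spatial operations is given below; it assumes square frames with even side length so that rotation and tiling preserve the segment shape, and the function names are illustrative only.

```python
# Sketch of the two spatial operations; assumes square frames with even side
# length so that rotation and 2 x 2 tiling keep the segment shape unchanged.
import random
import torch

def spatial_rotation(segment: torch.Tensor) -> torch.Tensor:
    """Rotate every frame of the (C, T, H, W) segment by 90, 180 or 270 degrees."""
    k = random.choice([1, 2, 3])                  # number of quarter turns
    return torch.rot90(segment, k, dims=(2, 3))

def spatial_arrangement(segment: torch.Tensor) -> torch.Tensor:
    """Cut each frame into four tiles (2 x 2) and swap two randomly chosen tiles."""
    c, t, h, w = segment.shape
    hh, hw = h // 2, w // 2
    tiles = [segment[:, :, :hh, :hw], segment[:, :, :hh, hw:],
             segment[:, :, hh:, :hw], segment[:, :, hh:, hw:]]
    i, j = random.sample(range(4), 2)             # permute exactly two tiles
    tiles[i], tiles[j] = tiles[j], tiles[i]
    top = torch.cat(tiles[:2], dim=3)             # reassemble the top row
    bottom = torch.cat(tiles[2:], dim=3)          # reassemble the bottom row
    return torch.cat([top, bottom], dim=2)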
Preferably, the temporal transformation operations include temporal remote shuffling, which replaces the selected video segment with a video segment located a time interval before or after it, and temporal adjacent shuffling, which divides the selected video segment into a plurality of sub-segments and randomly swaps at least two of them once.
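The temporal operations can be sketched in the same spirit. Note that remote shuffling needs access to the surrounding video rather than only the selected segment, so its signature differs; the 16-frame minimum gap is an arbitrary illustrative value, and the division into four sub-segments follows the example given later in the description.

```python
# Sketch of the two temporal operations; the 16-frame minimum gap is an
# arbitrary illustrative value, and the four sub-segments follow the example
# given later in the description.
import random
import torch

def temporal_remote_shuffling(video: torch.Tensor, start: int, seg_len: int,
                              min_gap: int = 16) -> torch.Tensor:
    """Replace the segment starting at `start` with one lying at least
    `min_gap` frames before or after it in the same (C, T, H, W) video."""
    t = video.shape[1]
    candidates = [s for s in range(t - seg_len + 1) if abs(s - start) >= min_gap]
    far_start = random.choice(candidates)         # assumes the video is long enough
    return video[:, far_start:far_start + seg_len]

def temporal_adjacent_shuffling(segment: torch.Tensor, num_sub: int = 4) -> torch.Tensor:
    """Divide the segment into sub-segments and randomly swap two of them once."""
    sub_len = segment.shape[1] // num_sub         # assumes divisibility
    subs = [segment[:, i * sub_len:(i + 1) * sub_len] for i in range(num_sub)]
    i, j = random.sample(range(num_sub), 2)
    subs[i], subs[j] = subs[j], subs[i]
    return torch.cat(subs, dim=1)
```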
Preferably, the extracted deep network features of the video segments are concatenated, and the concatenated deep network features are input into the classifier.
Preferably, the deep network features are input into a classifier, the classifier judges the probability values (in the range 0 to 1) of the deep network features belonging to different transformation operations, the transformation operation with the maximum probability value is regarded as the transformation operation performed on the selected video segment, and the parameters of the feature extraction network and the classifier are adjusted according to the probability values.
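To make the probability-based judgment and the parameter adjustment concrete, the sketch below pairs a toy 3D-convolutional stand-in for the feature extraction network with a linear classifier over the concatenated segment features and a cross-entropy update; the five-class label space, the 512-dimensional feature size and the SGD settings are assumptions made only for demonstration.

```python
# Illustrative classification and parameter-adjustment step; the toy backbone,
# the 512-dimensional feature size, the five-way label space and the SGD
# settings are assumptions made only for demonstration.
import torch
import torch.nn as nn
import torch.nn.functional as F

FEAT_DIM = 512
NUM_OPS = 5   # e.g. original, rotation, arrangement, remote shuffling, adjacent shuffling

backbone = nn.Sequential(                                  # stand-in feature extraction network
    nn.Conv3d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(16, FEAT_DIM))
classifier = nn.Linear(3 * FEAT_DIM, NUM_OPS)              # three segments, concatenated features
optimizer = torch.optim.SGD(
    list(backbone.parameters()) + list(classifier.parameters()), lr=0.01, momentum=0.9)

def training_step(segments, label):
    """`segments`: list of three (N, 3, T, H, W) tensors; `label`: (N,) operation ids."""
    feats = [backbone(s) for s in segments]                # shared-weight feature extraction
    logits = classifier(torch.cat(feats, dim=1))           # concatenate, then classify
    probs = F.softmax(logits, dim=1)                       # probability (0-1) per operation
    predicted_op = probs.argmax(dim=1)                     # operation with the largest probability
    loss = F.cross_entropy(logits, label)                  # compare against the true operation
    optimizer.zero_grad()
    loss.backward()                                        # adjust backbone and classifier parameters
    optimizer.step()
    return loss.item(), predicted_op
```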
A video self-supervised learning system based on a shape-completion gap-filling task, comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the program comprising instructions for performing the steps of the above method.
A computer readable storage medium storing a computer program comprising instructions which, when executed by a processor of a system, cause the system to perform the steps of the above method.
Drawings
Fig. 1 is a flow chart of the video self-supervision learning method based on a shape-completion gap-filling task.
FIGS. 2A-2E are graphs verifying the video spatiotemporal feature learning effect of the self-supervised method.
Fig. 3 is a visual comparison of video retrieval results between the video self-supervised learning method and other self-supervised methods.
Detailed Description
In order to make the technical scheme of the invention clearer and easier to understand, specific embodiments are described in detail below with reference to the accompanying drawings.
The embodiment provides a video self-supervision learning method based on a shape-completion gap-filling task, as shown in fig. 1, comprising the following steps:
1. randomly extracting a video sample clip with a fixed number of frames from an original video, and dividing it equally into three segments;
2. selecting one segment at random with equal probability, removing it to leave a blank, and taking the other two segments as the question stem;
3. performing a temporal or spatial transformation operation on the selected segment, and filling the transformed segment back into the blank position;
4. inputting the three segments into the same feature extraction network, extracting their respective deep network features, and concatenating the features;
5. inputting the concatenated features into a classifier, which judges the transformation operation applied to the selected segment.
In this way, the method simulates the blank-leaving step of a cloze test by withholding a video segment; it then creates answer options by applying spatio-temporal transformations to that segment; finally, the generated options are filled into the blank in turn, and video features are learned by predicting which preprocessing operation was applied to the video segment.
In order to train the model to select the correct answer from all options and thereby learn rich video features, the method creates options with spatial and temporal transformation operations, each option being designed to confuse the network along the corresponding feature dimension. Under this premise, the method employs four operations: spatial rotation, spatial arrangement, temporal remote shuffling and temporal adjacent shuffling.
To provide options that focus on spatial representation learning, the method employs spatial rotation and spatial arrangement. With spatial rotation, the video segment is rotated by 90, 180 or 270 degrees, forcing the model to learn orientation-related features. With spatial arrangement, each frame of the video segment is divided into four tiles, and two of the tiles are swapped to generate a new option. Swapping only two tiles yields an option that partially preserves the spatial structure, which prevents the model from recognizing the spatial disorder merely through low-level statistics.
To provide options that focus on temporal features, the method further employs two temporal operations. The first is temporal remote shuffling, in which the selected video segment is replaced by a segment at a larger temporal distance, forward or backward; since the backgrounds of frames within a reasonable temporal distance tend to be similar while the foregrounds differ, this option drives the model to learn more temporal information about the foreground. The second is temporal adjacent shuffling, in which the selected video segment is divided into four sub-segments and two of them are randomly swapped once. Unlike previous methods, not all sub-segments are shuffled; the difficulty is reduced by training the model to determine whether the segment has been shuffled rather than to predict the exact order.
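Putting the four operations together, a hypothetical option generator could map each operation, together with the untransformed original segment, to a class label that supervises the classifier. The interface below (a dictionary of transformation callables) is an illustrative assumption, and treating the original segment as its own class is an assumption consistent with the evaluation reported below, in which the model also recognizes the original video.

```python
# Hypothetical option generator: each operation (plus the untransformed
# original) is mapped to a class label that supervises the classifier.
import random

OPERATIONS = ["original", "spatial rotation", "spatial arrangement",
              "temporal remote shuffling", "temporal adjacent shuffling"]

def make_option(segment, transforms):
    """`transforms` maps an operation name to a callable, e.g. the sketches
    above; remote shuffling would be a closure over the full clip."""
    label = random.randrange(len(OPERATIONS))
    name = OPERATIONS[label]
    option = segment if name == "original" else transforms[name](segment)
    return option, label
```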
Verification experiments for the method of the invention (hereinafter referred to as VCP):
Extensive experiments were performed to evaluate the effectiveness of VCP and its application to downstream tasks. First, the representation learning of VCP is evaluated under different option configurations and data strategies; then VCP is further used for model-evaluation experiments; finally, the performance of VCP applied to downstream tasks (i.e., action recognition and video retrieval) is evaluated and compared with the latest methods. The experiments were performed on the UCF101 and HMDB51 datasets. UCF101 contains 13,320 videos in 101 action categories and presents challenging problems, including large intra-class variation, complex camera motion and cluttered backgrounds. HMDB51 contains 6,849 videos in 51 action categories, collected mainly from movies and websites.
Table 1 shows the extensibility of the model and the gains brought to the action recognition task by expanding the options.
TABLE 1 extensibility verification
Method Accuracy (%)
Random initialization 62.0
Spatial rotation 64.3
Spatial arrangement 63.4
Spatial rotation + arrangement 66.0
Temporal remote shuffling 67.8
Temporal adjacent shuffling 65.0
Temporal remote + adjacent shuffling 68.0
All operations 69.7
As can be seen from Table 1, self-supervised learning with the spatial rotation or spatial arrangement operation alone improves action recognition accuracy by 2.3% or 1.4% over random initialization. When the two spatial operations (rotation + arrangement) are used together, performance further improves to 66.0%. Self-supervised learning with temporal remote shuffling or temporal adjacent shuffling alone improves performance by 5.8% or 3.0%, and using both temporal operations together further improves it to 68.0%. Combining the spatial and temporal operations finally raises performance to 69.7%, a clear 7.7% improvement over random initialization. The experiments show that these options can be used flexibly, either alone or in combination, and that VCP learns more representative features as richer and complementary options are added.
FIGS. 2A-2E show how well models trained with different self-supervised tasks solve each spatio-temporal classification task; in each graph the abscissa is the number of training iterations and the ordinate is the accuracy. As can be seen, VCP predicts spatial rotation, spatial arrangement and temporal adjacent shuffling with high accuracy (around 90%), much better than the existing methods, and VCP also distinguishes the original video from the temporally remote-shuffled video better than the existing methods. It can also be seen that the VCP accuracy curves for the original video and for temporal remote shuffling are negatively correlated, indicating that these two classes are easily confused and difficult to separate. In contrast, the accuracy curves of the existing methods VCOP and 3D Cubic Puzzle diverge, meaning that neither method can distinguish the original video from the remote-shuffled one. The prior methods ST-Puzzle and S-Puzzle are superior to T-Puzzle and VCOP in classifying the spatial operations, and inferior to them in classifying the temporal operations. These results indicate that spatial representation learning is not consistent with temporal representation learning. VCP can classify both the spatial and the temporal transformation operations, and its recognition results are better than those of the existing methods.
Table 2 compares the action recognition performance of VCP with other models.
TABLE 2 Action recognition performance of self-supervised methods
Method UCF101 accuracy (%) HMDB51 accuracy (%)
Random initialization 61.8 24.7
VCOP 65.6 28.4
VCP 68.5 32.5
As can be seen from Table 2, VCP outperforms random initialization by 6.7% and 7.8% on the UCF101 and HMDB51 datasets respectively, and outperforms the best current method, VCOP, by 2.9% and 4.1%. This performance verifies that VCP learns richer and more discriminative features than the other methods.
Table 3 compares the video retrieval performance of VCP with other models.
TABLE 3 Video retrieval performance of self-supervised methods
Method Top1(%) Top5(%) Top10(%) Top20(%) Top50(%)
Random initialization 16.7 27.5 33.7 41.4 53.0
VCOP 12.5 29.0 39.0 50.6 66.9
VCP 17.3 31.5 42.0 52.6 67.7
As can be seen from Table 3, compared with random initialization and VCOP, VCP performs best on all evaluation metrics, namely the top-1, top-5, top-10, top-20 and top-50 retrieval accuracies, which demonstrates the superiority of VCP.
Fig. 3 shows a visualization of the video retrieval results. For the same query video, the dataset is searched using the features obtained by VCOP and by VCP respectively; a retrieval is counted as successful if the category of the retrieved video is consistent with that of the query video, and the more successful retrievals, the better the performance. The retrieval results of VCP are clearly better than those of VCOP.
The above embodiments are only intended to illustrate the technical solution of the present invention, but not to limit it, and a person skilled in the art can modify the technical solution of the present invention or substitute it with an equivalent, and the protection scope of the present invention is subject to the claims.

Claims (7)

1. A video self-supervision learning method based on a shape-completion gap-filling task is characterized by comprising the following steps:
randomly extracting a video sample clip with a fixed number of frames from an original video, and dividing the clip equally into a plurality of video segments;
selecting one video segment at random with equal probability, leaving its position blank, and taking the remaining unselected video segments as a question stem;
performing a temporal and/or spatial transformation operation on the selected video segment, and filling the transformed video segment back into the blank position;
inputting each video segment into a feature extraction network and extracting the deep network features of all the video segments;
inputting the deep network features into a classifier, judging through the classifier the probability values of the deep network features belonging to different transformation operations, regarding the transformation operation with the maximum probability value as the transformation operation performed on the selected video segment, adjusting the parameters of the feature extraction network and the classifier according to the probability values, and completing the self-supervised learning of the video.
2. The method of claim 1, wherein the spatial transformation operations include spatial rotation and spatial arrangement, the spatial rotation rotating the video frames by a certain angle, and the spatial arrangement cutting each video frame into a plurality of tiles and rearranging the tiles.
3. The method of claim 2, wherein the spatial rotation is a rotation of the video image by 90 degrees, 180 degrees, or 270 degrees.
4. The method of claim 1, wherein the temporal transformation operations comprise temporal remote shuffling, which replaces the selected video segment with a video segment located a time interval before or after it, and temporal adjacent shuffling, which divides the selected video segment into a plurality of sub-segments, at least two of which are randomly swapped once.
5. The method of claim 1, wherein the extracted deep network features of the video segments are concatenated and the concatenated deep network features are input into the classifier.
6. A video self-supervised learning system based on a shape-completion gap-filling task, comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the program comprising instructions for carrying out the steps of the method of any one of claims 1 to 5.
7. A computer-readable storage medium storing a computer program, characterized in that the computer program comprises instructions which, when executed by a processor of a system, cause the system to perform the steps of the method of any of claims 1 to 5.
CN201911348018.XA 2019-12-24 2019-12-24 Video self-supervision learning method based on shape-completion gap-filling task Active CN111091157B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911348018.XA CN111091157B (en) 2019-12-24 2019-12-24 Video self-supervision learning method based on shape-completion gap-filling task

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911348018.XA CN111091157B (en) 2019-12-24 2019-12-24 Video self-supervision learning method based on shape-completion gap-filling task

Publications (2)

Publication Number Publication Date
CN111091157A CN111091157A (en) 2020-05-01
CN111091157B (en) 2023-03-10

Family

ID=70396722

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911348018.XA Active CN111091157B (en) 2019-12-24 2019-12-24 Video self-supervision learning method based on shape-completion gap-filling task

Country Status (1)

Country Link
CN (1) CN111091157B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110070023A (en) * 2019-04-16 2019-07-30 上海极链网络科技有限公司 A kind of self-supervisory learning method and device based on sequence of motion recurrence

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10460765B2 (en) * 2015-08-26 2019-10-29 JBF Interlude 2009 LTD Systems and methods for adaptive and responsive video

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110070023A (en) * 2019-04-16 2019-07-30 上海极链网络科技有限公司 A kind of self-supervisory learning method and device based on sequence of motion recurrence

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Automatic classification method of video clips based on rough sets; Zeng Xiaoning et al.; Journal of Hebei Normal University of Science & Technology; 2009-03-15 (No. 01); full text *

Also Published As

Publication number Publication date
CN111091157A (en) 2020-05-01

Similar Documents

Publication Publication Date Title
CN110781843B (en) Classroom behavior detection method and electronic equipment
Zhu et al. Fine-grained video categorization with redundancy reduction attention
Boenisch et al. Tracking all members of a honey bee colony over their lifetime using learned models of correspondence
CN105590099B (en) A kind of more people's Activity recognition methods based on improvement convolutional neural networks
CN109978893A (en) Training method, device, equipment and the storage medium of image, semantic segmentation network
CN108537119B (en) Small sample video identification method
CN110363131B (en) Abnormal behavior detection method, system and medium based on human skeleton
Mascaro et al. Learning abnormal vessel behaviour from AIS data with Bayesian networks at two time scales
US11640714B2 (en) Video panoptic segmentation
CN110674790B (en) Abnormal scene processing method and system in video monitoring
CN104025117A (en) Temporal face sequences
CN111914778A (en) Video behavior positioning method based on weak supervised learning
CN113269103B (en) Abnormal behavior detection method, system, storage medium and equipment based on space map convolutional network
EP4085374A1 (en) System and method for group activity recognition in images and videos with self-attention mechanisms
US9830533B2 (en) Analyzing and exploring images posted on social media
CN115410119A (en) Violent movement detection method and system based on adaptive generation of training samples
Arinaldi et al. Cheating video description based on sequences of gestures
CN113111716A (en) Remote sensing image semi-automatic labeling method and device based on deep learning
Salem et al. Semantic image inpainting using self-learning encoder-decoder and adversarial loss
CN108416797A (en) A kind of method, equipment and the storage medium of detection Behavioral change
CN111091157B (en) Video self-supervision learning method based on shape-completion gap-filling task
Yin et al. Self-supervised patch localization for cross-domain facial action unit detection
CN109063732B (en) Image ranking method and system based on feature interaction and multi-task learning
CN115240141A (en) Method and system for identifying abnormal behavior of passenger in urban rail station
Pan et al. A deep learning based framework for UAV trajectory pattern recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant