CN112016682B - Video characterization learning and pre-training method and device, electronic equipment and storage medium - Google Patents

Video characterization learning and pre-training method and device, electronic equipment and storage medium

Info

Publication number
CN112016682B
CN112016682B (application CN202010772759.7A)
Authority
CN
China
Prior art keywords
video
center
time sequence
feature
segment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010772759.7A
Other languages
Chinese (zh)
Other versions
CN112016682A (en)
Inventor
王金鹏
王金桥
胡建国
赵朝阳
张海
朱贵波
林格
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nexwise Intelligence China Ltd
Original Assignee
Nexwise Intelligence China Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nexwise Intelligence China Ltd filed Critical Nexwise Intelligence China Ltd
Priority to CN202010772759.7A priority Critical patent/CN112016682B/en
Publication of CN112016682A publication Critical patent/CN112016682A/en
Application granted granted Critical
Publication of CN112016682B publication Critical patent/CN112016682B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/088 - Non-supervised learning, e.g. competitive learning
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/22 - Matching criteria, e.g. proximity measures
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/084 - Backpropagation, e.g. using gradient descent

Abstract

The embodiment of the invention provides a video representation learning and pre-training method and device, an electronic device, and a storage medium, comprising the following steps: collecting a preset first video segment in a sample video as a center, and collecting a second video segment that is in the same time segment as the center but at a different spatial position in the sample video as a positive example; collecting a third video segment that differs from the center as a negative example; extracting a center video feature, a positive example video feature, and a negative example video feature respectively; calculating a time sequence discrimination learning loss from the center video feature, the positive example video feature, and the negative example video feature, increasing the feature similarity of time sequence similar pairs, reducing the feature similarity of time sequence dissimilar pairs, and optimizing the neural network, wherein a time sequence similar pair comprises the center and the positive example, and a time sequence dissimilar pair comprises the center and the negative example. By generating time sequence triplets from a single video, the embodiment of the invention provides an efficient optimization method to improve the representation capability of the model on the video.

Description

Video characterization learning and pre-training method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of video processing technologies, and in particular, to a method and apparatus for learning and pre-training video representation, an electronic device, and a storage medium.
Background
Video representation learning is the most fundamental step in video-related tasks; subsequent tasks such as video understanding and segmentation are all built on the learned video representation. To learn effective video representations, current deep learning methods pre-train a model on a large amount of annotated data and then transfer the model to downstream tasks for fine-tuning. However, data annotation, and video annotation in particular, requires considerable manpower and material resources, so learning video representations from large amounts of unlabeled video is a more practical and efficient technique.
Existing unsupervised video representation methods mainly guide a network to learn video representations through spatial transformations and perturbations of the video frame order. Spatial-transformation methods require the network to predict the appearance of the video before the transformation, while frame-order methods perturb the sequence of video frames and require the network to verify or predict the correct frame order.
MoCo is an existing, efficient unsupervised image representation learning method whose effectiveness has been verified in the image representation field. For the unlabeled samples in a dataset, each sample is treated as its own instance and stored in a cache queue, and a large amount of positive example data is generated through various sample augmentation methods. In the subsequent contrastive learning, the augmented positive examples are compared against the cache queue to retrieve the original sample. Meanwhile, a momentum update strategy is introduced: when the encoder back-propagates, its parameters are updated in momentum form, which prevents the encoder from changing drastically and keeps the representations consistent before and after updates. However, MoCo is only suitable for unsupervised image representation learning and cannot learn video representations efficiently. Existing image data augmentation methods are designed for two-dimensional data, whose visual characteristics can be preserved well; for three-dimensional data structures such as video, however, it is often difficult to define visual characteristics analogous to those of two-dimensional images. Therefore, MoCo can only be applied to representation learning in the image field and is difficult to transfer directly to the video field, and current contrastive learning is mainly applied to image representation learning. Meanwhile, video representation learning and video understanding models share a common problem: the representation learned by the network tends to rely on the spatial background information of the video rather than on the specific time sequence information, so video representations learned from data collected in certain specific scenes often lack generalization and are difficult to apply in other real scenes.
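For background reference only, the momentum update mentioned above can be written as θ_k ← m·θ_k + (1 - m)·θ_q, where m is the momentum coefficient; a minimal sketch follows (the variable names are illustrative and not part of the claimed method):

    # Minimal sketch of a momentum (moving-average) parameter update as used by MoCo-style methods.
    # theta_q: query-encoder parameters; theta_k: key-encoder parameters; m: momentum coefficient.
    def momentum_update(theta_q, theta_k, m=0.999):
        return [m * k + (1.0 - m) * q for q, k in zip(theta_q, theta_k)]

With m close to 1, the key encoder evolves slowly, which is what keeps the cached representations consistent across updates.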
Compared with image representation learning, video representation learning involves additional time-dimension information, so video data are structurally sparser. Consequently, learning an effective video representation usually requires a large amount of annotated data, which is far more expensive than for images. In addition, because the time dimension is directional, a good video representation must be discriminative along the time dimension; current mainstream methods therefore focus on having the network predict or restore the order of video frame sequences, but these methods ignore both the network's dependence on video spatial information and the importance of the time sequence information itself. Owing to the above difficulties, existing unsupervised video representation learning methods struggle to learn useful video representations efficiently.
Disclosure of Invention
The embodiment of the invention provides a video representation learning and pre-training method and device, an electronic device, and a storage medium, which aim to perform video representation learning with unlabeled video data: by generating time sequence triplets from a single video, time sequence similar pairs and time sequence dissimilar pairs are constructed to guide the neural network to learn effective video representations; meanwhile, an efficient optimization method is provided to improve the representation capability of the model on the video.
The embodiment of the invention provides a video representation learning method, which comprises the following steps:
S1: collecting a preset first video segment in a sample video as a center, and collecting a second video segment that is in the same time segment as the center but at a different spatial position in the sample video as a positive example; collecting a third video segment that differs from the center as a negative example;
S2: extracting a center video feature, a positive example video feature, and a negative example video feature according to the first video segment, the second video segment, and the third video segment, respectively;
S3: calculating a time sequence discrimination learning loss for the center video feature, the positive example video feature, and the negative example video feature, increasing the feature similarity of time sequence similar pairs, reducing the feature similarity of time sequence dissimilar pairs, and optimizing a neural network, wherein a time sequence similar pair comprises the center and the positive example, and a time sequence dissimilar pair comprises the center and the negative example.
According to the video characterization learning method of one embodiment of the invention, the negative example is acquired from the sample video, and the sampling formula of the negative example is as follows
|t_a - t_n| > τ
where t_a denotes the start time of the first video segment, t_n denotes the start time of the third video segment, and τ is an interval time constant.
According to the video characterization learning method of one embodiment of the invention, the calculation formula of the time sequence discrimination learning loss is as follows:
where the loss term denotes the time sequence discrimination learning loss function, d denotes the feature similarity with the center, v denotes an input feature, and the subscripts a, n and p denote the center, the negative example and the positive example, respectively.
According to an embodiment of the present invention, before S2 the video characterization learning method includes: reducing the spatial similarity of the positive example by spatially perturbing each frame of the second video segment; and/or, after S3, the method includes: caching the calculated center video features.
The embodiment of the invention provides a video representation learning device, which comprises a time sequence triplet acquisition module, a feature extraction module and a network optimization module, wherein the feature extraction module is respectively connected with the time sequence triplet acquisition module and the network optimization module, and the video representation learning device comprises the following components:
the time sequence triplet collection module is used for collecting a first video segment preset in a sample video as a center, and collecting a second video segment which is in the same time segment and different spatial positions with the center in the sample video as a positive example; collecting a third video segment which is different from the center as a negative example;
the feature extraction module is used for extracting center video features, positive example video features and negative example video features according to the first video segment, the second video segment and the third video segment respectively;
the network optimization module is used for calculating time sequence discrimination learning loss for the center video feature, the positive example video feature and the negative example video feature, improving the feature similarity of a time sequence similarity pair, reducing the feature similarity of a time sequence dissimilarity pair, and optimizing the neural network, wherein the time sequence similarity pair comprises the center and the positive example, and the time sequence dissimilarity pair comprises the center and the negative example.
The video characterization learning device according to one embodiment of the present invention further includes a data enhancement module connected to the feature extraction module, where the data enhancement module is configured to reduce spatial similarity of the positive example by spatially perturbing each frame of the second video segment; and/or, the system also comprises a memory caching module for caching the calculated central video characteristics.
The embodiment of the invention provides a video representation pre-training method, which comprises the following steps:
S4: taking downstream behavior videos as the training part of the video representation learning device, and performing identification by the video representation learning method;
S5: verifying the performance of the video representation learning device through the identification result of S4.
The embodiment of the invention provides a video representation pre-training device, which comprises a video representation learning device, and further comprises an identification module and a verification module, wherein the identification module is used for taking a downstream behavior video as a training part of the video representation learning device to identify through the video representation learning device, and the verification module is used for verifying the performance of the video representation learning device through an identification result of the identification module.
The embodiment of the invention provides an electronic device, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor realizes the steps of the video characterization learning method when executing the program.
Embodiments of the present invention provide a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the video characterization learning method.
The embodiment of the invention provides a novel video representation learning and pre-training method and device, an electronic device, and a storage medium. Time sequence triplets are constructed for contrastive learning so as to learn a time-sequence-discriminative video representation, and a contrastive learning framework is thereby introduced into unsupervised video representation learning. According to the characteristics of video, a time sequence triplet structure is generated from a single video, with a positive example and a negative example constructed around the center, so as to perform efficient video representation learning. The triplet comprises a center, a positive example, and a negative example and is generated by a sampling strategy; its purpose is to make the network learn representations that do not depend on background information and that are consistent with the time sequence information. By generating time sequence triplets from a single video, time sequence similar pairs and time sequence dissimilar pairs are constructed to guide the neural network to learn effective video representations; meanwhile, an efficient optimization method is provided to improve the representation capability of the model on the video.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic flow chart of a video representation learning method according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a video representation learning device according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of a video representation learning device according to an embodiment of the present invention;
FIG. 4 is a schematic flow chart of a video characterization pre-training method according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Reference numerals:
10: a time sequence triplet acquisition module; 20: a feature extraction module; 30: a network optimization module; 40: a data enhancement module; 50: a memory cache module; 810: a processor; 820: a communication interface; 830: a memory; 840: a communication bus.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
A video representation learning method according to an embodiment of the present invention is described below with reference to fig. 1, including:
S1: collecting a preset first video segment in a sample video as a center, and collecting a second video segment that is in the same time segment as the center but at a different spatial position in the sample video as a positive example; a third video segment that does not coincide with the center is acquired as a negative example.
In S1, the positive example is acquired from a different spatial position within the same time slice as the center. Although the spatial positions are not identical, the subject of the video is the same, so when the spatially sampled regions overlap sufficiently, the center and the positive example can be considered to carry identical time sequence information. Because the background and context information within a video has a high probability of remaining similar while the time sequence information keeps changing, the negative example can be seen as a video segment with similar spatial information but inconsistent time sequence information. If the network is able to learn distinct features from video segments with similar backgrounds, it can be considered to have learned time-sequence-related features rather than features that depend too heavily on the background.
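For illustration only, the sampling strategy for the time sequence triplet could be sketched as follows; the clip length, the interval constant, and the function name are assumptions and do not limit the invention:

    # Illustrative sketch of time sequence triplet sampling; parameters and names are assumptions.
    import random

    def sample_triplet(video_length, clip_len=16, tau=32):
        """Return start times (t_a, t_p, t_n) of the center, positive, and negative clips."""
        t_a = random.randint(0, video_length - clip_len)   # center clip
        t_p = t_a                                          # positive: same time segment, different spatial crop
        while True:                                        # assumes the video is long enough for a valid negative
            t_n = random.randint(0, video_length - clip_len)
            if abs(t_a - t_n) > tau:                       # negative: start times separated by more than tau
                return t_a, t_p, t_n

The positive clip shares the center's time segment and differs only in its spatial sampling position, while the negative clip is taken from a time segment separated from the center by more than the interval constant.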
S2: extracting center video features, positive example video features and negative example video features according to the first video segment, the second video segment and the third video segment respectively;
In S2, the center video feature, the positive example video feature, and the negative example video feature are extracted, and the extracted video features can respectively represent the first video segment, the second video segment, and the third video segment.
S3: calculating time sequence discrimination learning loss for the center video feature, the positive example video feature and the negative example video feature, improving feature similarity of time sequence similarity pairs, reducing feature similarity of time sequence dissimilarity pairs, and optimizing a neural network, wherein the time sequence similarity pairs comprise the center and the positive example, and the time sequence dissimilarity pairs comprise the center and the negative example.
In S3, the neural network refers to a deep learning neural network, and optimizing the neural network specifically refers to back-propagating the time sequence discrimination learning loss and updating the parameters of the neural network.
In this embodiment, a time-sequence-discriminative video representation is learned by constructing time sequence triplets for contrastive learning, and a contrastive learning framework is introduced into unsupervised video representation learning. A key point in video representation learning is modeling time sequence information, so this embodiment proposes a time sequence triplet construction method based on a single video. The triplet comprises a center, a positive example, and a negative example and is generated by a sampling strategy; its purpose is to make the network learn representations that do not depend on background information and that are consistent with the time sequence information. By generating time sequence triplets from a single video, time sequence similar pairs and time sequence dissimilar pairs are constructed to guide the neural network to learn an effective video representation; meanwhile, an efficient optimization method is provided to improve the representation capability of the model on the video.
The negative example is acquired from the sample video, and the sampling formula of the negative example is as follows
|t_a - t_n| > τ
where t_a denotes the start time of the first video segment, t_n denotes the start time of the third video segment, and τ is an interval time constant.
To speed up the sampling of time sequence dissimilar pairs, the negative examples sampled in each computation of the time sequence discrimination learning loss, paired with the centers of other samples, can also be regarded as time sequence dissimilar pairs. The negative example may also be sampled from a sample video different from that of the center, but a negative example taken from the same sample video as the center gives a better effect.
The calculation formula of the time sequence discrimination learning loss is as follows:
where the loss term denotes the time sequence discrimination learning loss function, d denotes the feature similarity with the center, v denotes an input feature, and the subscripts a, n and p denote the center, the negative example and the positive example, respectively. After the time sequence discrimination learning loss is computed, the neural network parameters are updated by back-propagating this loss.
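As a non-authoritative sketch, an InfoNCE-style contrastive form over the triplet is assumed below; it is one common instantiation of such a loss and is not necessarily the exact formula claimed, and the temperature parameter is likewise an assumption:

    # Assumed InfoNCE-style time sequence discrimination loss; a sketch, not the claimed formula.
    import math

    def cosine_sim(u, w):
        dot = sum(a * b for a, b in zip(u, w))
        nu = math.sqrt(sum(a * a for a in u))
        nw = math.sqrt(sum(b * b for b in w))
        return dot / (nu * nw + 1e-8)

    def timing_discrimination_loss(v_a, v_p, v_negs, temperature=0.07):
        # v_a: center feature, v_p: positive feature, v_negs: list of negative features.
        pos = math.exp(cosine_sim(v_a, v_p) / temperature)
        negs = sum(math.exp(cosine_sim(v_a, v_n) / temperature) for v_n in v_negs)
        # Minimizing this loss raises the similarity of the (center, positive) pair
        # and lowers the similarity of the (center, negative) pairs.
        return -math.log(pos / (pos + negs))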
The step S2 includes: reducing the spatial similarity of the positive example by spatially perturbing each frame of the second video segment; and/or, after S3, the method includes: caching the calculated center video features. To learn video representations more efficiently, embodiments may employ data enhancement and the memory cache module 50 to cache the centers. The data enhancement spatially perturbs each frame of the video so that the spatial similarity of the positive example video is as low as possible while its original time sequence information remains consistent, where the spatial similarity of the positive example video refers to its spatial similarity to the center; the specific method is to randomly select one background frame from the whole sample video and superimpose it, by interpolation, onto each frame of the video. The center feature calculated each time is added to the memory cache module 50, which improves the efficiency of the next computation of the time sequence discrimination learning loss.
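A minimal sketch of the memory cache is given below, assuming a fixed-size first-in-first-out queue of center features; the queue size and method names are assumptions:

    # Assumed FIFO memory cache for computed center features; size and names are illustrative.
    from collections import deque

    class CenterFeatureCache:
        def __init__(self, max_size=4096):
            self.queue = deque(maxlen=max_size)   # oldest cached features are discarded automatically

        def push(self, center_feature):
            self.queue.append(center_feature)     # cache the center feature computed in this step

        def negatives(self):
            # Cached centers of other samples can serve as additional time sequence dissimilar pairs.
            return list(self.queue)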
The specific working principle of the data enhancement is as follows: since video is a 3D signal, it contains information at two levels, one-dimensional time and two-dimensional space, and the temporal and spatial dimensions are naturally asymmetric. Time information is ambiguous and abstract, and difficult to define and identify. In early methods that classified videos based on hand-crafted video features, inter-frame differences were used to provide useful motion cues. Along these lines, the time derivative can be used to measure changes in the time information. In particular, a video can be regarded as a spatio-temporal function, and its time derivative of any order is unchanged when a time-constant term is added to the function and is scaled uniformly when the function is multiplied by a constant. By delving into video data enhancement, the embodiment of the invention designs a novel and effective data enhancement method for video, Temporal Consistent Augmentation (TCA). TCA avoids the need for real labels and can be extended to self-supervised and semi-supervised learning.
Based on TCA, the data enhancement learning method according to the embodiment of the present invention includes:
mixing a static image into each frame of the sample video according to the scale factor;
the calculation formula of the scale factor is as follows:
where α represents the scale factor; the function I represents the original video; the function Δ represents a randomly selected image frame; t is the time index of the video frame; x and y are the pixel indices of the video frame at time t; and k is the order of the derivative, k being a natural number.
The principle of the scale factor calculation formula is as follows: differentiation of the video with respect to the time dimension can be used to measure the extent and magnitude of changes in the time sequence information. A time sequence scaling effect is therefore introduced into the video. In particular, while preserving the time derivative, additional spatial context (an image) can be introduced into the spatio-temporal function (the video) with a scale factor α so that consistency of any order is maintained. That is, time-differential consistency can be kept by mixing a static image equally into every frame of the video. By selecting an appropriate scale factor, the similarity of the time cues across different spatial contexts is preserved. The scale factor α is uniform over all frames of the sample video, i.e., one fixed image frame is interpolated with every frame of the video.
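For illustration only, assume the mixing takes the linear form x̃(x, y, t) = α·I(x, y, t) + (1 - α)·Δ(x, y); this specific form is an assumption used to make the consistency argument concrete, not a quotation of the patented formula. The preservation of time derivatives of any order can then be written as

    \frac{\partial^{k}}{\partial t^{k}}\Big[\alpha\, I(x,y,t) + (1-\alpha)\,\Delta(x,y)\Big]
      = \alpha\,\frac{\partial^{k} I(x,y,t)}{\partial t^{k}}, \qquad k \ge 1,

because the static image Δ(x, y) does not depend on t and its time derivative vanishes; the temporal variation of the mixed video is therefore the temporal variation of the original video scaled uniformly by α at every order, which is the consistency property described above.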
By introducing video consistency regularization and blending the image into each frame, the spatial distribution of pixels is changed while the similarity of the temporal variation is maintained. Taking the length of a video into account, a 0-1 mask and a global noise image are used. Specifically, the data enhancement learning method further includes:
calculating the video frame at each time of the sample video through the scale factor, where the calculation involves the video frame of video i at time j, the video length L, the 0-1 mask, the global noise, and the scale factor α, consistent with the α described above.
The scale factor α is randomly sampled from a uniform distribution over [0.5, 1], and the 0-1 mask and the global noise have the same size as the first frame image of the sample video.
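A non-authoritative sketch of this per-frame computation follows, assuming (as above) a linear mix of each frame with the masked global noise; the array shapes, names, and exact combination rule are assumptions:

    # Assumed per-frame mixing with a 0-1 mask and a global noise image; a sketch only.
    import numpy as np

    def mix_frame(frame, noise, mask, alpha):
        """frame, noise: (H, W, C) arrays; mask: (H, W) 0/1 array; alpha: scale factor in [0.5, 1]."""
        m = mask[..., None]                              # broadcast the mask over the channel axis
        return alpha * frame + (1.0 - alpha) * (m * noise)

    def augment_video(video, noise, mask, alpha):
        # The same alpha, mask, and noise are applied to every frame,
        # so the temporal changes of the video are scaled uniformly.
        return np.stack([mix_frame(f, noise, mask, alpha) for f in video])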
The data enhancement learning method specifically further includes the following steps:
randomly selecting a preset area of fixed size and setting the mask values inside the preset area to 0, where the preset area covers no more than 0.1 of the whole static image area;
The random selection follows uniform sampling. By setting the mask values of the selected preset area to 0, the corresponding pixels are erased.
setting all elements of the mask to 1 and randomly selecting an image frame from the sample video as the global noise;
With the mask set entirely to 1, no erasing operation is performed.
setting the mask to 1 for all frames and randomly selecting a frame from a video other than the sample video as the global noise. Specifically, frames of videos other than the sample video can be selected from the mini-batch during training, or a frame of an arbitrary video can be randomly selected as the global noise.
Choosing the global noise both across samples and within a sample greatly enriches the diversity of spatial contexts. In the present invention, Temporal Consistent Augmentation (TCA) is a cascade of these three data enhancements, i.e., the three steps are executed in sequence.
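An illustrative cascade of the three steps is sketched below, under the same assumptions as the per-frame mixing sketch above; the erased-area ratio, sampling ranges, and helper names are hypothetical:

    # Assumed TCA cascade: random erasing on the static image, intra-video noise, then cross-video noise.
    import random
    import numpy as np

    def random_erase(mask, max_area_ratio=0.1):
        # Zero out a randomly placed rectangle covering at most max_area_ratio of the mask.
        h, w = mask.shape
        area = max(1, int(h * w * random.uniform(0.0, max_area_ratio)))
        eh = min(h, max(1, int(np.sqrt(area))))
        ew = min(w, max(1, area // eh))
        y0, x0 = random.randint(0, h - eh), random.randint(0, w - ew)
        out = mask.copy()
        out[y0:y0 + eh, x0:x0 + ew] = 0.0
        return out

    def tca(video, other_video, mix_frame_fn):
        # video, other_video: (T, H, W, C) arrays; mix_frame_fn as in the sketch above.
        h, w = video.shape[1:3]
        alpha = random.uniform(0.5, 1.0)                   # scale factor drawn from U[0.5, 1]
        # Step 1: random erasing expressed through the 0-1 mask.
        mask = random_erase(np.ones((h, w), dtype=np.float32))
        # Step 2: intra-video global noise, a randomly chosen frame of the same video.
        noise = video[random.randrange(len(video))]
        out = np.stack([mix_frame_fn(f, noise, mask, alpha) for f in video])
        # Step 3: cross-video global noise, a randomly chosen frame of another video, with a full mask.
        full_mask = np.ones((h, w), dtype=np.float32)
        noise2 = other_video[random.randrange(len(other_video))]
        return np.stack([mix_frame_fn(f, noise2, full_mask, alpha) for f in out])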
Further, the learning of the whole model can be guided by the consistency between the original sample and the generated sample. The embodiment of the invention therefore provides a data enhancement training method, in which the generated sample is obtained by the data enhancement method described above, and which further includes: training the consistency between the generated sample and the sample video through deep learning.
Training the consistency between the generated sample and the sample video through deep learning specifically includes the following steps:
randomly shuffling all sample videos in the training set, and taking batches of data from the shuffled sample videos; the training set contains a plurality of sample videos, and the random shuffling is realized by uniformly distributed sampling.
Randomly shuffling the batched data, and performing the data enhancement learning on each sample video to obtain a generated sample; the random shuffling is realized by uniformly distributed sampling.
Inputting the sample video and the generated sample respectively into a training model to obtain two output values, measuring the difference between the two output values with a squared loss function, and performing gradient descent on the training model based on this difference. The training model refers to a deep learning neural network, and the finally learned model is more sensitive to time sequence information.
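A hedged sketch of this consistency training step is given below; the use of PyTorch, the encoder, the optimizer, and the batch layout are assumptions and not specified by the invention:

    # Illustrative consistency-training step with a squared loss; encoder and optimizer are placeholders.
    import torch
    import torch.nn.functional as F

    def consistency_train_step(encoder, optimizer, batch_videos, augment_fn):
        # batch_videos: float tensor of shape (B, T, C, H, W); augment_fn applies the data enhancement.
        encoder.train()
        generated = torch.stack([augment_fn(v) for v in batch_videos])   # generated (enhanced) samples
        out_orig = encoder(batch_videos)
        out_aug = encoder(generated)
        loss = F.mse_loss(out_orig, out_aug)     # squared loss between the two output values
        optimizer.zero_grad()
        loss.backward()                          # gradient descent on the training model
        optimizer.step()
        return loss.item()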
The novel video data enhancement method TCA is used to guide the learning objective of the whole neural network, and TCA can be easily integrated into any neural network. In addition, TCA can be realized through simple matrix operations at a very small computational cost; the method of the embodiment of the invention achieves the best effect on three datasets, which verifies the effectiveness of the data enhancement method.
As shown in fig. 2, an embodiment of the present invention provides a video representation learning device, which includes a time sequence triplet collection module 10, a feature extraction module 20, and a network optimization module 30, wherein the feature extraction module 20 is respectively connected with the time sequence triplet collection module 10 and the network optimization module 30, and the video representation learning device includes:
the time sequence triplet collection module 10 is used for collecting a first video segment preset in a sample video as a center, and collecting a second video segment which is in the same time segment and different spatial positions with the center in the sample video as a positive example; collecting a third video segment which is different from the center as a negative example;
the feature extraction module 20 is configured to extract a center video feature, a positive example video feature, and a negative example video feature according to the first video segment, the second video segment, and the third video segment, respectively;
the network optimization module 30 is configured to calculate a time sequence discrimination learning loss for the center video feature, the positive example video feature, and the negative example video feature, to improve feature similarity of a time sequence similarity pair, to reduce feature similarity of a time sequence dissimilarity pair, and to optimize a neural network, where the time sequence similarity pair includes the center and the positive example, and the time sequence dissimilarity pair includes the center and the negative example.
As shown in fig. 3, the video characterization learning device further includes a data enhancement module 40 connected to the feature extraction module 20, where the data enhancement module 40 is configured to reduce the spatial similarity of the positive examples by spatially perturbing each frame of the second video segment; and/or further comprises a memory caching module 50 for caching the computed central video features.
The working principle of the video characterization learning device in this embodiment is corresponding to the video characterization learning method in the foregoing embodiment, and will not be described in detail here.
As shown in fig. 4, an embodiment of the present invention provides a video characterization pre-training method, including:
S4: taking downstream behavior videos as the training part of the video representation learning device, and performing identification by the video representation learning method;
S5: verifying the performance of the video representation learning device through the identification result of S4.
The method provides a new unsupervised video representation learning approach that guides a neural network to learn efficient video representations from a large amount of unlabeled video data for downstream video-related tasks; that is, the deep learning neural network obtained by the video representation learning method is used as a pre-training model. The method achieves the best effect on three behavior recognition task datasets, and performance comparable to supervised learning can be reached with only thousands of unlabeled samples, which verifies the effectiveness of the unsupervised video representation learning method.
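A minimal sketch of using the pre-trained encoder for a downstream behavior recognition task follows; the classifier head, feature dimension, and class count are assumptions:

    # Illustrative downstream fine-tuning of the pre-trained video encoder; names and sizes are placeholders.
    import torch.nn as nn

    class ActionRecognizer(nn.Module):
        def __init__(self, pretrained_encoder, feature_dim=512, num_classes=101):
            super().__init__()
            self.encoder = pretrained_encoder        # encoder obtained from unsupervised pre-training
            self.head = nn.Linear(feature_dim, num_classes)

        def forward(self, clips):                    # clips: (B, C, T, H, W)
            features = self.encoder(clips)           # (B, feature_dim)
            return self.head(features)

    # Fine-tuning then uses an ordinary supervised objective on the downstream labels, e.g.
    # logits = model(clips); loss = nn.functional.cross_entropy(logits, labels).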
The embodiment of the invention provides a video representation pre-training device, which comprises the video representation learning device and further comprises an identification module and a verification module, wherein the identification module is configured to take downstream behavior videos as the training part of the video representation learning device and perform identification through the video representation learning device, and the verification module is configured to verify the performance of the video representation learning device through the identification result of the identification module.
The working principle of the video characterization pre-training device in this embodiment is corresponding to that of the video characterization pre-training method in the foregoing embodiment, and will not be described in detail here.
Fig. 5 illustrates a physical schematic diagram of an electronic device, which may include: processor 810, communication interface (Communications Interface) 820, memory 830, and communication bus 840, wherein processor 810, communication interface 820, memory 830 accomplish communication with each other through communication bus 840. The processor 810 may invoke logic instructions in the memory 830 to perform a video characterization learning method comprising:
S1: collecting a preset first video segment in a sample video as a center, and collecting a second video segment that is in the same time segment as the center but at a different spatial position in the sample video as a positive example; collecting a third video segment that differs from the center as a negative example;
S2: extracting a center video feature, a positive example video feature, and a negative example video feature according to the first video segment, the second video segment, and the third video segment, respectively;
S3: calculating a time sequence discrimination learning loss for the center video feature, the positive example video feature, and the negative example video feature, increasing the feature similarity of time sequence similar pairs, reducing the feature similarity of time sequence dissimilar pairs, and optimizing a neural network, wherein a time sequence similar pair comprises the center and the positive example, and a time sequence dissimilar pair comprises the center and the negative example.
Further, the logic instructions in the memory 830 described above may be implemented in the form of software functional units and may be stored in a computer-readable storage medium when sold or used as a stand-alone product. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium and comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or other media capable of storing program code.
In another aspect, embodiments of the present invention also provide a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform a video characterization learning method provided by the above method embodiments, the method comprising:
S1: collecting a preset first video segment in a sample video as a center, and collecting a second video segment that is in the same time segment as the center but at a different spatial position in the sample video as a positive example; collecting a third video segment that differs from the center as a negative example;
S2: extracting a center video feature, a positive example video feature, and a negative example video feature according to the first video segment, the second video segment, and the third video segment, respectively;
S3: calculating a time sequence discrimination learning loss for the center video feature, the positive example video feature, and the negative example video feature, increasing the feature similarity of time sequence similar pairs, reducing the feature similarity of time sequence dissimilar pairs, and optimizing a neural network, wherein a time sequence similar pair comprises the center and the positive example, and a time sequence dissimilar pair comprises the center and the negative example.
In yet another aspect, embodiments of the present invention further provide a non-transitory computer readable storage medium having stored thereon a computer program, which when executed by a processor is implemented to perform a video characterization learning method provided by the above embodiments, the method comprising:
S1: collecting a preset first video segment in a sample video as a center, and collecting a second video segment that is in the same time segment as the center but at a different spatial position in the sample video as a positive example; collecting a third video segment that differs from the center as a negative example;
S2: extracting a center video feature, a positive example video feature, and a negative example video feature according to the first video segment, the second video segment, and the third video segment, respectively;
S3: calculating a time sequence discrimination learning loss for the center video feature, the positive example video feature, and the negative example video feature, increasing the feature similarity of time sequence similar pairs, reducing the feature similarity of time sequence dissimilar pairs, and optimizing a neural network, wherein a time sequence similar pair comprises the center and the positive example, and a time sequence dissimilar pair comprises the center and the negative example.
The apparatus embodiments described above are merely illustrative, wherein the elements illustrated as separate elements may or may not be physically separate, and the elements shown as elements may or may not be physical elements, may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art will understand and implement the present invention without undue burden.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (9)

1. A method of video characterization learning, comprising:
s1: collecting a first video segment preset in a sample video as a center, and collecting a second video segment which is in the same time segment and different spatial positions with the center in the sample video as a positive example; collecting a third video segment which is different from the center as a negative example;
s2: extracting center video features, positive example video features and negative example video features according to the first video segment, the second video segment and the third video segment respectively;
s3: calculating time sequence discrimination learning loss for the center video feature, the positive example video feature and the negative example video feature, improving the feature similarity of time sequence similarity pairs, reducing the feature similarity of time sequence dissimilarity pairs, and optimizing a neural network, wherein the time sequence similarity pairs comprise the center and the positive example, and the time sequence dissimilarity pairs comprise the center and the negative example;
the calculation formula of the time sequence discrimination learning loss is as follows:
wherein the loss term represents the time sequence discrimination learning loss function, the function d represents feature similarity with the center, v represents an input feature, and a, n and p represent the center, the negative example and the positive example, respectively;
the step S2 includes: the spatial similarity of the positive examples is reduced by spatially perturbing each frame of the second video segment.
2. The method according to claim 1, wherein the negative example is obtained from the sample video, the sampling formula of the negative example being
|t_a - t_n| > τ
wherein t_a represents the start time of the first video segment, t_n represents the start time of the third video segment, and τ is an interval time constant.
3. The video characterization learning method according to claim 1, wherein the S3 thereafter comprises: and caching the calculated central video features.
4. The video representation learning device is characterized by comprising a time sequence triplet acquisition module, a feature extraction module and a network optimization module, wherein the feature extraction module is respectively connected with the time sequence triplet acquisition module and the network optimization module, and the video representation learning device comprises the following components:
the time sequence triplet collection module is used for collecting a first video segment preset in a sample video as a center, and collecting a second video segment which is in the same time segment and different spatial positions with the center in the sample video as a positive example; collecting a third video segment which is different from the center as a negative example;
the feature extraction module is used for extracting center video features, positive example video features and negative example video features according to the first video segment, the second video segment and the third video segment respectively;
the network optimization module is used for calculating time sequence discrimination learning loss for the center video feature, the positive example video feature and the negative example video feature, improving the feature similarity of time sequence similarity pairs, reducing the feature similarity of time sequence dissimilarity pairs, and optimizing a neural network, wherein the time sequence similarity pairs comprise the center and the positive example, and the time sequence dissimilarity pairs comprise the center and the negative example;
the calculation formula of the time sequence discrimination learning loss is as follows:
wherein the loss term represents the time sequence discrimination learning loss function, the function d represents feature similarity with the center, v represents an input feature, and a, n and p represent the center, the negative example and the positive example, respectively;
the system further comprises a data enhancement module connected with the feature extraction module, wherein the data enhancement module is used for reducing the spatial similarity of the positive examples by performing spatial disturbance on each frame of the second video segment.
5. The video characterization learning device of claim 4, further comprising a memory caching module for caching the computed center video features.
6. A method of video characterization pre-training, comprising:
S4: taking a downstream behavior video as a training part of the video representation learning device, and performing identification by the video representation learning method according to any one of claims 1-3;
S5: verifying the performance of the video representation learning device through the identification result of S4.
7. A video characterization pre-training device, comprising the video characterization learning device according to claim 4 or 5, and further comprising an identification module and a verification module, wherein the identification module is used for taking a downstream behavior video as a training part of the video characterization learning device to be identified by the video characterization learning device, and the verification module is used for verifying the performance of the video characterization learning device through the identification result of the identification module.
8. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the video characterization learning method of any of claims 1-3 when the program is executed by the processor.
9. A non-transitory computer readable storage medium having stored thereon a computer program, which when executed by a processor, implements the steps of the video representation learning method of any of claims 1-3.
CN202010772759.7A 2020-08-04 2020-08-04 Video characterization learning and pre-training method and device, electronic equipment and storage medium Active CN112016682B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010772759.7A CN112016682B (en) 2020-08-04 2020-08-04 Video characterization learning and pre-training method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010772759.7A CN112016682B (en) 2020-08-04 2020-08-04 Video characterization learning and pre-training method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112016682A CN112016682A (en) 2020-12-01
CN112016682B true CN112016682B (en) 2024-01-26

Family

ID=73500163

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010772759.7A Active CN112016682B (en) 2020-08-04 2020-08-04 Video characterization learning and pre-training method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112016682B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112561060B (en) * 2020-12-15 2022-03-22 北京百度网讯科技有限公司 Neural network training method and device, image recognition method and device and equipment
CN113723607A (en) * 2021-06-02 2021-11-30 京东城市(北京)数字科技有限公司 Training method, device and equipment of space-time data processing model and storage medium
CN113837260A (en) * 2021-09-17 2021-12-24 北京百度网讯科技有限公司 Model training method, object matching method, device and electronic equipment
CN113569824B (en) * 2021-09-26 2021-12-17 腾讯科技(深圳)有限公司 Model processing method, related device, storage medium and computer program product
CN114596312B (en) * 2022-05-07 2022-08-02 中国科学院深圳先进技术研究院 Video processing method and device
CN115115972A (en) * 2022-05-25 2022-09-27 腾讯科技(深圳)有限公司 Video processing method, video processing apparatus, computer device, medium, and program product

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106991372A (en) * 2017-03-02 2017-07-28 北京工业大学 A kind of dynamic gesture identification method based on interacting depth learning model
CN108122234A (en) * 2016-11-29 2018-06-05 北京市商汤科技开发有限公司 Convolutional neural networks training and method for processing video frequency, device and electronic equipment
CN109948561A (en) * 2019-03-25 2019-06-28 广东石油化工学院 The method and system that unsupervised image/video pedestrian based on migration network identifies again
CN110263697A (en) * 2019-06-17 2019-09-20 哈尔滨工业大学(深圳) Pedestrian based on unsupervised learning recognition methods, device and medium again
KR20190125029A (en) * 2018-04-27 2019-11-06 성균관대학교산학협력단 Methods and apparatuses for generating text to video based on time series adversarial neural network
CN110689483A (en) * 2019-09-24 2020-01-14 重庆邮电大学 Image super-resolution reconstruction method based on depth residual error network and storage medium
CN111079646A (en) * 2019-12-16 2020-04-28 中山大学 Method and system for positioning weak surveillance video time sequence action based on deep learning
CN111160380A (en) * 2018-11-07 2020-05-15 华为技术有限公司 Method for generating video analysis model and video analysis system
CN111274445A (en) * 2020-01-20 2020-06-12 山东建筑大学 Similar video content retrieval method and system based on triple deep learning
CN111444826A (en) * 2020-03-25 2020-07-24 腾讯科技(深圳)有限公司 Video detection method and device, storage medium and computer equipment

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108122234A (en) * 2016-11-29 2018-06-05 北京市商汤科技开发有限公司 Convolutional neural networks training and method for processing video frequency, device and electronic equipment
CN106991372A (en) * 2017-03-02 2017-07-28 北京工业大学 A kind of dynamic gesture identification method based on interacting depth learning model
KR20190125029A (en) * 2018-04-27 2019-11-06 성균관대학교산학협력단 Methods and apparatuses for generating text to video based on time series adversarial neural network
CN111160380A (en) * 2018-11-07 2020-05-15 华为技术有限公司 Method for generating video analysis model and video analysis system
CN109948561A (en) * 2019-03-25 2019-06-28 广东石油化工学院 The method and system that unsupervised image/video pedestrian based on migration network identifies again
CN110263697A (en) * 2019-06-17 2019-09-20 哈尔滨工业大学(深圳) Pedestrian based on unsupervised learning recognition methods, device and medium again
CN110689483A (en) * 2019-09-24 2020-01-14 重庆邮电大学 Image super-resolution reconstruction method based on depth residual error network and storage medium
CN111079646A (en) * 2019-12-16 2020-04-28 中山大学 Method and system for positioning weak surveillance video time sequence action based on deep learning
CN111274445A (en) * 2020-01-20 2020-06-12 山东建筑大学 Similar video content retrieval method and system based on triple deep learning
CN111444826A (en) * 2020-03-25 2020-07-24 腾讯科技(深圳)有限公司 Video detection method and device, storage medium and computer equipment

Also Published As

Publication number Publication date
CN112016682A (en) 2020-12-01

Similar Documents

Publication Publication Date Title
CN112016682B (en) Video characterization learning and pre-training method and device, electronic equipment and storage medium
US11200424B2 (en) Space-time memory network for locating target object in video content
CN111444878B (en) Video classification method, device and computer readable storage medium
CN111192292B (en) Target tracking method and related equipment based on attention mechanism and twin network
Gao et al. Domain-adaptive crowd counting via high-quality image translation and density reconstruction
CN108241854B (en) Depth video saliency detection method based on motion and memory information
CN111968123B (en) Semi-supervised video target segmentation method
Chen et al. Learning linear regression via single-convolutional layer for visual object tracking
CN113239869B (en) Two-stage behavior recognition method and system based on key frame sequence and behavior information
CN116686017A (en) Time bottleneck attention architecture for video action recognition
CN112560827B (en) Model training method, model training device, model prediction method, electronic device, and medium
CN110399826B (en) End-to-end face detection and identification method
CN110827312A (en) Learning method based on cooperative visual attention neural network
CN111742345A (en) Visual tracking by coloring
CN113159023A (en) Scene text recognition method based on explicit supervision mechanism
GB2579262A (en) Space-time memory network for locating target object in video content
CN112818904A (en) Crowd density estimation method and device based on attention mechanism
CN113988147A (en) Multi-label classification method and device for remote sensing image scene based on graph network, and multi-label retrieval method and device
Zhang et al. Unsupervised depth estimation from monocular videos with hybrid geometric-refined loss and contextual attention
CN114996495A (en) Single-sample image segmentation method and device based on multiple prototypes and iterative enhancement
CN113222903A (en) Full-section histopathology image analysis method and system
CN110942463B (en) Video target segmentation method based on generation countermeasure network
CN116740362A (en) Attention-based lightweight asymmetric scene semantic segmentation method and system
Wang et al. SCNet: Scale-aware coupling-structure network for efficient video object detection
CN116758449A (en) Video salient target detection method and system based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant