CN112016682B - Video characterization learning and pre-training method and device, electronic equipment and storage medium - Google Patents

Video characterization learning and pre-training method and device, electronic equipment and storage medium

Info

Publication number
CN112016682B
CN112016682B (application CN202010772759.7A)
Authority
CN
China
Prior art keywords
video
center
time sequence
feature
segment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010772759.7A
Other languages
Chinese (zh)
Other versions
CN112016682A (en)
Inventor
王金鹏
王金桥
胡建国
赵朝阳
张海
朱贵波
林格
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nexwise Intelligence China Ltd
Original Assignee
Nexwise Intelligence China Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nexwise Intelligence China Ltd filed Critical Nexwise Intelligence China Ltd
Priority to CN202010772759.7A priority Critical patent/CN112016682B/en
Publication of CN112016682A publication Critical patent/CN112016682A/en
Application granted granted Critical
Publication of CN112016682B publication Critical patent/CN112016682B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/088 - Non-supervised learning, e.g. competitive learning
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/22 - Matching criteria, e.g. proximity measures
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/084 - Backpropagation, e.g. using gradient descent

Abstract

The embodiment of the invention provides a video representation learning and pre-training method and device, an electronic device, and a storage medium, comprising the following steps: collecting a preset first video segment in a sample video as a center, and collecting a second video segment that is in the same time segment as the center but at a different spatial position in the sample video as a positive example; collecting a third video segment that differs from the center as a negative example; extracting a center video feature, a positive example video feature, and a negative example video feature respectively; calculating a time sequence discrimination learning loss from the center video feature, the positive example video feature, and the negative example video feature, increasing the feature similarity of time sequence similar pairs, reducing the feature similarity of time sequence dissimilar pairs, and optimizing the neural network, wherein a time sequence similar pair comprises the center and the positive example, and a time sequence dissimilar pair comprises the center and the negative example. By generating time sequence triplets from a single video, the embodiment of the invention provides an efficient optimization method to improve the representation capability of the model on the video.

Description

Video characterization learning and pre-training method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of video processing technologies, and in particular, to a method and apparatus for learning and pre-training video representation, an electronic device, and a storage medium.
Background
Video representation learning is the most fundamental step in video-related tasks; subsequent tasks such as video understanding and segmentation are all built on the learned video representation. To learn effective video representations, current deep learning methods pre-train a model on a large amount of annotated data and then transfer the model to downstream tasks for fine-tuning. However, data annotation, and video annotation in particular, requires considerable manpower and material resources, so learning video representations from large amounts of unlabeled video is a more practical and efficient technique.
Existing unsupervised video representation methods mainly guide a network to learn video representations through spatial transformations and perturbations of the video frame order. Spatial-transformation methods require the network to predict the appearance of the video before the transformation, while frame-order methods perturb the sequence of video frames and require the network to verify or predict the correct frame order.
MoCo is an existing, efficient unsupervised image representation learning method whose effectiveness has been verified in the image representation field. For the unlabeled samples in a dataset, each sample is treated as its own instance and stored in a cache queue, and a large amount of positive example data is generated through various sample augmentation methods. In the subsequent contrastive learning, the augmented positive examples are compared against the cache queue to retrieve the original sample. Meanwhile, a momentum update strategy is introduced: when the encoder back-propagates, its parameters are updated in momentum form, which prevents the encoder from changing drastically and keeps the representations consistent before and after updates. However, MoCo is only suitable for unsupervised image representation learning and cannot learn video representations efficiently. Existing image data augmentation methods are designed for two-dimensional data, whose visual characteristics can be preserved well; for three-dimensional data structures such as video, however, it is often difficult to define visual characteristics analogous to those of two-dimensional images. Therefore, MoCo can only be applied to representation learning in the image field and is difficult to transfer directly to the video field, and current contrastive learning is mainly applied to image representation learning. Meanwhile, video representation learning and video understanding models share a common problem: the representation learned by the network tends to rely on the spatial background information of the video rather than on the specific time sequence information, so video representations learned from data collected in certain specific scenes often lack generalization and are difficult to apply in other real scenes.
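For background reference only, the momentum update mentioned above can be written as θ_k ← m·θ_k + (1 - m)·θ_q, where m is the momentum coefficient; a minimal sketch follows (the variable names are illustrative and not part of the claimed method):

    # Minimal sketch of a momentum (moving-average) parameter update as used by MoCo-style methods.
    # theta_q: query-encoder parameters; theta_k: key-encoder parameters; m: momentum coefficient.
    def momentum_update(theta_q, theta_k, m=0.999):
        return [m * k + (1.0 - m) * q for q, k in zip(theta_q, theta_k)]

With m close to 1, the key encoder evolves slowly, which is what keeps the cached representations consistent across updates.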
Compared with image representation learning, video representation learning involves additional time-dimension information, so video data are structurally sparser. Consequently, learning an effective video representation usually requires a large amount of annotated data, which is far more expensive than for images. In addition, because the time dimension is directional, a good video representation must be discriminative along the time dimension; current mainstream methods therefore focus on having the network predict or restore the order of video frame sequences, but these methods ignore both the network's dependence on video spatial information and the importance of the time sequence information itself. Owing to the above difficulties, existing unsupervised video representation learning methods struggle to learn useful video representations efficiently.
Disclosure of Invention
The embodiment of the invention provides a video representation learning and pre-training method and device, an electronic device, and a storage medium, which aim to perform video representation learning with unlabeled video data: by generating time sequence triplets from a single video, time sequence similar pairs and time sequence dissimilar pairs are constructed to guide the neural network to learn effective video representations; meanwhile, an efficient optimization method is provided to improve the representation capability of the model on the video.
The embodiment of the invention provides a video representation learning method, which comprises the following steps:
S1: collecting a preset first video segment in a sample video as a center, and collecting a second video segment that is in the same time segment as the center but at a different spatial position in the sample video as a positive example; collecting a third video segment that differs from the center as a negative example;
S2: extracting a center video feature, a positive example video feature, and a negative example video feature according to the first video segment, the second video segment, and the third video segment, respectively;
S3: calculating a time sequence discrimination learning loss for the center video feature, the positive example video feature, and the negative example video feature, increasing the feature similarity of time sequence similar pairs, reducing the feature similarity of time sequence dissimilar pairs, and optimizing a neural network, wherein a time sequence similar pair comprises the center and the positive example, and a time sequence dissimilar pair comprises the center and the negative example.
According to the video characterization learning method of one embodiment of the invention, the negative example is acquired from the sample video, and the sampling formula of the negative example is as follows
|t_a - t_n| > τ
where t_a denotes the start time of the first video segment, t_n denotes the start time of the third video segment, and τ is an interval time constant.
According to the video characterization learning method of one embodiment of the invention, the calculation formula of the time sequence discrimination learning loss is as follows:
where the loss term denotes the time sequence discrimination learning loss function, d denotes the feature similarity with the center, v denotes an input feature, and the subscripts a, n and p denote the center, the negative example and the positive example, respectively.
According to an embodiment of the present invention, before S2 the video characterization learning method includes: reducing the spatial similarity of the positive example by spatially perturbing each frame of the second video segment; and/or, after S3, the method includes: caching the calculated center video features.
The embodiment of the invention provides a video representation learning device, which comprises a time sequence triplet acquisition module, a feature extraction module and a network optimization module, wherein the feature extraction module is respectively connected with the time sequence triplet acquisition module and the network optimization module, and the video representation learning device comprises the following components:
the time sequence triplet collection module is used for collecting a first video segment preset in a sample video as a center, and collecting a second video segment which is in the same time segment and different spatial positions with the center in the sample video as a positive example; collecting a third video segment which is different from the center as a negative example;
the feature extraction module is used for extracting center video features, positive example video features and negative example video features according to the first video segment, the second video segment and the third video segment respectively;
the network optimization module is used for calculating time sequence discrimination learning loss for the center video feature, the positive example video feature and the negative example video feature, improving the feature similarity of a time sequence similarity pair, reducing the feature similarity of a time sequence dissimilarity pair, and optimizing the neural network, wherein the time sequence similarity pair comprises the center and the positive example, and the time sequence dissimilarity pair comprises the center and the negative example.
The video characterization learning device according to one embodiment of the present invention further includes a data enhancement module connected to the feature extraction module, where the data enhancement module is configured to reduce spatial similarity of the positive example by spatially perturbing each frame of the second video segment; and/or, the system also comprises a memory caching module for caching the calculated central video characteristics.
The embodiment of the invention provides a video representation pre-training method, which comprises the following steps:
S4: taking downstream behavior videos as the training part of the video representation learning device, and performing identification by the video representation learning method;
S5: verifying the performance of the video representation learning device through the identification result of S4.
The embodiment of the invention provides a video representation pre-training device, which comprises a video representation learning device, and further comprises an identification module and a verification module, wherein the identification module is used for taking a downstream behavior video as a training part of the video representation learning device to identify through the video representation learning device, and the verification module is used for verifying the performance of the video representation learning device through an identification result of the identification module.
The embodiment of the invention provides an electronic device, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor realizes the steps of the video characterization learning method when executing the program.
Embodiments of the present invention provide a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the video characterization learning method.
The embodiment of the invention provides a novel video representation learning and pre-training method and device, an electronic device, and a storage medium. Time sequence triplets are constructed for contrastive learning so as to learn a time-sequence-discriminative video representation, and a contrastive learning framework is thereby introduced into unsupervised video representation learning. According to the characteristics of video, a time sequence triplet structure is generated from a single video, with a positive example and a negative example constructed around the center, so as to perform efficient video representation learning. The triplet comprises a center, a positive example, and a negative example and is generated by a sampling strategy; its purpose is to make the network learn representations that do not depend on background information and that are consistent with the time sequence information. By generating time sequence triplets from a single video, time sequence similar pairs and time sequence dissimilar pairs are constructed to guide the neural network to learn effective video representations; meanwhile, an efficient optimization method is provided to improve the representation capability of the model on the video.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic flow chart of a video representation learning method according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a video representation learning device according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of a video representation learning device according to an embodiment of the present invention;
FIG. 4 is a schematic flow chart of a video characterization pre-training method according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Reference numerals:
10: a time sequence triplet acquisition module; 20: a feature extraction module; 30: a network optimization module; 40: a data enhancement module; 50: a memory cache module; 810: a processor; 820: a communication interface; 830: a memory; 840: a communication bus.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
A video representation learning method according to an embodiment of the present invention is described below with reference to fig. 1, including:
S1: collecting a preset first video segment in a sample video as a center, and collecting a second video segment that is in the same time segment as the center but at a different spatial position in the sample video as a positive example; a third video segment that does not coincide with the center is acquired as a negative example.
In S1, the positive example is acquired from a different spatial position within the same time slice as the center. Although the spatial positions are not identical, the subject of the video is the same, so when the spatially sampled regions overlap sufficiently, the center and the positive example can be considered to carry identical time sequence information. Because the background and context information within a video has a high probability of remaining similar while the time sequence information keeps changing, the negative example can be seen as a video segment with similar spatial information but inconsistent time sequence information. If the network is able to learn distinct features from video segments with similar backgrounds, it can be considered to have learned time-sequence-related features rather than features that depend too heavily on the background.
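For illustration only, the sampling strategy for the time sequence triplet could be sketched as follows; the clip length, the interval constant, and the function name are assumptions and do not limit the invention:

    # Illustrative sketch of time sequence triplet sampling; parameters and names are assumptions.
    import random

    def sample_triplet(video_length, clip_len=16, tau=32):
        """Return start times (t_a, t_p, t_n) of the center, positive, and negative clips."""
        t_a = random.randint(0, video_length - clip_len)   # center clip
        t_p = t_a                                          # positive: same time segment, different spatial crop
        while True:                                        # assumes the video is long enough for a valid negative
            t_n = random.randint(0, video_length - clip_len)
            if abs(t_a - t_n) > tau:                       # negative: start times separated by more than tau
                return t_a, t_p, t_n

The positive clip shares the center's time segment and differs only in its spatial sampling position, while the negative clip is taken from a time segment separated from the center by more than the interval constant.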
S2: extracting center video features, positive example video features and negative example video features according to the first video segment, the second video segment and the third video segment respectively;
In S2, the center video feature, the positive example video feature, and the negative example video feature are extracted, and the extracted video features can respectively represent the first video segment, the second video segment, and the third video segment.
S3: calculating time sequence discrimination learning loss for the center video feature, the positive example video feature and the negative example video feature, improving feature similarity of time sequence similarity pairs, reducing feature similarity of time sequence dissimilarity pairs, and optimizing a neural network, wherein the time sequence similarity pairs comprise the center and the positive example, and the time sequence dissimilarity pairs comprise the center and the negative example.
In S3, the neural network refers to a deep learning neural network, and optimizing the neural network specifically refers to back-propagating the time sequence discrimination learning loss and updating the parameters of the neural network.
In this embodiment, a time-sequence-discriminative video representation is learned by constructing time sequence triplets for contrastive learning, and a contrastive learning framework is introduced into unsupervised video representation learning. A key point in video representation learning is modeling time sequence information, so this embodiment proposes a time sequence triplet construction method based on a single video. The triplet comprises a center, a positive example, and a negative example and is generated by a sampling strategy; its purpose is to make the network learn representations that do not depend on background information and that are consistent with the time sequence information. By generating time sequence triplets from a single video, time sequence similar pairs and time sequence dissimilar pairs are constructed to guide the neural network to learn an effective video representation; meanwhile, an efficient optimization method is provided to improve the representation capability of the model on the video.
The negative example is acquired from the sample video, and the sampling formula of the negative example is as follows
|t_a - t_n| > τ
where t_a denotes the start time of the first video segment, t_n denotes the start time of the third video segment, and τ is an interval time constant.
To speed up the sampling of time sequence dissimilar pairs, the negative examples sampled in each computation of the time sequence discrimination learning loss, paired with the centers of other samples, can also be regarded as time sequence dissimilar pairs. The negative example may also be sampled from a sample video different from that of the center, but a negative example taken from the same sample video as the center gives a better effect.
The calculation formula of the time sequence discrimination learning loss is as follows:
where the loss term denotes the time sequence discrimination learning loss function, d denotes the feature similarity with the center, v denotes an input feature, and the subscripts a, n and p denote the center, the negative example and the positive example, respectively. After the time sequence discrimination learning loss is computed, the neural network parameters are updated by back-propagating this loss.
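As a non-authoritative sketch, an InfoNCE-style contrastive form over the triplet is assumed below; it is one common instantiation of such a loss and is not necessarily the exact formula claimed, and the temperature parameter is likewise an assumption:

    # Assumed InfoNCE-style time sequence discrimination loss; a sketch, not the claimed formula.
    import math

    def cosine_sim(u, w):
        dot = sum(a * b for a, b in zip(u, w))
        nu = math.sqrt(sum(a * a for a in u))
        nw = math.sqrt(sum(b * b for b in w))
        return dot / (nu * nw + 1e-8)

    def timing_discrimination_loss(v_a, v_p, v_negs, temperature=0.07):
        # v_a: center feature, v_p: positive feature, v_negs: list of negative features.
        pos = math.exp(cosine_sim(v_a, v_p) / temperature)
        negs = sum(math.exp(cosine_sim(v_a, v_n) / temperature) for v_n in v_negs)
        # Minimizing this loss raises the similarity of the (center, positive) pair
        # and lowers the similarity of the (center, negative) pairs.
        return -math.log(pos / (pos + negs))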
The step S2 includes: reducing the spatial similarity of the positive example by spatially perturbing each frame of the second video segment; and/or, after S3, the method includes: caching the calculated center video features. To learn video representations more efficiently, embodiments may employ data enhancement and the memory cache module 50 to cache the centers. The data enhancement spatially perturbs each frame of the video so that the spatial similarity of the positive example video is as low as possible while its original time sequence information remains consistent, where the spatial similarity of the positive example video refers to its spatial similarity to the center; the specific method is to randomly select one background frame from the whole sample video and superimpose it, by interpolation, onto each frame of the video. The center feature calculated each time is added to the memory cache module 50, which improves the efficiency of the next computation of the time sequence discrimination learning loss.
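A minimal sketch of the memory cache is given below, assuming a fixed-size first-in-first-out queue of center features; the queue size and method names are assumptions:

    # Assumed FIFO memory cache for computed center features; size and names are illustrative.
    from collections import deque

    class CenterFeatureCache:
        def __init__(self, max_size=4096):
            self.queue = deque(maxlen=max_size)   # oldest cached features are discarded automatically

        def push(self, center_feature):
            self.queue.append(center_feature)     # cache the center feature computed in this step

        def negatives(self):
            # Cached centers of other samples can serve as additional time sequence dissimilar pairs.
            return list(self.queue)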
The specific working principle of the data enhancement is as follows: since video is a 3D signal, it contains information at two levels, one-dimensional time and two-dimensional space, and the temporal and spatial dimensions are naturally asymmetric. Time information is ambiguous and abstract, and difficult to define and identify. In early methods that classified videos based on hand-crafted video features, inter-frame differences were used to provide useful motion cues. Along these lines, the time derivative can be used to measure changes in the time information. In particular, a video can be regarded as a spatio-temporal function, and its time derivative of any order is unchanged when a time-constant term is added to the function and is scaled uniformly when the function is multiplied by a constant. By delving into video data enhancement, the embodiment of the invention designs a novel and effective data enhancement method for video, Temporal Consistent Augmentation (TCA). TCA avoids the need for real labels and can be extended to self-supervised and semi-supervised learning.
Based on TCA, the data enhancement learning method according to the embodiment of the present invention includes:
mixing a static image into each frame of the sample video according to the scale factor;
the calculation formula of the scale factor is as follows:
where α represents the scale factor; the function I represents the original video; the function Δ represents a randomly selected image frame; t is the time index of the video frame; x and y are the pixel indices of the video frame at time t; and k is the order of the derivative, k being a natural number.
The principle of the scale factor calculation formula is as follows: differentiation of the video with respect to the time dimension can be used to measure the extent and magnitude of changes in the time sequence information. A time sequence scaling effect is therefore introduced into the video. In particular, while preserving the time derivative, additional spatial context (an image) can be introduced into the spatio-temporal function (the video) with a scale factor α so that consistency of any order is maintained. That is, time-differential consistency can be kept by mixing a static image equally into every frame of the video. By selecting an appropriate scale factor, the similarity of the time cues across different spatial contexts is preserved. The scale factor α is uniform over all frames of the sample video, i.e., one fixed image frame is interpolated with every frame of the video.
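For illustration only, assume the mixing takes the linear form x̃(x, y, t) = α·I(x, y, t) + (1 - α)·Δ(x, y); this specific form is an assumption used to make the consistency argument concrete, not a quotation of the patented formula. The preservation of time derivatives of any order can then be written as

    \frac{\partial^{k}}{\partial t^{k}}\Big[\alpha\, I(x,y,t) + (1-\alpha)\,\Delta(x,y)\Big]
      = \alpha\,\frac{\partial^{k} I(x,y,t)}{\partial t^{k}}, \qquad k \ge 1,

because the static image Δ(x, y) does not depend on t and its time derivative vanishes; the temporal variation of the mixed video is therefore the temporal variation of the original video scaled uniformly by α at every order, which is the consistency property described above.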
By introducing video consistency regularization and blending the image into each frame, the spatial distribution of pixels is changed while the similarity of the temporal variation is maintained. Taking the length of a video into account, a 0-1 mask and a global noise image are used. Specifically, the data enhancement learning method further includes:
calculating the video frame at each time of the sample video through the scale factor, where the calculation involves the video frame of video i at time j, the video length L, the 0-1 mask, the global noise, and the scale factor α, consistent with the α described above.
The scale factor α is randomly sampled from a uniform distribution over [0.5, 1], and the 0-1 mask and the global noise have the same size as the first frame image of the sample video.
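A non-authoritative sketch of this per-frame computation follows, assuming (as above) a linear mix of each frame with the masked global noise; the array shapes, names, and exact combination rule are assumptions:

    # Assumed per-frame mixing with a 0-1 mask and a global noise image; a sketch only.
    import numpy as np

    def mix_frame(frame, noise, mask, alpha):
        """frame, noise: (H, W, C) arrays; mask: (H, W) 0/1 array; alpha: scale factor in [0.5, 1]."""
        m = mask[..., None]                              # broadcast the mask over the channel axis
        return alpha * frame + (1.0 - alpha) * (m * noise)

    def augment_video(video, noise, mask, alpha):
        # The same alpha, mask, and noise are applied to every frame,
        # so the temporal changes of the video are scaled uniformly.
        return np.stack([mix_frame(f, noise, mask, alpha) for f in video])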
The data enhancement learning method specifically further includes the following steps:
randomly selecting a preset area of fixed size and setting the mask values inside the preset area to 0, where the preset area covers no more than 0.1 of the whole static image area;
The random selection follows uniform sampling. By setting the mask values of the selected preset area to 0, the corresponding pixels are erased.
setting all elements of the mask to 1 and randomly selecting an image frame from the sample video as the global noise;
With the mask set entirely to 1, no erasing operation is performed.
setting the mask to 1 for all frames and randomly selecting a frame from a video other than the sample video as the global noise. Specifically, frames of videos other than the sample video can be selected from the mini-batch during training, or a frame of an arbitrary video can be randomly selected as the global noise.
Choosing the global noise both across samples and within a sample greatly enriches the diversity of spatial contexts. In the present invention, Temporal Consistent Augmentation (TCA) is a cascade of these three data enhancements, i.e., the three steps are executed in sequence.
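An illustrative cascade of the three steps is sketched below, under the same assumptions as the per-frame mixing sketch above; the erased-area ratio, sampling ranges, and helper names are hypothetical:

    # Assumed TCA cascade: random erasing on the static image, intra-video noise, then cross-video noise.
    import random
    import numpy as np

    def random_erase(mask, max_area_ratio=0.1):
        # Zero out a randomly placed rectangle covering at most max_area_ratio of the mask.
        h, w = mask.shape
        area = max(1, int(h * w * random.uniform(0.0, max_area_ratio)))
        eh = min(h, max(1, int(np.sqrt(area))))
        ew = min(w, max(1, area // eh))
        y0, x0 = random.randint(0, h - eh), random.randint(0, w - ew)
        out = mask.copy()
        out[y0:y0 + eh, x0:x0 + ew] = 0.0
        return out

    def tca(video, other_video, mix_frame_fn):
        # video, other_video: (T, H, W, C) arrays; mix_frame_fn as in the sketch above.
        h, w = video.shape[1:3]
        alpha = random.uniform(0.5, 1.0)                   # scale factor drawn from U[0.5, 1]
        # Step 1: random erasing expressed through the 0-1 mask.
        mask = random_erase(np.ones((h, w), dtype=np.float32))
        # Step 2: intra-video global noise, a randomly chosen frame of the same video.
        noise = video[random.randrange(len(video))]
        out = np.stack([mix_frame_fn(f, noise, mask, alpha) for f in video])
        # Step 3: cross-video global noise, a randomly chosen frame of another video, with a full mask.
        full_mask = np.ones((h, w), dtype=np.float32)
        noise2 = other_video[random.randrange(len(other_video))]
        return np.stack([mix_frame_fn(f, noise2, full_mask, alpha) for f in out])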
Further, the learning of the whole model can be guided by the consistency between the original sample and the generated sample. The embodiment of the invention therefore provides a data enhancement training method, in which the generated sample is obtained by the data enhancement method described above, and which further includes: training the consistency between the generated sample and the sample video through deep learning.
Training the consistency between the generated sample and the sample video through deep learning specifically includes the following steps:
randomly shuffling all sample videos in the training set, and taking batches of data from the shuffled sample videos; the training set contains a plurality of sample videos, and the random shuffling is realized by uniformly distributed sampling.
Randomly shuffling the batched data, and performing the data enhancement learning on each sample video to obtain a generated sample; the random shuffling is realized by uniformly distributed sampling.
Inputting the sample video and the generated sample respectively into a training model to obtain two output values, measuring the difference between the two output values with a squared loss function, and performing gradient descent on the training model based on this difference. The training model refers to a deep learning neural network, and the finally learned model is more sensitive to time sequence information.
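A hedged sketch of this consistency training step is given below; the use of PyTorch, the encoder, the optimizer, and the batch layout are assumptions and not specified by the invention:

    # Illustrative consistency-training step with a squared loss; encoder and optimizer are placeholders.
    import torch
    import torch.nn.functional as F

    def consistency_train_step(encoder, optimizer, batch_videos, augment_fn):
        # batch_videos: float tensor of shape (B, T, C, H, W); augment_fn applies the data enhancement.
        encoder.train()
        generated = torch.stack([augment_fn(v) for v in batch_videos])   # generated (enhanced) samples
        out_orig = encoder(batch_videos)
        out_aug = encoder(generated)
        loss = F.mse_loss(out_orig, out_aug)     # squared loss between the two output values
        optimizer.zero_grad()
        loss.backward()                          # gradient descent on the training model
        optimizer.step()
        return loss.item()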
The novel video data enhancement method TCA is used to guide the learning objective of the whole neural network, and TCA can be easily integrated into any neural network. In addition, TCA can be realized through simple matrix operations at a very small computational cost; the method of the embodiment of the invention achieves the best effect on three datasets, which verifies the effectiveness of the data enhancement method.
As shown in fig. 2, an embodiment of the present invention provides a video representation learning device, which includes a time sequence triplet collection module 10, a feature extraction module 20, and a network optimization module 30, wherein the feature extraction module 20 is respectively connected with the time sequence triplet collection module 10 and the network optimization module 30, and the video representation learning device includes:
the time sequence triplet collection module 10 is used for collecting a first video segment preset in a sample video as a center, and collecting a second video segment which is in the same time segment and different spatial positions with the center in the sample video as a positive example; collecting a third video segment which is different from the center as a negative example;
the feature extraction module 20 is configured to extract a center video feature, a positive example video feature, and a negative example video feature according to the first video segment, the second video segment, and the third video segment, respectively;
the network optimization module 30 is configured to calculate a time sequence discrimination learning loss for the center video feature, the positive example video feature, and the negative example video feature, to improve feature similarity of a time sequence similarity pair, to reduce feature similarity of a time sequence dissimilarity pair, and to optimize a neural network, where the time sequence similarity pair includes the center and the positive example, and the time sequence dissimilarity pair includes the center and the negative example.
As shown in fig. 3, the video characterization learning device further includes a data enhancement module 40 connected to the feature extraction module 20, where the data enhancement module 40 is configured to reduce the spatial similarity of the positive examples by spatially perturbing each frame of the second video segment; and/or further comprises a memory caching module 50 for caching the computed central video features.
The working principle of the video characterization learning device in this embodiment is corresponding to the video characterization learning method in the foregoing embodiment, and will not be described in detail here.
As shown in fig. 4, an embodiment of the present invention provides a video characterization pre-training method, including:
S4: taking downstream behavior videos as the training part of the video representation learning device, and performing identification by the video representation learning method;
S5: verifying the performance of the video representation learning device through the identification result of S4.
The method provides a new unsupervised video representation learning approach that guides a neural network to learn efficient video representations from a large amount of unlabeled video data for downstream video-related tasks; that is, the deep learning neural network obtained by the video representation learning method is used as a pre-training model. The method achieves the best effect on three behavior recognition task datasets, and performance comparable to supervised learning can be reached with only thousands of unlabeled samples, which verifies the effectiveness of the unsupervised video representation learning method.
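A minimal sketch of using the pre-trained encoder for a downstream behavior recognition task follows; the classifier head, feature dimension, and class count are assumptions:

    # Illustrative downstream fine-tuning of the pre-trained video encoder; names and sizes are placeholders.
    import torch.nn as nn

    class ActionRecognizer(nn.Module):
        def __init__(self, pretrained_encoder, feature_dim=512, num_classes=101):
            super().__init__()
            self.encoder = pretrained_encoder        # encoder obtained from unsupervised pre-training
            self.head = nn.Linear(feature_dim, num_classes)

        def forward(self, clips):                    # clips: (B, C, T, H, W)
            features = self.encoder(clips)           # (B, feature_dim)
            return self.head(features)

    # Fine-tuning then uses an ordinary supervised objective on the downstream labels, e.g.
    # logits = model(clips); loss = nn.functional.cross_entropy(logits, labels).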
The embodiment of the invention provides a video representation pre-training device, which comprises the video representation learning device and further comprises an identification module and a verification module, wherein the identification module is configured to take downstream behavior videos as the training part of the video representation learning device and perform identification through the video representation learning device, and the verification module is configured to verify the performance of the video representation learning device through the identification result of the identification module.
The working principle of the video characterization pre-training device in this embodiment is corresponding to that of the video characterization pre-training method in the foregoing embodiment, and will not be described in detail here.
Fig. 5 illustrates a physical schematic diagram of an electronic device, which may include: processor 810, communication interface (Communications Interface) 820, memory 830, and communication bus 840, wherein processor 810, communication interface 820, memory 830 accomplish communication with each other through communication bus 840. The processor 810 may invoke logic instructions in the memory 830 to perform a video characterization learning method comprising:
S1: collecting a preset first video segment in a sample video as a center, and collecting a second video segment that is in the same time segment as the center but at a different spatial position in the sample video as a positive example; collecting a third video segment that differs from the center as a negative example;
S2: extracting a center video feature, a positive example video feature, and a negative example video feature according to the first video segment, the second video segment, and the third video segment, respectively;
S3: calculating a time sequence discrimination learning loss for the center video feature, the positive example video feature, and the negative example video feature, increasing the feature similarity of time sequence similar pairs, reducing the feature similarity of time sequence dissimilar pairs, and optimizing a neural network, wherein a time sequence similar pair comprises the center and the positive example, and a time sequence dissimilar pair comprises the center and the negative example.
Further, the logic instructions in the memory 830 described above may be implemented in the form of software functional units and may be stored in a computer-readable storage medium when sold or used as a stand-alone product. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium and comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or other media capable of storing program code.
In another aspect, embodiments of the present invention also provide a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform a video characterization learning method provided by the above method embodiments, the method comprising:
S1: collecting a preset first video segment in a sample video as a center, and collecting a second video segment that is in the same time segment as the center but at a different spatial position in the sample video as a positive example; collecting a third video segment that differs from the center as a negative example;
S2: extracting a center video feature, a positive example video feature, and a negative example video feature according to the first video segment, the second video segment, and the third video segment, respectively;
S3: calculating a time sequence discrimination learning loss for the center video feature, the positive example video feature, and the negative example video feature, increasing the feature similarity of time sequence similar pairs, reducing the feature similarity of time sequence dissimilar pairs, and optimizing a neural network, wherein a time sequence similar pair comprises the center and the positive example, and a time sequence dissimilar pair comprises the center and the negative example.
In yet another aspect, embodiments of the present invention further provide a non-transitory computer readable storage medium having stored thereon a computer program, which when executed by a processor is implemented to perform a video characterization learning method provided by the above embodiments, the method comprising:
S1: collecting a preset first video segment in a sample video as a center, and collecting a second video segment that is in the same time segment as the center but at a different spatial position in the sample video as a positive example; collecting a third video segment that differs from the center as a negative example;
S2: extracting a center video feature, a positive example video feature, and a negative example video feature according to the first video segment, the second video segment, and the third video segment, respectively;
S3: calculating a time sequence discrimination learning loss for the center video feature, the positive example video feature, and the negative example video feature, increasing the feature similarity of time sequence similar pairs, reducing the feature similarity of time sequence dissimilar pairs, and optimizing a neural network, wherein a time sequence similar pair comprises the center and the positive example, and a time sequence dissimilar pair comprises the center and the negative example.
The apparatus embodiments described above are merely illustrative, wherein the elements illustrated as separate elements may or may not be physically separate, and the elements shown as elements may or may not be physical elements, may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art will understand and implement the present invention without undue burden.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (9)

1. A method of video characterization learning, comprising:
s1: collecting a first video segment preset in a sample video as a center, and collecting a second video segment which is in the same time segment and different spatial positions with the center in the sample video as a positive example; collecting a third video segment which is different from the center as a negative example;
s2: extracting center video features, positive example video features and negative example video features according to the first video segment, the second video segment and the third video segment respectively;
s3: calculating time sequence discrimination learning loss for the center video feature, the positive example video feature and the negative example video feature, improving the feature similarity of time sequence similarity pairs, reducing the feature similarity of time sequence dissimilarity pairs, and optimizing a neural network, wherein the time sequence similarity pairs comprise the center and the positive example, and the time sequence dissimilarity pairs comprise the center and the negative example;
the calculation formula of the time sequence discrimination learning loss is as follows:
wherein the loss term represents the time sequence discrimination learning loss function, the function d represents feature similarity with the center, v represents an input feature, and a, n and p represent the center, the negative example and the positive example, respectively;
the step S2 includes: the spatial similarity of the positive examples is reduced by spatially perturbing each frame of the second video segment.
2. The method according to claim 1, wherein the negative example is obtained from the sample video, the sampling formula of the negative example being
|t_a - t_n| > τ
wherein t_a represents the start time of the first video segment, t_n represents the start time of the third video segment, and τ is an interval time constant.
3. The video characterization learning method according to claim 1, wherein the S3 thereafter comprises: and caching the calculated central video features.
4. The video representation learning device is characterized by comprising a time sequence triplet acquisition module, a feature extraction module and a network optimization module, wherein the feature extraction module is respectively connected with the time sequence triplet acquisition module and the network optimization module, and the video representation learning device comprises the following components:
the time sequence triplet collection module is used for collecting a first video segment preset in a sample video as a center, and collecting a second video segment which is in the same time segment and different spatial positions with the center in the sample video as a positive example; collecting a third video segment which is different from the center as a negative example;
the feature extraction module is used for extracting center video features, positive example video features and negative example video features according to the first video segment, the second video segment and the third video segment respectively;
the network optimization module is used for calculating time sequence discrimination learning loss for the center video feature, the positive example video feature and the negative example video feature, improving the feature similarity of time sequence similarity pairs, reducing the feature similarity of time sequence dissimilarity pairs, and optimizing a neural network, wherein the time sequence similarity pairs comprise the center and the positive example, and the time sequence dissimilarity pairs comprise the center and the negative example;
the calculation formula of the time sequence discrimination learning loss is as follows:
wherein the loss term represents the time sequence discrimination learning loss function, the function d represents feature similarity with the center, v represents an input feature, and a, n and p represent the center, the negative example and the positive example, respectively;
the system further comprises a data enhancement module connected with the feature extraction module, wherein the data enhancement module is used for reducing the spatial similarity of the positive examples by performing spatial disturbance on each frame of the second video segment.
5. The video characterization learning device of claim 4, further comprising a memory caching module for caching the computed center video features.
6. A method of video characterization pre-training, comprising:
S4: taking a downstream behavior video as a training part of the video representation learning device, and performing identification by the video representation learning method according to any one of claims 1-3;
S5: verifying the performance of the video representation learning device through the identification result of S4.
7. A video characterization pre-training device, comprising the video characterization learning device according to claim 4 or 5, and further comprising an identification module and a verification module, wherein the identification module is used for taking a downstream behavior video as a training part of the video characterization learning device to be identified by the video characterization learning device, and the verification module is used for verifying the performance of the video characterization learning device through the identification result of the identification module.
8. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the video characterization learning method of any of claims 1-3 when the program is executed by the processor.
9. A non-transitory computer readable storage medium having stored thereon a computer program, which when executed by a processor, implements the steps of the video representation learning method of any of claims 1-3.
CN202010772759.7A 2020-08-04 2020-08-04 Video characterization learning and pre-training method and device, electronic equipment and storage medium Active CN112016682B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010772759.7A CN112016682B (en) 2020-08-04 2020-08-04 Video characterization learning and pre-training method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010772759.7A CN112016682B (en) 2020-08-04 2020-08-04 Video characterization learning and pre-training method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112016682A CN112016682A (en) 2020-12-01
CN112016682B true CN112016682B (en) 2024-01-26

Family

ID=73500163

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010772759.7A Active CN112016682B (en) 2020-08-04 2020-08-04 Video characterization learning and pre-training method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112016682B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112561060B (en) * 2020-12-15 2022-03-22 北京百度网讯科技有限公司 Neural network training method and device, image recognition method and device and equipment
CN113723607A (en) * 2021-06-02 2021-11-30 京东城市(北京)数字科技有限公司 Training method, device and equipment of space-time data processing model and storage medium
CN113837260A (en) * 2021-09-17 2021-12-24 北京百度网讯科技有限公司 Model training method, object matching method, device and electronic equipment
CN113569824B (en) * 2021-09-26 2021-12-17 腾讯科技(深圳)有限公司 Model processing method, related device, storage medium and computer program product
CN114596312B (en) * 2022-05-07 2022-08-02 中国科学院深圳先进技术研究院 Video processing method and device
CN115115972A (en) * 2022-05-25 2022-09-27 腾讯科技(深圳)有限公司 Video processing method, video processing apparatus, computer device, medium, and program product

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106991372A (en) * 2017-03-02 2017-07-28 北京工业大学 A kind of dynamic gesture identification method based on interacting depth learning model
CN108122234A (en) * 2016-11-29 2018-06-05 北京市商汤科技开发有限公司 Convolutional neural networks training and method for processing video frequency, device and electronic equipment
CN109948561A (en) * 2019-03-25 2019-06-28 广东石油化工学院 The method and system that unsupervised image/video pedestrian based on migration network identifies again
CN110263697A (en) * 2019-06-17 2019-09-20 哈尔滨工业大学(深圳) Pedestrian based on unsupervised learning recognition methods, device and medium again
KR20190125029A (en) * 2018-04-27 2019-11-06 성균관대학교산학협력단 Methods and apparatuses for generating text to video based on time series adversarial neural network
CN110689483A (en) * 2019-09-24 2020-01-14 重庆邮电大学 Image super-resolution reconstruction method based on depth residual error network and storage medium
CN111079646A (en) * 2019-12-16 2020-04-28 中山大学 Method and system for positioning weak surveillance video time sequence action based on deep learning
CN111160380A (en) * 2018-11-07 2020-05-15 华为技术有限公司 Method for generating video analysis model and video analysis system
CN111274445A (en) * 2020-01-20 2020-06-12 山东建筑大学 Similar video content retrieval method and system based on triple deep learning
CN111444826A (en) * 2020-03-25 2020-07-24 腾讯科技(深圳)有限公司 Video detection method and device, storage medium and computer equipment

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108122234A (en) * 2016-11-29 2018-06-05 北京市商汤科技开发有限公司 Convolutional neural networks training and method for processing video frequency, device and electronic equipment
CN106991372A (en) * 2017-03-02 2017-07-28 北京工业大学 A kind of dynamic gesture identification method based on interacting depth learning model
KR20190125029A (en) * 2018-04-27 2019-11-06 성균관대학교산학협력단 Methods and apparatuses for generating text to video based on time series adversarial neural network
CN111160380A (en) * 2018-11-07 2020-05-15 华为技术有限公司 Method for generating video analysis model and video analysis system
CN109948561A (en) * 2019-03-25 2019-06-28 广东石油化工学院 The method and system that unsupervised image/video pedestrian based on migration network identifies again
CN110263697A (en) * 2019-06-17 2019-09-20 哈尔滨工业大学(深圳) Pedestrian based on unsupervised learning recognition methods, device and medium again
CN110689483A (en) * 2019-09-24 2020-01-14 重庆邮电大学 Image super-resolution reconstruction method based on depth residual error network and storage medium
CN111079646A (en) * 2019-12-16 2020-04-28 中山大学 Method and system for positioning weak surveillance video time sequence action based on deep learning
CN111274445A (en) * 2020-01-20 2020-06-12 山东建筑大学 Similar video content retrieval method and system based on triple deep learning
CN111444826A (en) * 2020-03-25 2020-07-24 腾讯科技(深圳)有限公司 Video detection method and device, storage medium and computer equipment

Also Published As

Publication number Publication date
CN112016682A (en) 2020-12-01

Similar Documents

Publication Publication Date Title
CN112016682B (en) Video characterization learning and pre-training method and device, electronic equipment and storage medium
US11200424B2 (en) Space-time memory network for locating target object in video content
CN111444878B (en) Video classification method, device and computer readable storage medium
CN111192292B (en) Target tracking method and related equipment based on attention mechanism and twin network
Gao et al. Domain-adaptive crowd counting via high-quality image translation and density reconstruction
CN108241854B (en) Depth video saliency detection method based on motion and memory information
CN111968123B (en) Semi-supervised video target segmentation method
Chen et al. Learning linear regression via single-convolutional layer for visual object tracking
CN113239869B (en) Two-stage behavior recognition method and system based on key frame sequence and behavior information
CN116686017A (en) Time bottleneck attention architecture for video action recognition
CN112560827B (en) Model training method, model training device, model prediction method, electronic device, and medium
CN110399826B (en) End-to-end face detection and identification method
CN110827312A (en) Learning method based on cooperative visual attention neural network
CN111742345A (en) Visual tracking by coloring
CN113159023A (en) Scene text recognition method based on explicit supervision mechanism
GB2579262A (en) Space-time memory network for locating target object in video content
CN112818904A (en) Crowd density estimation method and device based on attention mechanism
CN113988147A (en) Multi-label classification method and device for remote sensing image scene based on graph network, and multi-label retrieval method and device
Zhang et al. Unsupervised depth estimation from monocular videos with hybrid geometric-refined loss and contextual attention
CN114996495A (en) Single-sample image segmentation method and device based on multiple prototypes and iterative enhancement
CN113222903A (en) Full-section histopathology image analysis method and system
CN110942463B (en) Video target segmentation method based on generation countermeasure network
CN116740362A (en) Attention-based lightweight asymmetric scene semantic segmentation method and system
Wang et al. SCNet: Scale-aware coupling-structure network for efficient video object detection
CN116758449A (en) Video salient target detection method and system based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant