CN111950332A - Video time sequence positioning method and device, computing equipment and storage medium


Info

Publication number
CN111950332A
Authority
CN
China
Prior art keywords
event
video
sample
feature vector
time range
Prior art date
Legal status
Granted
Application number
CN201910412596.9A
Other languages
Chinese (zh)
Other versions
CN111950332B (en)
Inventor
许昀璐
程战战
钮毅
Current Assignee
Hangzhou Hikvision Digital Technology Co Ltd
Original Assignee
Hangzhou Hikvision Digital Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Hangzhou Hikvision Digital Technology Co Ltd
Priority to CN201910412596.9A
Publication of CN111950332A
Application granted
Publication of CN111950332B
Status: Active
Anticipated expiration

Classifications

    • G06V20/41: Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06N3/045: Combinations of networks
    • G06N3/084: Backpropagation, e.g. using gradient descent
    • G06V20/46: Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06V20/48: Matching video sequences
    • G06V20/52: Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06V20/44: Event detection

Abstract

The application discloses a video time sequence positioning method and apparatus, a computing device, and a storage medium, belonging to the field of video surveillance. The method includes inputting a plurality of first video segments cut from a surveillance video into an event positioning model and determining, according to the model, the time range in which each first event occurs in the surveillance video. The event positioning model is trained based on a plurality of second video segments included in a first sample video and at least one second event labeled in the first sample video. Because each frame of image does not need to be labeled during training, time sequence positioning with this model only needs to identify and position the event in each first video segment of the surveillance video rather than every frame of image, which shortens the time required for video time sequence positioning and improves its efficiency.

Description

Video time sequence positioning method and device, computing equipment and storage medium
Technical Field
The present application relates to the field of video surveillance, and in particular to a video time sequence positioning method and apparatus, a computing device, and a storage medium.
Background
With the popularization of video surveillance systems, the volume of surveillance video data has grown enormously. A single surveillance video may contain video data of a plurality of events, for example events A, B, and C. When relevant personnel want to view the surveillance record of a particular event in the video, the time range in which that event occurs must be determined: to view the record of event A, the time range of event A in the surveillance video needs to be determined; to view the record of event B, the time range of event B needs to be determined.
In the related art, time sequence positioning of a surveillance video is performed by first obtaining a time sequence positioning model, then inputting the surveillance video to be positioned into the model, which outputs the time range of each event in the video. Training the time sequence positioning model mainly includes: obtaining a surveillance video, manually labeling the category of each frame of image in the video to obtain a sample video, and performing model training on the frame-by-frame labels of the sample video to finally obtain the time sequence positioning model.
However, in the related art, every frame of image needs to be labeled when the model is trained, and every frame of image in the surveillance video needs to be identified and positioned one by one when the model performs time sequence positioning, so time sequence positioning of the surveillance video is time-consuming and inefficient.
Disclosure of Invention
The embodiments of the application provide a video time sequence positioning method and apparatus, a computing device, and a storage medium, which can solve the problem that time sequence positioning of a surveillance video is time-consuming and inefficient. The technical solution is as follows:
in one aspect, a video timing positioning method is provided, and the method includes:
when the time sequence positioning is carried out on the monitoring video to be positioned, the monitoring video is cut into a plurality of first video segments;
inputting each first video segment into an event positioning model to obtain a first event corresponding to each first video segment in the monitoring video, wherein the event positioning model is obtained by training based on a plurality of second video segments included in a first sample video and at least one second event labeled in the first sample video;
and determining the time range of each first event in the monitoring video according to the first event corresponding to each first video segment and the time range of each first video segment.
In one possible implementation, the method further includes:
cutting the first sample video into a plurality of second video segments, wherein at least one second event is marked in the first sample video;
identifying at least one third event from the first sample video according to first feature vectors corresponding to a plurality of second video segments included in the first sample video;
and performing model training according to at least one third event identified from the first sample video, at least one labeled second event and the plurality of second video segments to obtain the event positioning model.
In another possible implementation manner, the identifying, from the first sample video, at least one third event according to the first feature vector corresponding to a plurality of second video segments included in the first sample video includes:
acquiring a first feature vector corresponding to each second video segment in the first sample video;
for the first feature vector of each second video segment, determining a plurality of second feature vectors corresponding to the first sample video;
and identifying at least one third event from the first sample video according to a plurality of second feature vectors corresponding to the first sample video.
In another possible implementation manner, the determining, for the first feature vector of each second video segment, a plurality of second feature vectors corresponding to the first sample video includes:
determining the weight of each second video segment according to the first feature vector of each second video segment;
determining a first confidence coefficient between the first feature vectors of any two second video segments according to the first feature vector of each second video segment;
and weighting, according to the weight of each second video segment, the first feature vectors whose first confidences exceed a preset threshold among the first feature vectors of each second video segment, to obtain a plurality of second feature vectors.
In another possible implementation manner, the identifying at least one third event from the first sample video according to a plurality of second feature vectors corresponding to the first sample video includes:
for each second feature vector, determining a second confidence between the second feature vector and each specified event according to the second feature vector;
according to a second confidence degree between the second feature vector and each specified event, selecting the specified event with the highest confidence degree with the second feature vector from each specified event;
and taking the selected specified event as a third event corresponding to the second feature vector.
In another possible implementation manner, before inputting each first video segment into an event positioning model and obtaining a first event corresponding to each first video segment in the surveillance video, the method further includes:
acquiring a second sample video, wherein at least one fourth event and the time range of each fourth event are marked in the second sample video;
inputting the second sample video into the event localization model, and outputting at least one fifth event of the second sample video and a time range of occurrence of each fifth event;
testing the event localization model according to the at least one fourth event and the time range of occurrence of each fourth event, and the at least one fifth event and the time range of occurrence of each fifth event;
and when the event positioning model is tested successfully, executing the step of inputting each first video clip into the event positioning model to obtain a first event corresponding to each first video clip in the monitoring video.
In another possible implementation manner, the testing the event localization model according to the time ranges of the at least one fourth event and the occurrence of each fourth event, and the time ranges of the at least one fifth event and the occurrence of each fifth event includes:
determining that the event localization model test is successful when the at least one fourth event matches the at least one fifth event and the time range of each fourth event occurrence matches the time range of each fifth event occurrence.
In another possible implementation manner, the determining, according to the first event corresponding to each first video segment and the time range of each first video segment, the time range in which each first event in the surveillance video occurs includes:
and for each first event in the monitoring video, according to at least one first video segment corresponding to the first event, taking the time range of the at least one first video segment corresponding to the first event as the time range of the occurrence of the first event.
In another aspect, a video timing positioning apparatus is provided, the apparatus comprising:
the cutting module is used for cutting the monitoring video to be positioned into a plurality of first video segments when the monitoring video to be positioned is subjected to time sequence positioning;
the input module is used for inputting each first video segment into an event positioning model to obtain a first event corresponding to each first video segment in the surveillance video, wherein the event positioning model is obtained by training based on a plurality of second video segments included in a first sample video and at least one second event labeled in the first sample video;
and the determining module is used for determining the time range of each first event in the monitoring video according to the first event corresponding to each first video segment and the time range of each first video segment.
In one possible implementation, the apparatus further includes:
the cutting module is further configured to cut the first sample video into a plurality of second video segments, where at least one second event is marked in the first sample video;
the identification module is used for identifying at least one third event from the first sample video according to the first feature vectors corresponding to a plurality of second video segments included in the first sample video;
and the training module is used for carrying out model training according to at least one third event identified from the first sample video, at least one labeled second event and the plurality of second video segments to obtain the event positioning model.
In another possible implementation manner, the identification module is further configured to obtain a first feature vector corresponding to each second video segment in the first sample video; for the first feature vector of each second video segment, determining a plurality of second feature vectors corresponding to the first sample video; and identifying at least one third event from the first sample video according to a plurality of second feature vectors corresponding to the first sample video.
In another possible implementation manner, the identification module is further configured to determine a weight of each second video segment according to the first feature vector of each second video segment; determine a first confidence between the first feature vectors of any two second video segments according to the first feature vector of each second video segment; and weight, according to the weight of each second video segment, the first feature vectors whose first confidences exceed a preset threshold among the first feature vectors of each second video segment, to obtain a plurality of second feature vectors.
In another possible implementation manner, the identification module is further configured to determine, for each second feature vector, a second confidence between the second feature vector and each specified event according to the second feature vector; select, from the specified events, the specified event with the highest confidence with the second feature vector according to the second confidence between the second feature vector and each specified event; and take the selected specified event as a third event corresponding to the second feature vector.
In another possible implementation manner, the apparatus further includes:
the acquisition module is used for acquiring a second sample video, wherein at least one fourth event and the time range of each fourth event are marked in the second sample video;
the input module is further used for inputting the second sample video into the event positioning model, and outputting at least one fifth event of the second sample video and a time range of occurrence of each fifth event;
the testing module is used for testing the event positioning model according to the at least one fourth event and the time range of occurrence of each fourth event, and the at least one fifth event and the time range of occurrence of each fifth event;
the input module is further configured to input each first video segment into the event positioning model when the event positioning model is successfully tested, so as to obtain a first event corresponding to each first video segment in the surveillance video.
In another possible implementation manner, the testing module is further configured to determine that the event location model test is successful when the at least one fourth event matches with the at least one fifth event, and the time range of occurrence of each fourth event matches with the time range of occurrence of each fifth event.
In another possible implementation manner, the determining module is further configured to, for each first event in the surveillance video, according to at least one first video segment corresponding to the first event, use a time range of the at least one first video segment corresponding to the first event as a time range in which the first event occurs.
In another aspect, a computing device is provided, the computing device comprising:
a processor and a memory, the memory having stored therein at least one instruction, at least one program, set of codes, or set of instructions, the instruction, the program, the set of codes, or the set of instructions being loaded and executed by the processor to implement the operations performed by any of the above-described video timing positioning methods.
In another aspect, a computer-readable storage medium is provided, in which at least one instruction, at least one program, a set of codes, or a set of instructions is stored, which is loaded and executed by a processor to implement the operations performed by any of the above video timing positioning methods.
The technical scheme provided by the embodiment of the application has the following beneficial effects:
according to the video time sequence positioning method provided by the embodiment of the application, when time sequence positioning is carried out on a surveillance video to be positioned, the surveillance video is cut into a plurality of first video segments, each first video segment is input into an event positioning model, a first event corresponding to each first video segment in the surveillance video is obtained, and the time range of each first event in the surveillance video is determined according to the first event corresponding to each first video segment and the time range of each first video segment. The event positioning model is trained based on a plurality of second video segments included in the first sample video and at least one second event marked in the first sample video. The model used in the method is an event positioning model, and the model only needs to acquire a plurality of second video segments and at least one second event included in the first sample video during training without labeling each frame of image, so that when the model is used for time sequence positioning, only the event in each first video segment in the monitoring video needs to be identified and positioned, and each frame of image in the monitoring video does not need to be identified and positioned, thereby shortening the time length of video time sequence positioning and improving the efficiency of video time sequence positioning.
Drawings
FIG. 1 is a schematic illustration of an implementation environment provided by an embodiment of the present application;
fig. 2 is a flowchart of a video timing positioning method according to an embodiment of the present application;
FIG. 3 is a flowchart of an event positioning model training method according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a computing device training a model provided by an embodiment of the present application;
FIG. 5 is a schematic diagram of a computing device determining weights of a first video segment through an attention mechanism network according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a computing device determining a second feature vector through an attention mechanism network according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a computing device determining a plurality of second events in a first sample video according to an embodiment of the present application;
fig. 8 is a flowchart of a video timing positioning method according to an embodiment of the present application;
FIG. 9 is a schematic diagram of a computing device determining a time range of each third event in a surveillance video according to an embodiment of the present application;
FIG. 10 is a schematic structural diagram of a video timing positioning apparatus according to an embodiment of the present application;
fig. 11 is a block diagram of a computing device according to an embodiment of the present disclosure.
Detailed Description
In order to make the technical solutions and advantages of the present application more clear, the following describes the embodiments of the present application in further detail.
The embodiment of the application provides an implementation environment for video time sequence positioning, which includes a computing device 101. The computing device 101 may be a terminal or a server, and is not particularly limited in the embodiments of the application. For ease of distinction, when the computing device 101 is a server, it is referred to as a first server; when the computing device 101 is a terminal, the server that trains the event positioning model is referred to as the second server 102.
When the computing device 101 is the first server, the first server may train an initial model to obtain the event positioning model. When time sequence positioning needs to be performed on a surveillance video, the first server cuts the surveillance video into a plurality of first video segments, inputs each first video segment into the event positioning model, and determines, through the event positioning model, the time range in which each first event occurs in the surveillance video to be positioned. When the computing device 101 is a terminal, in one possible implementation the terminal itself trains the initial model to obtain the event positioning model and then performs the same cutting, input, and time range determination when time sequence positioning is required. In another possible implementation, the terminal obtains the event positioning model trained by the second server 102, to which it may be wirelessly connected; when time sequence positioning is required, the terminal cuts the surveillance video into a plurality of first video segments, inputs each first video segment into the event positioning model, and determines the time range in which each first event occurs in the surveillance video to be positioned. Correspondingly, when the terminal obtains the event positioning model trained by the second server 102, the implementation environment also includes the second server 102, see fig. 1. This is not particularly limited in the embodiments of the application.
In the related art, the training process of an event positioning model mainly includes manually labeling the category of each frame of image in a surveillance video to obtain a sample video, performing model training on the sample video, and finally obtaining the event positioning model. However, manually labeling every frame of image in the sample video not only consumes a large amount of manpower and material resources, but also takes a long time, so the efficiency of model training is low.
In the embodiment of the present application, the initial model is trained to obtain the event location model based on a plurality of second video segments included in the first sample video and at least one second event labeled in the first sample video. The model used in the method is an event positioning model, and the model only needs to acquire a plurality of second video segments and at least one second event included in the first sample video during training without labeling each frame of image, so that when the model is used for time sequence positioning, only the event in each first video segment in the monitoring video needs to be identified and positioned, and each frame of image in the monitoring video does not need to be identified and positioned, thereby shortening the time length of video time sequence positioning and improving the efficiency of video time sequence positioning.
An embodiment of the present application provides a video timing positioning method, referring to fig. 2, the method includes:
step 201: when the time sequence positioning is carried out on the surveillance video to be positioned, the surveillance video is cut into a plurality of first video segments.
Step 202: and inputting each first video segment into an event positioning model to obtain a first event corresponding to each first video segment in the monitoring video, wherein the event positioning model is obtained by training based on a plurality of second video segments included in the first sample video and at least one second event labeled in the first sample video.
Step 203: and determining the time range of each first event in the monitoring video according to the first event corresponding to each first video segment and the time range of each first video segment.
In one possible implementation, the method further includes:
cutting the first sample video into a plurality of second video segments, wherein at least one second event is marked in the first sample video;
identifying at least one third event from the first sample video according to the first feature vectors corresponding to a plurality of second video segments included in the first sample video;
and carrying out model training according to at least one third event identified from the first sample video, at least one labeled second event and a plurality of second video segments to obtain an event positioning model.
In another possible implementation manner, identifying at least one third event from the first sample video according to the first feature vectors corresponding to the plurality of second video segments included in the first sample video includes:
acquiring a first feature vector corresponding to each second video segment in the first sample video;
for the first feature vector of each second video segment, determining a plurality of second feature vectors corresponding to the first sample video;
at least one third event is identified from the first sample video according to a plurality of second feature vectors corresponding to the first sample video.
In another possible implementation manner, for the first feature vector of each second video segment, determining a plurality of second feature vectors corresponding to the first sample video includes:
determining the weight of each second video segment according to the first feature vector of each second video segment;
determining a first confidence coefficient between the first feature vectors of any two second video segments according to the first feature vector of each second video segment;
and weighting, according to the weight of each second video segment, the first feature vectors whose first confidences exceed a preset threshold among the first feature vectors of each second video segment, to obtain a plurality of second feature vectors.
In another possible implementation manner, identifying at least one third event from the first sample video according to a plurality of second feature vectors corresponding to the first sample video includes:
for each second feature vector, determining a second confidence between the second feature vector and each specified event according to the second feature vector;
according to a second confidence degree between the second feature vector and each specified event, selecting the specified event with the highest confidence degree with the second feature vector from each specified event;
and taking the selected specified event as a third event corresponding to the second feature vector.
In another possible implementation manner, before each first video segment is input into the event positioning model and the first event corresponding to each first video segment in the surveillance video is obtained, the method further includes:
acquiring a second sample video, wherein at least one fourth event and the time range of each fourth event are marked in the second sample video;
inputting the second sample video into an event positioning model, and outputting at least one fifth event of the second sample video and the time range of each fifth event;
testing the event positioning model according to the at least one fourth event and the time range of each fourth event and the at least one fifth event and the time range of each fifth event;
and when the event positioning model is tested successfully, inputting each first video clip into the event positioning model to obtain a first event corresponding to each first video clip in the monitoring video.
In another possible implementation manner, the testing the event localization model according to the at least one fourth event and the time range of occurrence of each fourth event, and the at least one fifth event and the time range of occurrence of each fifth event includes:
and when the at least one fourth event is matched with the at least one fifth event, and the time range of the occurrence of each fourth event is matched with the time range of the occurrence of each fifth event, determining that the event positioning model test is successful.
In another possible implementation manner, determining a time range of occurrence of each first event in the surveillance video according to the first event corresponding to each first video segment and the time range of each first video segment includes:
and for each first event in the monitoring video, according to the at least one first video segment corresponding to the first event, taking the time range of the at least one first video segment corresponding to the first event as the time range of the occurrence of the first event.
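To illustrate this aggregation step, the following is a minimal sketch (not part of the claimed embodiments) of how the time range of each first event could be derived from the classified segments; the tuple layout, the use of None for background segments, and the helper name are assumptions made for illustration only.

```python
from collections import defaultdict

def event_time_ranges(segments):
    """Group classified first video segments by predicted event and derive
    the time range of each first event.

    `segments` is a hypothetical list of (event_label, start_sec, end_sec)
    tuples produced by the event positioning model; segments labelled as
    background (None) are ignored."""
    ranges = defaultdict(list)
    for event, start, end in segments:
        if event is None:          # background segment, no event
            continue
        ranges[event].append((start, end))

    # The time range of an event is the union of the time ranges of the
    # first video segments in which it was recognised.
    return {event: sorted(spans) for event, spans in ranges.items()}

# Example: segments 0-2s and 2-4s contain event A, 6-8s contains event B.
print(event_time_ranges([("A", 0, 2), ("A", 2, 4), (None, 4, 6), ("B", 6, 8)]))
# {'A': [(0, 2), (2, 4)], 'B': [(6, 8)]}
```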
According to the video time sequence positioning method provided by the embodiment of the application, when time sequence positioning is carried out on a surveillance video to be positioned, the surveillance video is cut into a plurality of first video segments, each first video segment is input into an event positioning model, a first event corresponding to each first video segment in the surveillance video is obtained, and the time range of each first event in the surveillance video is determined according to the first event corresponding to each first video segment and the time range of each first video segment. The event positioning model is trained based on a plurality of second video segments included in the first sample video and at least one second event marked in the first sample video. The model used in the method is an event positioning model, and the model only needs to acquire a plurality of second video segments and at least one second event included in the first sample video during training without labeling each frame of image, so that when the model is used for time sequence positioning, only the event in each first video segment in the monitoring video needs to be identified and positioned, and each frame of image in the monitoring video does not need to be identified and positioned, thereby shortening the time length of video time sequence positioning and improving the efficiency of video time sequence positioning.
An embodiment of the present application provides a training method for an event localization model, and referring to fig. 3, the method includes:
step 301: the computing device crops the first sample video into a plurality of second video segments, each first sample video having at least one first event annotated therein.
In the embodiment of the application, the computing device trains the initial model through the first sample video to finally obtain the event positioning model. The number of the first sample videos may be one or more, and in the embodiment of the present application, this is not particularly limited.
Prior to this step, the computing device first obtains a first sample video. After obtaining the first sample video, the computing device cuts the first sample video into a plurality of second video segments through the initial model, wherein at least one second event is marked in the first sample video, and each second video segment comprises at least one frame of image.
The computing device may crop in any manner when cropping the first sample video. For example, the computing device may clip by sample rate or number of sample frames. When the computing device crops the first sample video by sampling rate or number of sampling frames, the first sample video may be cropped at equal intervals or may be cropped at unequal intervals. For example, when the computing device crops by a sample rate and crops at unequal intervals, the computing device may crop at 2 seconds intervals when currently cropping; at the next cut, the cuts may be made at 5 second intervals. In the embodiments of the present application, this is not particularly limited.
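For illustration, a minimal sketch of equal-interval cropping by a fixed number of frames per segment is given below; the frame rate, segment length, and function name are assumptions rather than values prescribed by the embodiment.

```python
def cut_into_segments(num_frames, frames_per_segment=50):
    """Cut a video of `num_frames` frames into consecutive segments of
    `frames_per_segment` frames (equal-interval cropping). Returns a list
    of (first_frame_index, last_frame_index_exclusive) pairs; the last
    segment may be shorter."""
    return [(start, min(start + frames_per_segment, num_frames))
            for start in range(0, num_frames, frames_per_segment)]

# A 25 fps video cropped at 2-second intervals -> 50 frames per segment.
print(cut_into_segments(260, frames_per_segment=50))
# [(0, 50), (50, 100), (100, 150), (150, 200), (200, 250), (250, 260)]
```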
The plurality of second video segments include event video segments and background video segments, where an event video segment is a second video segment in which an event occurs and a background video segment is a second video segment in which no event occurs. Referring to fig. 4, the uppermost row of rectangles in fig. 4 forms a first sample video, each rectangle represents a second video segment, the 5 labeled second video segments in the figure are event video segments of event types A, B, and C respectively, and the unlabeled rectangles are background video segments.
Step 302: the computing device obtains a first feature vector corresponding to each second video segment in the first sample video.
In this step, for the first sample video, the computing device performs feature extraction on each second video segment in the first sample video to obtain a first feature vector of each second video segment.
For each second video segment, the computing device may directly extract the features of the second video segment through the first feature extractor to obtain a first feature vector of the second video segment; or, the computing device extracts the features of each frame of image in the second video segment through the second feature extractor to obtain a first feature vector of each frame of image, determines an average vector of the first feature vector of each frame of image, and takes the average vector as the first feature vector of the second video segment. In the embodiment of the present application, the way in which the computing device determines the first feature vector corresponding to each second video segment is not specifically limited, and the first feature extractor and the second feature extractor are not specifically limited.
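The second option above (per-frame features averaged into a segment-level feature) can be sketched as follows; the feature dimension and the PyTorch-based formulation are assumptions made for illustration.

```python
import torch

def segment_feature(frame_features: torch.Tensor) -> torch.Tensor:
    """Given per-frame feature vectors of shape (num_frames, feature_dim)
    produced by a frame-level feature extractor, take their mean as the
    first feature vector of the second video segment."""
    return frame_features.mean(dim=0)

# A segment of 50 frames with 512-dimensional per-frame features (assumed sizes).
frames = torch.randn(50, 512)
print(segment_feature(frames).shape)   # torch.Size([512])
```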
Step 303: for the first feature vector of each second video segment, the computing device determines a plurality of second feature vectors corresponding to the first sample video.
The second eigenvector is the corresponding eigenvector after the first eigenvector is weighted. This step can be realized by the following steps (1) to (3), including:
(1) the computing device determines a weight for each second video segment based on the first feature vector for each second video segment.
In this step, for the first feature vector of each second video segment, the computing device may determine the weight of the second video segment through an attention mechanism network. The weight is a value between 0 and 1 and represents the probability that a third event occurs in the second video segment: the larger the weight, the greater the probability that a third event occurs in the second video segment, and the greater the probability that the time range of the second video segment is the time range in which that third event occurs. The attention mechanism network may include a convolutional neural network and at least one fully connected layer, where the output of the convolutional neural network is connected to the input of the at least one fully connected layer. Accordingly, the computing device may determine the weight of each second video segment as follows: for each second video segment, the computing device inputs the first feature vector into the convolutional neural network to obtain an intermediate feature vector, inputs the intermediate feature vector into the at least one fully connected layer, and outputs a numerical value that is a 1-dimensional scalar; the computing device then maps this value through an objective function to finally obtain the weight of the second video segment. The objective function may be any suitable function, for example a sigmoid function, and is not particularly limited in the embodiments of the application.
The convolutional neural network can be selected and modified as required, for example, the convolutional neural network can be a one-dimensional convolutional neural network, a two-dimensional convolutional neural network, or a three-dimensional convolutional neural network. In the embodiments of the present application, this is not particularly limited. The number of convolutional layers and pooling layers in the convolutional neural network can also be set and changed as required. Wherein the attention mechanism network performs attention learning on the first feature vector of each second video segment, and the process of determining the weight of the second video segment can be seen in fig. 5.
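A hedged sketch of such an attention mechanism network is shown below, using a one-dimensional convolutional neural network, fully connected layers, and a sigmoid objective function as described above; all layer sizes are illustrative assumptions, not values fixed by the embodiment.

```python
import torch
import torch.nn as nn

class AttentionWeightNet(nn.Module):
    """Sketch of the attention mechanism network: a convolutional neural
    network followed by fully connected layers whose 1-dimensional output
    is squashed by a sigmoid objective function into a weight in (0, 1).
    Layer sizes are illustrative assumptions."""

    def __init__(self, feature_dim=512, hidden_dim=128):
        super().__init__()
        # One-dimensional convolution over the first feature vector of a segment.
        self.conv = nn.Sequential(
            nn.Conv1d(1, 8, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(hidden_dim),
        )
        self.fc = nn.Sequential(
            nn.Linear(8 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),      # 1-dimensional scalar
        )

    def forward(self, first_feature_vector):               # (batch, feature_dim)
        x = self.conv(first_feature_vector.unsqueeze(1))   # (batch, 8, hidden_dim)
        score = self.fc(x.flatten(1))                      # (batch, 1)
        return torch.sigmoid(score).squeeze(-1)            # weight in (0, 1)

weights = AttentionWeightNet()(torch.randn(6, 512))   # one weight per segment
print(weights)
```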
It should be noted that, before determining the weight of each second video segment, the computing device may compare the first feature vector of each second video segment with a reference feature vector and remove the first feature vectors whose matching degree with the reference feature vector is high. Because the reference feature vector has a high matching degree with the first feature vectors of background video segments, the computing device can thereby remove the first feature vectors of background video segments, avoiding their influence on the other second video segments in which events occur, reducing the workload of training the initial model, and improving model training efficiency.
It should also be noted that each first sample video includes at least one second event, and the second events may influence one another. The computing device may therefore further instruct the attention mechanism network to determine the weights of the second video segments other than the background video segments according to the third feature vector corresponding to the second feature vector. For the first of these other second video segments, the computing device determines its weight directly through the attention mechanism network. For each subsequent second video segment among these other second video segments, when the attention mechanism network learns its first feature vector, the computing device increases, according to the third feature vector corresponding to the second feature vector of the previous second video segment, the attention paid by the attention mechanism network to the first feature vector corresponding to that third feature vector, that is, increases the weight of that first feature vector, and finally obtains the weight of each of the other second video segments. The third feature vector corresponding to a second feature vector is introduced in detail in step 304 and is not described here.
The process that the computing device learns the first feature vectors through the attention mechanism network to obtain the weight of each second video segment is the process of classifying the second video segments, the computing device gathers the feature vectors of the events of the same category together through the attention mechanism to obtain the second video segments where the events are located, and finally the time range of the events is determined according to the second video segments.
(2) The computing device determines a first confidence between the first feature vectors of any two second video segments according to the first feature vector of each second video segment.
For each second video segment, the computing device may determine a distance between the first feature vector of the second video segment and the first feature vector of any other second video segment according to the first feature vector of the second video segment and the first feature vector of any other second video segment. The computing device determines a first confidence degree between the first feature vector of the second video segment and the first feature vector of any other video segment according to the distance between the first feature vector of the second video segment and the first feature vector of any other video segment. The distance between the two first feature vectors and the first confidence coefficient are in negative correlation, and the closer the distance, the higher the first confidence coefficient, and the higher the similarity of the first feature vectors of the two second video segments. In one possible implementation, the computing device may take an inverse of a distance between the two first feature vectors as a first confidence between the two first feature vectors.
Wherein the computing device may determine the distance between the first feature vectors of any two second video segments by any method. For example, the computing device may determine the distance between the first feature vectors of any two second video segments by euclidean distance or mahalanobis distance. In the embodiments of the present application, this is not particularly limited.
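As an illustrative sketch of step (2), the pairwise first confidences can be computed as the inverse of the Euclidean distance between first feature vectors (one of the possible implementations mentioned above); the epsilon term is an assumption added to avoid division by zero.

```python
import torch

def first_confidences(first_feature_vectors: torch.Tensor) -> torch.Tensor:
    """Pairwise first confidences between the first feature vectors of the
    second video segments, taken here as the inverse of the Euclidean
    distance (smaller distance -> higher confidence)."""
    dists = torch.cdist(first_feature_vectors, first_feature_vectors)  # (N, N)
    return 1.0 / (dists + 1e-6)

conf = first_confidences(torch.randn(6, 512))
print(conf.shape)   # torch.Size([6, 6]); conf[i, j] relates segment i to segment j
```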
(3) The computing device weights, according to the weight of each second video segment, the first feature vectors whose first confidences exceed the preset threshold among the first feature vectors of each second video segment, to obtain a plurality of second feature vectors.
The computing device may select a first feature vector of which the first confidence coefficient exceeds a preset threshold from the first feature vectors of the plurality of second video segments, and weight the selected first feature vector according to the weight of the second video segment corresponding to the selected first feature vector to obtain the plurality of second feature vectors. For example, with continued reference to fig. 4, for the first second video segment in fig. 4, the computing device determines first confidence degrees between the first feature vectors of the first second video segment and the first feature vectors of other second video segments, selects a first feature vector from the plurality of first confidence degrees whose first confidence degree exceeds a preset threshold, for example, the first confidence degrees of the first feature vector of the third second video segment and the first feature vector of the first second video segment exceed a preset threshold, and then the computing device weights the first feature vector of the third second video segment according to the weight of the third second video segment; and weighting the first feature vector of the first second video segment according to the weight of the first second video segment to obtain two second feature vectors. The computing device may use an average vector of the two second feature vectors as the second feature vector for the event.
It should be noted that, after the computing device obtains the plurality of second feature vectors, the plurality of second feature vectors may be sorted according to the sequential positions of the second video segments corresponding to the second feature vectors in the first sample video, so as to obtain a second feature vector sequence.
The preset threshold may be set and changed as needed, and in the embodiment of the present application, the preset threshold is not specifically limited. For example, the preset threshold may be 0.8, 0.85, or 0.9. Referring to fig. 6, in fig. 6, the computing device learns the weight of each second video segment through the attention mechanism network according to the first feature vector of each second video segment, and finally obtains a plurality of second feature vectors corresponding to the first sample video.
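A minimal sketch of step (3) under these assumptions (a preset threshold of 0.8 and averaging of the weighted vectors, consistent with the example above) might look as follows; the function name and tensor shapes are illustrative.

```python
import torch

def second_feature_vectors(first_vecs, weights, confidences, threshold=0.8):
    """For every second video segment, the first feature vectors whose first
    confidence with it exceeds the preset threshold (including its own) are
    weighted by their segment weights and averaged into one second feature
    vector. The threshold value and the averaging are assumptions following
    the example above."""
    second_vecs = []
    for i in range(first_vecs.size(0)):
        related = confidences[i] > threshold          # segments similar to segment i
        related[i] = True                             # always keep the segment itself
        weighted = first_vecs[related] * weights[related].unsqueeze(-1)
        second_vecs.append(weighted.mean(dim=0))
    # Ordered by the position of the corresponding segments in the sample video.
    return torch.stack(second_vecs)

vecs = second_feature_vectors(torch.randn(6, 512), torch.rand(6), torch.rand(6, 6))
print(vecs.shape)   # torch.Size([6, 512])
```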
Step 304: the computing device identifies at least one third event from the first sample video according to a plurality of second feature vectors corresponding to the first sample video.
This step can be realized by the following steps (1) to (3), including:
(1) for each second feature vector, the computing device determines a second confidence between the second feature vector and each specified event based on the second feature vector.
A plurality of specified events are configured in the computing device in advance; the specified events are events of different categories and include the at least one second event labeled in the first sample video. The computing device identifies, through the initial model, the third event corresponding to each second feature vector by determining a second confidence between the second feature vector and each specified event.
In one possible implementation, for each second feature vector, the computing device may determine, from the second feature vector and the feature vector of each specified event, the distance between the second feature vector and the feature vector of each specified event; it then determines the confidence between the second feature vector and the feature vector of each specified event according to that distance and takes it as the second confidence between the second feature vector and the specified event. The distance between the second feature vector and the feature vector of a specified event is negatively correlated with the confidence: the smaller the distance, the greater the confidence.
In another possible implementation, for each second feature vector, the computing device may determine the second confidence between a third feature vector corresponding to the second feature vector and the feature vector of each specified event. Accordingly, when the current second feature vector is the first one in the second feature vector sequence of the first sample video, the computing device may input it into a recurrent neural network to obtain a third feature vector. The computing device may connect a fully connected layer behind the recurrent neural network, with the output of the recurrent neural network connected to the input of the fully connected layer; it inputs the third feature vector into the fully connected layer and compares it with the feature vector of each specified event to obtain a plurality of second confidences.
In another possible implementation, when the current second feature vector is not the first one in the sequence, the computing device inputs the current second feature vector, the third feature vector corresponding to the previous second feature vector, and the second confidence of the event corresponding to the previous second feature vector into the recurrent neural network to obtain the third feature vector corresponding to the current second feature vector. The previous second feature vector is the one immediately before the current second feature vector. The computing device then inputs this third feature vector into the fully connected layer and compares it with the feature vector of each specified event to obtain a plurality of second confidences.
It should be noted that, when the current second feature vector is the first one in the sequence, the computing device determines the plurality of second confidences directly from that second feature vector; otherwise, when determining the second confidence between the current second feature vector and the feature vector of each specified event, the computing device combines the second confidence of the event corresponding to the previous second feature vector with the third feature vector corresponding to that previous second feature vector. For example, with continued reference to fig. 4, when determining the second confidence between the second feature vector of the second video segment in which event B is located and the feature vector of each specified event, the computing device may combine the second confidence of the event corresponding to the second feature vector of the second video segment in which event A is located and the third feature vector of that segment.
The recurrent neural network may be set and changed as needed, and in the embodiment of the present application, the recurrent neural network is not particularly limited. For example, the recurrent neural network may be an LSTM (Long Short-Term Memory) network.
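A hedged sketch of this recurrent classification is given below: an LSTM produces the third feature vectors from the sequence of second feature vectors, and a fully connected layer yields a second confidence for each specified event. The number of specified events, the softmax normalization, and the layer sizes are assumptions for illustration.

```python
import torch
import torch.nn as nn

class EventRecognizer(nn.Module):
    """Sketch of the recurrent classification: an LSTM turns each second
    feature vector into a third feature vector that also depends on the
    previous second feature vectors, and a fully connected layer maps it to
    a second confidence for every specified event."""

    def __init__(self, feature_dim=512, hidden_dim=256, num_events=5):
        super().__init__()
        self.lstm = nn.LSTM(feature_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_events)

    def forward(self, second_vec_sequence):              # (batch, seq_len, feature_dim)
        third_vecs, _ = self.lstm(second_vec_sequence)   # (batch, seq_len, hidden_dim)
        logits = self.fc(third_vecs)                     # (batch, seq_len, num_events)
        return torch.softmax(logits, dim=-1)             # second confidences

confidences = EventRecognizer()(torch.randn(1, 6, 512))
# For each second feature vector, the specified event with the highest
# second confidence is taken as the identified third event.
third_events = confidences.argmax(dim=-1)
print(third_events)
```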
(2) The computing device selects, from the specified events, the specified event with the highest confidence with the second feature vector according to the second confidence between the second feature vector and each specified event.
The computing device directly selects, from the plurality of second confidences, the specified event with the highest corresponding confidence. For example, with continued reference to fig. 4, the specified events are A, B, C, D, E, and so on. The computing device determines, according to the above steps, that the third event occurring in the first and third second video segments is the same event, determines the second confidence between the second feature vector of the second video segment in which this third event is located and the feature vector of each specified event, and obtains second confidences of 0.9, 0.6, 0.7, 0.2, and 0.5, respectively; the computing device then selects the specified event A corresponding to the second confidence of 0.9.
(3) The computing device takes the selected specified event as a third event corresponding to the second feature vector.
The computing device directly takes the selected specified event as the third event corresponding to the second feature vector. For example, the computing device takes specified event A as the third event occurring in the first and third second video segments.
Step 305: and the computing equipment performs model training according to at least one third event identified from the first sample video, at least one labeled second event and a plurality of second video segments to obtain an event positioning model.
The computing device trains the initial model with the first sample video, identifies at least one third event from the first sample video, and summarizes each third event to obtain at least one third event in the first sample video. The computing device compares the identified at least one third event with the labeled at least one second event. When the event types of the at least one third event and the at least one second event are the same, the computing device determines the time range in which each third event occurs according to the start time and end time of the second video segments in which the third event is located, thereby obtaining the event positioning model. When a third event has an event type different from those of the at least one second event, the computing device back-propagates to adjust the initial model parameters and trains again until the event types of the at least one third event and the at least one second event are the same, thereby obtaining the event positioning model.
For example, the second events labeled in the first sample video are A, B, and C. When the third events obtained by training the initial model through the above steps 301-305 are also A, B, and C, the event positioning model is obtained; when the third events identified from the first sample video are A, C, and D, the computing device adjusts the initial model parameters through back propagation and retrains until the third events are A, B, and C, as shown in fig. 7.
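For illustration only, the sketch below shows one common way such weakly supervised training could be implemented: per-segment confidences are pooled into a video-level prediction, compared with the labeled second events, and the error is back-propagated to adjust the model parameters. The multi-label binary cross-entropy loss, the max pooling, and the stand-in model are assumptions, not the method prescribed by the embodiment, which only requires comparing the identified and labeled event types and back-propagating on a mismatch.

```python
import torch
import torch.nn as nn

# Minimal stand-in for the event positioning network: per-segment second
# feature vectors are mapped to second confidences over 5 specified events.
model = nn.Sequential(nn.Linear(512, 5), nn.Softmax(dim=-1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def training_step(segment_features, video_labels):
    """Hypothetical weakly supervised training step: only the video-level
    second events are labelled, so per-segment confidences are max-pooled
    into a video-level prediction, compared with the labelled events, and
    the error is back-propagated to adjust the model parameters."""
    confidences = model(segment_features)           # (num_segments, num_events)
    video_pred = confidences.max(dim=0).values      # (num_events,)
    loss = nn.functional.binary_cross_entropy(video_pred, video_labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

labels = torch.tensor([1., 1., 1., 0., 0.])         # events A, B and C are labelled
print(training_step(torch.randn(6, 512), labels))
```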
The training method of the event positioning model resolves multiple events step by step through attention and obtains the prediction result for the multiple events by summarizing the prediction result of each individual event. In addition, the event positioning model is trained as an end-to-end network, so no separate step of fusing independently processed networks needs to be executed.
It should be noted that the computing device may use the event positioning model directly after performing step 305: when a time sequence positioning instruction is obtained, the surveillance video to be positioned is positioned directly. Alternatively, after step 305, when a time sequence positioning instruction is obtained, the event positioning model is first tested before the surveillance video to be positioned is positioned, and the surveillance video is positioned only when the test succeeds. Alternatively, the event positioning model is tested as soon as it is obtained, and once the test succeeds, the surveillance video to be positioned is positioned whenever a time sequence positioning instruction is obtained, thereby improving the accuracy of time sequence positioning.
In one possible implementation, the step of testing the event localization model by the computing device may be:
the computing device acquires a second sample video, in which at least one fourth event and the time range in which each fourth event occurs are labeled; inputs the second sample video into the event positioning model, and outputs at least one fifth event of the second sample video and the time range in which each fifth event occurs; and tests the event positioning model according to the at least one fourth event and the time range of each fourth event, and the at least one fifth event and the time range of each fifth event.
The step of the computing device testing the event positioning model according to the at least one fourth event and the time range in which each fourth event occurs, and the at least one fifth event and the time range in which each fifth event occurs, may be: when the at least one fourth event matches the at least one fifth event and the time range in which each fourth event occurs matches the time range in which each fifth event occurs, the computing device determines that the event positioning model test succeeds. The computing device may determine the at least one fourth event in the second sample video and the time range in which each fourth event occurs by labeling the second sample video frame by frame.
When the at least one fourth event does not match the at least one fifth event, or the time range in which a fourth event occurs does not match the time range in which the corresponding fifth event occurs, the computing device determines that the event positioning model test fails. When the test fails, the computing device may continue to train the event positioning model on the first sample video or on a third sample video, where the third sample video is different from the first sample video and at least one sixth event is labeled in the third sample video. The process of training the event positioning model on the third sample video is similar to the process of training it on the first sample video and is not repeated here. For example, the computing device trains the event positioning model on the third sample video to obtain a retrained event positioning model and tests it on the second sample video again; if the test still fails, the event positioning model continues to be trained on the first sample video or the third sample video until the test succeeds.
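The matching test can be illustrated with the following sketch; representing events as dicts of time ranges and requiring exact equality of the labeled and output ranges are assumptions, since the embodiment only requires that the events and time ranges match.

```python
# Illustrative test of the event positioning model: the fourth (labeled) and fifth
# (output) events are represented as dicts mapping event name -> list of (start, end)
# time ranges. Exact equality of the ranges is an assumed matching criterion.
def test_event_positioning(labeled, predicted):
    if set(labeled) != set(predicted):
        return False                              # some fourth event has no matching fifth event
    for event, ranges in labeled.items():
        if sorted(ranges) != sorted(predicted[event]):
            return False                          # the occurrence time ranges do not match
    return True

fourth_events = {"D": [(1.0, 2.0), (5.0, 6.0)], "E": [(3.0, 4.0)]}
fifth_events = {"D": [(1.0, 2.0), (5.0, 6.0)], "E": [(3.0, 4.0)]}
print(test_event_positioning(fourth_events, fifth_events))   # True -> test succeeds
```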
Unlike the original positioning method that labels each frame of image, the method for training the event positioning model in the embodiment of the application cuts the first sample video into a plurality of second video segments and identifies the at least one second event labeled in the first sample video according to each second video segment. This greatly reduces the burden of calibrating each frame of image, the process is intuitive and simple, and the training is end to end.
An embodiment of the present application provides a video time sequence positioning method; referring to fig. 8, the method includes:
step 801: when the monitoring video to be positioned is subjected to time sequence positioning, the computing equipment cuts the monitoring video into a plurality of first video segments.
In this step, the manner in which the computing device cuts the surveillance video into the plurality of first video segments is similar to the manner in which the computing device cuts the first sample video into the plurality of second video segments in step 301, and is not repeated here.
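As an illustration only, cutting a video into segments by time can be sketched as follows; the fixed segment length is an assumption, since the actual cutting rule follows step 301 and is not restated here.

```python
# Hedged sketch: cut a video of known duration into consecutive segments of a fixed
# length (the 2-second length is assumed purely for illustration).
def cut_into_segments(video_duration, segment_length=2.0):
    segments, start = [], 0.0
    while start < video_duration:
        end = min(start + segment_length, video_duration)
        segments.append((start, end))           # (start time, end time) of one first video segment
        start = end
    return segments

print(cut_into_segments(7.0))   # [(0.0, 2.0), (2.0, 4.0), (4.0, 6.0), (6.0, 7.0)]
```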
Step 802: the computing device inputs each first video segment into the event positioning model to obtain a first event corresponding to each first video segment in the surveillance video.
In this step, the computing device directly inputs the plurality of first video segments into the trained event positioning model to obtain a first event corresponding to each first video segment in the surveillance video.
Step 803: for each first event in the surveillance video, the computing device takes the time range of at least one first video segment corresponding to the first event as the time range of the first event according to the at least one first video segment corresponding to the first event.
According to the first event obtained in step 802, the computing device may use the time range of at least one first video segment corresponding to the first event as the time range of the first event.
For example, referring to fig. 9, the surveillance video to be positioned in the figure is cut into a plurality of first video segments, and after the computing device inputs each first video segment into the event positioning model, two events included in the surveillance video are obtained, namely event D and event E. According to the first video segments in which event D is located, the computing device takes the start time and end time of those first video segments as the time ranges of event D; according to the first video segment in which event E is located, the start time and end time of that first video segment are taken as the time range of event E. The time ranges in which event D occurs are finally obtained as (T1, T2) and (T5, T6), and the time range in which event E occurs is (T3, T4).
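The aggregation of segment time ranges per event can be illustrated as follows; the segment boundaries and per-segment predictions are hypothetical values chosen to mirror the example of fig. 9.

```python
# Illustrative sketch of step 803: collect, for each first event, the time ranges of
# the first video segments in which it was identified. Values are hypothetical.
from collections import defaultdict

segments = [(0.0, 1.0), (1.0, 2.0), (2.0, 3.0), (3.0, 4.0), (4.0, 5.0), (5.0, 6.0)]
predicted = ["D", None, "E", None, "D", None]   # first event per segment (None: no event of interest)

event_ranges = defaultdict(list)
for (start, end), event in zip(segments, predicted):
    if event is not None:
        event_ranges[event].append((start, end))

print(dict(event_ranges))   # {'D': [(0.0, 1.0), (4.0, 5.0)], 'E': [(2.0, 3.0)]}
```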
According to the video time sequence positioning method provided by the embodiment of the application, when time sequence positioning is performed on a surveillance video to be positioned, the surveillance video is cut into a plurality of first video segments, each first video segment is input into an event positioning model to obtain the first event corresponding to each first video segment in the surveillance video, and the time range in which each first event in the surveillance video occurs is determined according to the first event corresponding to each first video segment and the time range of each first video segment. The event positioning model is trained based on a plurality of second video segments included in a first sample video and at least one second event labeled in the first sample video. Because the model only needs the plurality of second video segments and the at least one second event included in the first sample video during training, without labeling each frame of image, time sequence positioning with this model only requires identifying and positioning the event in each first video segment of the surveillance video rather than each frame of image, which shortens the duration of video time sequence positioning and improves its efficiency.
An embodiment of the present application provides a video time sequence positioning apparatus; referring to fig. 10, the apparatus includes:
the cutting module 1001 is configured to cut the surveillance video into a plurality of first video segments when the surveillance video to be positioned is subjected to time sequence positioning.
The input module 1002 is configured to input each first video segment into an event positioning model, so as to obtain a first event corresponding to each first video segment in the monitoring video, where the event positioning model is obtained by training based on a plurality of second video segments included in the first sample video and at least one second event labeled in the first sample video.
The determining module 1003 is configured to determine a time range in which each first event in the surveillance video occurs according to the first event corresponding to each first video segment and the time range of each first video segment.
In one possible implementation, the apparatus further includes:
the cutting module 1001 is further configured to cut the first sample video into a plurality of second video segments, where at least one second event is labeled in the first sample video;
the identification module is used for identifying at least one third event from the first sample video according to the first feature vectors corresponding to a plurality of second video segments included in the first sample video;
and the training module is used for carrying out model training according to at least one third event identified from the first sample video, at least one labeled second event and a plurality of second video segments to obtain an event positioning model.
In another possible implementation manner, the identification module is further configured to obtain a first feature vector corresponding to each second video segment in the first sample video; for the first feature vector of each second video segment, determining a plurality of second feature vectors corresponding to the first sample video; at least one third event is identified from the first sample video according to a plurality of second feature vectors corresponding to the first sample video.
In another possible implementation manner, the identification module is further configured to determine a weight of each second video segment according to the first feature vector of each second video segment; determining a first confidence coefficient between the first feature vectors of any two second video segments according to the first feature vector of each second video segment; and weighting the first characteristic vectors of which the first confidence degrees exceed a preset threshold in the first characteristic vectors of each second video segment according to the weight of each second video segment to obtain a plurality of second characteristic vectors.
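For illustration only, the weighting described above can be sketched as follows; using a softmax over feature norms as the weight and cosine similarity as the first confidence are assumptions, since the embodiment does not fix these computations here.

```python
# Hedged numpy sketch: derive a weight per segment from its first feature vector,
# compute pairwise first confidences, and combine the confidently related first
# feature vectors into one second feature vector per segment. The weight function
# (softmax over norms) and cosine similarity are assumed for illustration.
import numpy as np

def second_feature_vectors(first_vecs, threshold=0.5):
    first_vecs = np.asarray(first_vecs, dtype=float)          # shape (num_segments, dim)
    norms = np.linalg.norm(first_vecs, axis=1) + 1e-8
    weights = np.exp(norms) / np.exp(norms).sum()             # per-segment weight
    normalized = first_vecs / norms[:, None]
    confidences = normalized @ normalized.T                   # first confidence between any two segments
    second_vecs = []
    for i in range(len(first_vecs)):
        mask = confidences[i] > threshold                     # segments confidently related to segment i
        second_vecs.append((weights[mask][:, None] * first_vecs[mask]).sum(axis=0))
    return np.stack(second_vecs)

print(second_feature_vectors(np.random.rand(4, 8)).shape)     # (4, 8)
```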
In another possible implementation manner, the identification module is further configured to determine, for each second feature vector, a second confidence between the second feature vector and each specified event according to the second feature vector; select, from the specified events, the specified event with the highest confidence with the second feature vector according to the second confidence between the second feature vector and each specified event; and take the selected specified event as the third event corresponding to the second feature vector.
In another possible implementation manner, the apparatus further includes:
the acquisition module is used for acquiring a second sample video, and at least one fourth event and the time range of each fourth event are marked in the second sample video;
the input module 1002 is further configured to input the second sample video into the event positioning model, and output at least one fifth event of the second sample video and a time range of occurrence of each fifth event;
the test module is configured to test the event positioning model according to the at least one fourth event and the time range in which each fourth event occurs, and the at least one fifth event and the time range in which each fifth event occurs;
the input module 1002 is further configured to, when the event positioning model is successfully tested, input each first video segment into the event positioning model, so as to obtain a first event corresponding to each first video segment in the monitoring video.
In another possible implementation manner, the test module is further configured to determine that the event localization model test is successful when at least one fourth event matches with at least one fifth event, and a time range of occurrence of each fourth event matches with a time range of occurrence of each fifth event.
In another possible implementation manner, the determining module 1003 is further configured to, for each first event in the monitoring video, use a time range of at least one first video segment corresponding to the first event as a time range in which the first event occurs according to the at least one first video segment corresponding to the first event.
When performing time sequence positioning on a surveillance video to be positioned, the video time sequence positioning apparatus provided by the embodiment of the application cuts the surveillance video into a plurality of first video segments, inputs each first video segment into an event positioning model to obtain the first event corresponding to each first video segment in the surveillance video, and determines the time range in which each first event in the surveillance video occurs according to the first event corresponding to each first video segment and the time range of each first video segment. The event positioning model is trained based on a plurality of second video segments included in a first sample video and at least one second event labeled in the first sample video. Because the model only needs the plurality of second video segments and the at least one second event included in the first sample video during training, without labeling each frame of image, the apparatus only needs to identify and position the event in each first video segment of the surveillance video rather than each frame of image, which shortens the duration of video time sequence positioning and improves its efficiency.
It should be noted that the division into the above functional modules is used only as an example when the video time sequence positioning apparatus provided in the above embodiment performs time sequence positioning; in practical applications, the above functions may be assigned to different functional modules as needed, that is, the internal structure of the computing device may be divided into different functional modules to complete all or part of the functions described above. In addition, the video time sequence positioning apparatus and the video time sequence positioning method provided by the above embodiments belong to the same concept, and the specific implementation process is detailed in the method embodiments and is not repeated here.
Fig. 11 is a block diagram of a computing device 1100 according to an embodiment of the present invention. The computing device 1100 may vary considerably in configuration or performance, and may include one or more processors (CPUs) 1101 and one or more memories 1102, where the memory 1102 stores at least one instruction that is loaded and executed by the processor 1101 to implement the video time sequence positioning method provided by the above method embodiments. Of course, the computing device may also have components such as a wired or wireless network interface, a keyboard and an input/output interface for input and output, and may include other components for implementing the functions of the device, which are not described here.
In an exemplary embodiment, a computer-readable storage medium is also provided. When the instructions in the storage medium are executed by the processor 1101 of the computing device 1100, the computing device 1100 is enabled to perform the operations performed in the video time sequence positioning method in the above embodiments. For example, the computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
The above description is only for facilitating the understanding of the technical solutions of the present application by those skilled in the art, and is not intended to limit the present application. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (13)

1. A video time sequence positioning method, the method comprising:
when the time sequence positioning is carried out on the monitoring video to be positioned, the monitoring video is cut into a plurality of first video segments;
inputting each first video segment into an event positioning model to obtain a first event corresponding to each first video segment in the monitoring video, wherein the event positioning model is obtained by training based on a plurality of second video segments included in a first sample video and at least one second event labeled in the first sample video;
and determining the time range of each first event in the monitoring video according to the first event corresponding to each first video segment and the time range of each first video segment.
2. The method of claim 1, further comprising:
cutting the first sample video into a plurality of second video segments, wherein at least one second event is marked in the first sample video;
identifying at least one third event from the first sample video according to first feature vectors corresponding to a plurality of second video segments included in the first sample video;
and performing model training according to at least one third event identified from the first sample video, at least one labeled second event and the plurality of second video segments to obtain the event positioning model.
3. The method according to claim 2, wherein the identifying at least one third event from the first sample video according to the first feature vector corresponding to a plurality of second video segments included in the first sample video comprises:
acquiring a first feature vector corresponding to each second video segment in the first sample video;
for the first feature vector of each second video segment, determining a plurality of second feature vectors corresponding to the first sample video;
and identifying at least one third event from the first sample video according to a plurality of second feature vectors corresponding to the first sample video.
4. The method of claim 3, wherein the determining, for the first feature vector of each second video segment, a plurality of second feature vectors corresponding to the first sample video comprises:
determining the weight of each second video segment according to the first feature vector of each second video segment;
determining a first confidence coefficient between the first feature vectors of any two second video segments according to the first feature vector of each second video segment;
and weighting the first characteristic vectors of which the first confidence degrees exceed a preset threshold in the first characteristic vectors of each second video segment according to the weight of each second video segment to obtain a plurality of second characteristic vectors.
5. The method according to claim 3, wherein the identifying at least one third event from the first sample video according to a plurality of second feature vectors corresponding to the first sample video comprises:
for each second feature vector, determining a second confidence between the second feature vector and each specified event according to the second feature vector;
according to a second confidence degree between the second feature vector and each specified event, selecting the specified event with the highest confidence degree with the second feature vector from each specified event;
and taking the selected specified event as a third event corresponding to the second feature vector.
6. The method of claim 1, wherein before inputting each first video segment into the event positioning model to obtain the first event corresponding to each first video segment in the surveillance video, the method further comprises:
acquiring a second sample video, wherein at least one fourth event and the time range of each fourth event are marked in the second sample video;
inputting the second sample video into the event localization model, and outputting at least one fifth event of the second sample video and a time range of occurrence of each fifth event;
testing the event localization model according to the at least one fourth event and the time range of occurrence of each fourth event, and the at least one fifth event and the time range of occurrence of each fifth event;
and when the event positioning model is tested successfully, executing the step of inputting each first video clip into the event positioning model to obtain a first event corresponding to each first video clip in the monitoring video.
7. The method of claim 6, wherein the testing the event localization model according to the at least one fourth event and the time range in which each fourth event occurs, and the at least one fifth event and the time range in which each fifth event occurs comprises:
determining that the event localization model test is successful when the at least one fourth event matches the at least one fifth event and the time range of each fourth event occurrence matches the time range of each fifth event occurrence.
8. The method according to any one of claims 1-7, wherein determining the time range of each first event in the surveillance video according to the first event corresponding to each first video segment and the time range of each first video segment comprises:
and for each first event in the monitoring video, according to at least one first video segment corresponding to the first event, taking the time range of the at least one first video segment corresponding to the first event as the time range of the occurrence of the first event.
9. A video time sequence positioning apparatus, comprising:
the cutting module is used for cutting the monitoring video to be positioned into a plurality of first video segments when the monitoring video to be positioned is subjected to time sequence positioning;
the input module is used for inputting each first video segment into an event positioning model to obtain a first event corresponding to each first video segment in the monitoring video, wherein the event positioning model is obtained by training based on a plurality of second video segments included in a first sample video and at least one second event labeled in the first sample video;
and the determining module is used for determining the time range of each first event in the monitoring video according to the first event corresponding to each first video segment and the time range of each first video segment.
10. The apparatus of claim 9, further comprising:
the cutting module is further configured to cut the first sample video into a plurality of second video segments, where at least one second event is marked in the first sample video;
the identification module is used for identifying at least one third event from the first sample video according to the first feature vectors corresponding to a plurality of second video segments included in the first sample video;
and the training module is used for carrying out model training according to at least one third event identified from the first sample video, at least one labeled second event and the plurality of second video segments to obtain the event positioning model.
11. The apparatus of claim 9, further comprising:
the acquisition module is used for acquiring a second sample video, wherein at least one fourth event and the time range of each fourth event are marked in the second sample video;
the input module is further used for inputting the second sample video into the event positioning model, and outputting at least one fifth event of the second sample video and a time range of occurrence of each fifth event;
the testing module is further used for testing the event positioning model according to the at least one fourth event and the time range of each fourth event, and the at least one fifth event and the time range of each fifth event;
the input module is further configured to input each first video segment into the event positioning model when the event positioning model is successfully tested, so as to obtain a first event corresponding to each first video segment in the surveillance video.
12. A computing device, wherein the computing device comprises:
a processor and a memory, the memory having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, which is loaded and executed by the processor to implement the operations performed in the video time sequence positioning method of any one of claims 1-8.
13. A computer-readable storage medium having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, which is loaded and executed by a processor to implement the operations performed in the video time sequence positioning method of any one of claims 1-8.