CN113473124A - Information acquisition method and device, electronic equipment and storage medium - Google Patents

Information acquisition method and device, electronic equipment and storage medium Download PDF

Info

Publication number
CN113473124A
Authority
CN
China
Prior art keywords
video frame
judged
preset
loss
motion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110593819.3A
Other languages
Chinese (zh)
Other versions
CN113473124B (en)
Inventor
章浩
郭晓锋
张德兵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd filed Critical Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202110593819.3A priority Critical patent/CN113473124B/en
Publication of CN113473124A publication Critical patent/CN113473124A/en
Application granted granted Critical
Publication of CN113473124B publication Critical patent/CN113473124B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/102Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/124Quantisation
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/50Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N19/503Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal prediction
    • H04N19/51Motion estimation or motion compensation
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N21/44008Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/45Management operations performed by the client for facilitating the reception of or the interaction with the content or administrating data related to the end-user or to the client device itself, e.g. learning user preferences for recommending movies, resolving scheduling conflicts
    • H04N21/466Learning process for intelligent management, e.g. learning user preferences for recommending movies
    • H04N21/4662Learning process for intelligent management, e.g. learning user preferences for recommending movies characterized by learning algorithms
    • H04N21/4666Learning process for intelligent management, e.g. learning user preferences for recommending movies characterized by learning algorithms using neural networks, e.g. processing the feedback provided by the user

Abstract

The disclosure relates to an information acquisition method, an information acquisition apparatus, an electronic device and a storage medium, the method comprising the following steps: inputting a video frame to be judged into a first encoder to obtain an apparent feature corresponding to the video frame to be judged, and inputting motion information corresponding to the video frame to be judged into a second encoder to obtain a motion feature corresponding to the video frame to be judged; finding out, from a plurality of preset apparent features, an a priori preset apparent feature with the shortest distance to the apparent feature corresponding to the video frame to be judged, and finding out, from a plurality of preset motion features, an a priori preset motion feature with the shortest distance to the motion feature corresponding to the video frame to be judged; inputting the a priori preset apparent feature into a first decoder to obtain a predicted video frame, and inputting the a priori preset motion feature into a second decoder to obtain predicted motion information; and determining whether the video frame to be judged is an abnormal video frame based on a first error corresponding to the video frame to be judged and a second error corresponding to the video frame to be judged.

Description

Information acquisition method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of videos, and in particular, to an information obtaining method, an information obtaining apparatus, an electronic device, and a storage medium.
Background
In reviewing whether a video is compliant, it may be determined whether some of the frames in the video, i.e., video frames, are abnormal video frames, so as to determine whether the video is associated with an abnormal event. In the related art, a neural network for identifying abnormal video frames is trained in a supervised manner. For such a neural network to identify abnormal video frames with high accuracy, a large number of positive samples, that is, normal video frames, and a large number of negative samples, that is, abnormal video frames, need to be collected, and related personnel need to label them, so determining whether a video frame is an abnormal video frame is costly. In addition, the accuracy of the determination is affected by the collected negative samples: for a certain type of abnormal event, if few or no abnormal video frames related to that type of event are collected, the trained neural network may identify abnormal video frames related to that type of event with low accuracy.
Disclosure of Invention
In order to overcome the problems in the related art, the present disclosure provides an information acquisition method, an apparatus, an electronic device, and a storage medium, so as to at least solve the problems in the related art that the cost of determining whether a video frame is an abnormal video frame is high and the accuracy of determining whether the video frame is an abnormal video frame is affected by collected negative samples. The technical scheme of the disclosure is as follows:
according to a first aspect of the embodiments of the present disclosure, there is provided an information acquisition method, including:
inputting a video frame to be judged into a first encoder to obtain an apparent characteristic corresponding to the video frame to be judged, and inputting motion information corresponding to the video frame to be judged into a second encoder to obtain a motion characteristic corresponding to the video frame to be judged, wherein the first encoder and the second encoder are both trained in advance based on a training set, each video frame in the training set is a normal video frame, and the normal video frame is a video frame which is not related to an abnormal condition;
finding out, from the plurality of preset apparent features, the a priori preset apparent feature with the shortest distance to the apparent feature corresponding to the video frame to be judged, and finding out, from the plurality of preset motion features, the a priori preset motion feature with the shortest distance to the motion feature corresponding to the video frame to be judged;
inputting the prior preset apparent feature into a first decoder to obtain a predicted video frame, and inputting the prior preset motion feature into a second decoder to obtain predicted motion information, wherein the first decoder and the second decoder are both trained on the basis of the training set in advance;
determining a first error corresponding to the video frame to be judged and a second error corresponding to the video frame to be judged, and determining whether the video frame to be judged is an abnormal video frame or not based on the first error and the second error, wherein the first error indicates a difference degree between a predicted video frame and the video frame to be judged, and the second error indicates a difference degree between predicted motion information and motion information corresponding to the video frame to be judged.
According to a second aspect of the embodiments of the present disclosure, there is provided an information acquisition apparatus including:
the system comprises an acquisition module, a first encoder and a second encoder, wherein the acquisition module is configured to input a video frame to be judged into the first encoder to obtain an apparent feature corresponding to the video frame to be judged, and input motion information corresponding to the video frame to be judged into the second encoder to obtain a motion feature corresponding to the video frame to be judged, the first encoder and the second encoder are both trained in advance based on a training set, each video frame in the training set is a normal video frame, and the normal video frame is a video frame unrelated to an abnormal condition;
the searching module is configured to find, from the plurality of preset apparent features, the a priori preset apparent feature with the shortest distance to the apparent feature corresponding to the video frame to be judged, and to find, from the plurality of preset motion features, the a priori preset motion feature with the shortest distance to the motion feature corresponding to the video frame to be judged;
a decoding module configured to input the a priori preset apparent features into a first decoder to obtain a predicted video frame, and input the a priori preset motion features into a second decoder to obtain predicted motion information, wherein the first decoder and the second decoder are both trained in advance based on the training set;
the determining module is configured to determine a first error corresponding to the video frame to be determined, a second error corresponding to the video frame to be determined, and determine whether the video frame to be determined is an abnormal video frame based on the first error and the second error, wherein the first error indicates a difference degree between a predicted video frame and the video frame to be determined, and the second error indicates a difference degree between predicted motion information and motion information corresponding to the video frame to be determined.
The technical scheme provided by the embodiment of the disclosure can have the following beneficial effects:
A first error corresponding to the video frame to be judged and a second error corresponding to the video frame to be judged are determined using the first encoder, the second encoder, the first decoder and the second decoder, and whether the video frame to be judged is an abnormal video frame is determined based on these two errors. The training set used to pre-train the first encoder, the second encoder, the first decoder and the second decoder includes only one type of video frame, namely normal video frames, so related personnel do not need to label the video frames in the training set. The high cost of having related personnel label a large number of normal video frames and a large number of abnormal video frames is thereby avoided, and whether a video frame is abnormal can be determined at low cost. Meanwhile, the accuracy of determining whether a video frame is abnormal is not affected by negative samples, and the method is applicable to determining abnormal video frames related to any type of abnormal event.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
FIG. 1 is a flow diagram illustrating one embodiment of an information acquisition method in accordance with an exemplary embodiment;
FIG. 2 is a schematic flow chart of obtaining the score of a video frame to be determined;
fig. 3 is a block diagram illustrating a structure of an information acquisition apparatus according to an exemplary embodiment;
fig. 4 is a block diagram illustrating a structure of an electronic device according to an example embodiment.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
FIG. 1 is a flow diagram illustrating one embodiment of an information acquisition method in accordance with an example embodiment. The method comprises the following steps:
step 101, inputting a video frame to be judged into a first encoder to obtain an apparent feature corresponding to the video frame to be judged, and inputting motion information corresponding to the video frame to be judged into a second encoder to obtain a motion feature corresponding to the video frame to be judged.
In this disclosure, the video frame to be determined does not refer to one specific video frame in the video to which it belongs. When checking whether a video is related to an abnormal event, each of at least a portion of the video frames in the video may be taken in turn as the video frame to be determined, and steps 101-104 may be performed on it to determine whether it is an abnormal video frame.
A normal video frame refers to a video frame that is not associated with an abnormal situation. An abnormal situation is caused by an apparent feature abnormality and/or a motion feature abnormality. An abnormal video frame refers to a video frame associated with an abnormal situation; such frames typically occur in videos that violate legal requirements or videos associated with events that violate legal requirements, such as violent events. The apparent features corresponding to the video frame to be judged may include color features and contour features of objects in the video frame to be judged. The motion features corresponding to the video frame to be judged may include the motion direction and the motion speed of objects in the video frame to be judged.
In the present disclosure, the first encoder and the second encoder may each be the encoder of an encoder-decoder (Encoder-Decoder) type neural network, and the first decoder and the second decoder may each be the decoder of such a network.
In the present disclosure, the first encoder, the second encoder, the first decoder, and the second decoder are all trained in advance based on a training set, and each video frame in the training set is a normal video frame. The first encoder, the second encoder, the first decoder, and the second decoder may be pre-trained in an unsupervised manner based on a training set.
For example, in each training iteration, a normal video frame may be input into the first encoder to obtain the encoding result output by the first encoder, and this encoding result may be input into the first decoder to obtain a reconstructed video frame corresponding to the normal video frame. The loss between the reconstructed video frame and the normal video frame may be calculated, and the network parameters of the first encoder and of the first decoder may be updated based on this loss. In each training iteration, while the normal video frame is input into the first encoder, the motion information corresponding to the normal video frame may be input into the second encoder to obtain the encoding result output by the second encoder. The motion information corresponding to the normal video frame may be the motion information between the normal video frame and its associated video frame in the training set, where the associated video frame may be the video frame preceding the normal video frame in the video to which it belongs. The encoding result output by the second encoder may be input into the second decoder to obtain reconstructed motion information corresponding to the normal video frame. The loss between the reconstructed motion information and the motion information corresponding to the normal video frame may be calculated, and the network parameters of the second encoder and of the second decoder may be updated based on this loss.
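As an illustration of this reconstruction-based training, the following is a minimal sketch in PyTorch. The small convolutional networks, the Adam optimizer, the learning rate and the mean-squared-error loss are all assumptions; the patent fixes neither the architecture nor the loss form.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal stand-in networks; the patent does not fix an architecture, so small
# convolutional encoder/decoder pairs are assumed (inputs: N x C x H x W, with
# H and W divisible by 4).
def make_encoder(in_ch):
    return nn.Sequential(nn.Conv2d(in_ch, 32, 3, stride=2, padding=1), nn.ReLU(),
                         nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())

def make_decoder(out_ch):
    return nn.Sequential(nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
                         nn.ConvTranspose2d(32, out_ch, 4, stride=2, padding=1))

first_encoder, first_decoder = make_encoder(3), make_decoder(3)    # RGB frames
second_encoder, second_decoder = make_encoder(2), make_decoder(2)  # optical flow (dx, dy)
params = (list(first_encoder.parameters()) + list(first_decoder.parameters())
          + list(second_encoder.parameters()) + list(second_decoder.parameters()))
opt = torch.optim.Adam(params, lr=1e-4)  # optimizer and learning rate assumed

def train_step(normal_frame, motion_info):
    """One unsupervised step: reconstruct a normal frame and its motion information."""
    recon_frame = first_decoder(first_encoder(normal_frame))
    recon_motion = second_decoder(second_encoder(motion_info))
    # The patent does not name the reconstruction loss; MSE is assumed here.
    loss = F.mse_loss(recon_frame, normal_frame) + F.mse_loss(recon_motion, motion_info)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```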
In some embodiments, before inputting the video frame to be determined into the first encoder to obtain the apparent feature corresponding to the video frame to be determined, and inputting the motion information corresponding to the video frame to be determined into the second encoder to obtain the motion feature corresponding to the video frame to be determined, the method further includes: determining an expected output video frame in the training set and a preset number of video frames preceding the expected output video frame in the training set; inputting the preceding preset number of video frames into the first encoder to obtain a first encoding result output by the first encoder, and inputting the first encoding result into the first decoder to obtain a predicted output video frame corresponding to the expected output video frame; calculating a first loss, a second loss, a third loss and a fourth loss corresponding to the expected output video frame, wherein the first loss is the loss between the predicted output video frame and the expected output video frame; the second loss is the loss between the motion information corresponding to the predicted output video frame and the motion information corresponding to the expected output video frame, the motion information corresponding to the predicted output video frame being the motion information between the predicted output video frame and the video frame preceding the expected output video frame, and the motion information corresponding to the expected output video frame being the motion information between the expected output video frame and the video frame preceding the expected output video frame; the third loss is the difference between the edge gradient of the predicted output video frame and the edge gradient of the expected output video frame; and the fourth loss is a penalty loss between the predicted output video frame and the expected output video frame. Updating the network parameters of the first encoder and of the first decoder based on the first loss, the second loss, the third loss and the fourth loss; inputting the motion information corresponding to each video frame used for predicting motion information among the preceding preset number of video frames into the second encoder to obtain a second encoding result, and inputting the second encoding result into the second decoder to obtain predicted motion information corresponding to the expected output video frame, wherein the video frames used for predicting motion information are the video frames other than the first video frame among the preceding preset number of video frames; and calculating a fifth loss corresponding to the expected output video frame, and updating the network parameters of the second encoder and of the second decoder based on the fifth loss, wherein the fifth loss is the loss between the predicted motion information corresponding to the expected output video frame and the motion information corresponding to the expected output video frame.
In this disclosure, each training iteration may determine an expected output video frame in the training set and the preset number of video frames preceding it in the training set, where the expected output video frame and its preceding preset number of video frames come from the same video, and the preceding preset number of video frames are consecutive. The last of the preceding preset number of video frames is the video frame immediately before the expected output video frame.
For example, in one training iteration, the determined expected output video frame is the 5th video frame in a video and the preset number is 4; the preceding 4 video frames of the expected output video frame are then the first 4 video frames before the 5th video frame, that is, the 1st to 4th video frames in the video.
For each video frame used for predicting motion information among the preceding preset number of video frames of the expected output video frame, the motion information corresponding to that video frame is the motion information between it and the video frame immediately before it within the preceding preset number of video frames.
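Concretely, training samples can be assembled with a sliding window over each video, as in this sketch (a hedged illustration: the window size of 4 mirrors the example below, and `flow` is a hypothetical helper returning the motion information between two frames):

```python
# Sliding-window sample construction, following the example below: a window of
# 4 consecutive frames is used to predict the 5th ("expected output") frame.
# `frames` is a list of per-frame arrays; `flow(a, b)` is a hypothetical helper
# returning the motion information (e.g. an optical flow map) between frames.
PRESET_NUM = 4  # the "preset number" of the example; the patent does not fix it

def build_samples(frames, flow):
    samples = []
    for t in range(PRESET_NUM, len(frames)):
        window = frames[t - PRESET_NUM:t]   # frames t-4 .. t-1
        expected = frames[t]                # expected output video frame
        # Motion information for the frames used for predicting motion
        # information: every window frame except the first, paired with the
        # frame immediately before it.
        window_motion = [flow(window[i - 1], window[i]) for i in range(1, PRESET_NUM)]
        expected_motion = flow(window[-1], expected)
        samples.append((window, window_motion, expected, expected_motion))
    return samples
```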
The following illustrates the process of training the first encoder, the first decoder, the second encoder, and the second decoder at a time:
it is assumed that during a training process, the determined expected output video frame is the 5 th video frame in a video, and the 5 th video frame is simply referred to as the 5 th video frame. The preset number is 4, and the first preset number of video frames of the desired output video frame is the first 4 video frames of the 5 th video frame.
The first 4 video frames of the desired output video frame may be referred to simply as the first 4 video frames.
In this training iteration, for the first encoder and the first decoder, the first 4 video frames are input into the first encoder, the first encoder outputs a first encoding result, this encoding result is input into the first decoder, and the first decoder outputs the predicted output video frame corresponding to the 5th video frame.
The predicted output video frame corresponding to the 5th video frame may be referred to simply as the predicted 5th video frame.
The first loss, second loss, third loss and fourth loss corresponding to the 5th video frame are then calculated.
The first loss is the loss between the predicted 5th video frame and the 5th video frame.
The video frame preceding the 5th video frame is the 4th of the first 4 video frames. The second loss is the loss between the motion information corresponding to the predicted 5th video frame and the motion information corresponding to the 5th video frame, in other words, the loss between the optical flow map corresponding to the predicted 5th video frame and the optical flow map corresponding to the 5th video frame.
The motion information corresponding to the predicted 5th video frame is the motion information between the predicted 5th video frame and the 4th video frame. The motion information corresponding to the 5th video frame is the motion information between the 5th video frame and the 4th video frame.
The third loss is the difference between the edge gradient of the predicted 5th video frame and the edge gradient of the 5th video frame. The fourth loss is a penalty loss between the predicted 5th video frame and the 5th video frame.
In this training iteration, for the second encoder and the second decoder, the motion information corresponding to each video frame used for predicting motion information among the first 4 video frames is input into the second encoder, the second encoder outputs a second encoding result, this encoding result is input into the second decoder, and the second decoder outputs the predicted motion information corresponding to the 5th video frame.
The video frames used for predicting motion information are the video frames other than the first among the first 4 video frames, i.e., the 2nd, 3rd and 4th video frames. The motion information corresponding to the 2nd video frame is the motion information between the 2nd video frame and the 1st video frame, the motion information corresponding to the 3rd video frame is the motion information between the 3rd video frame and the 2nd video frame, and the motion information corresponding to the 4th video frame is the motion information between the 4th video frame and the 3rd video frame.
The fifth loss corresponding to the 5th video frame is then calculated, and the network parameters of the second encoder and the network parameters of the second decoder are updated based on the fifth loss.
The fifth loss is the loss between the predicted motion information corresponding to the 5th video frame and the motion information corresponding to the 5th video frame.
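The five losses of this example might be assembled as follows. This is a sketch under stated assumptions: the patent names the loss types but not their exact forms, so L2 distances and a Sobel-based edge gradient are assumed, and `penalty_fn` is a hypothetical stand-in for the penalty loss, which the patent does not define.

```python
import torch
import torch.nn.functional as F

# Sobel kernels for an (assumed) edge-gradient comparison.
_SOBEL_X = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)
_SOBEL_Y = _SOBEL_X.transpose(2, 3)

def edge_gradient(img):
    """Horizontal and vertical edge gradients of an (N, C, H, W) batch."""
    c = img.shape[1]
    gx = F.conv2d(img, _SOBEL_X.repeat(c, 1, 1, 1), padding=1, groups=c)
    gy = F.conv2d(img, _SOBEL_Y.repeat(c, 1, 1, 1), padding=1, groups=c)
    return torch.cat([gx, gy], dim=1)

def frame_branch_loss(pred_frame, expected_frame, pred_motion, expected_motion,
                      penalty_fn, w=(1.0, 1.0, 1.0, 1.0)):
    """First to fourth losses for the first encoder/decoder; `pred_motion` is
    the motion information between the predicted frame and the frame before the
    expected frame. Loss forms and the weights `w` are assumptions."""
    first = F.mse_loss(pred_frame, expected_frame)
    second = F.mse_loss(pred_motion, expected_motion)
    third = F.mse_loss(edge_gradient(pred_frame), edge_gradient(expected_frame))
    fourth = penalty_fn(pred_frame, expected_frame)  # hypothetical penalty loss
    return w[0] * first + w[1] * second + w[2] * third + w[3] * fourth

def motion_branch_loss(pred_motion, expected_motion):
    """Fifth loss for the second encoder/decoder."""
    return F.mse_loss(pred_motion, expected_motion)
```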
In the present disclosure, when the first encoder, the second encoder, the first decoder and the second decoder are pre-trained in this self-supervised manner, multiple video frames participate in each training iteration of the first encoder and the first decoder, and multiple pieces of motion information participate in each training iteration of the second encoder and the second decoder. Compared with a single video frame participating in each iteration, multiple video frames carry richer apparent features, so the first encoder and the first decoder can learn richer apparent features, which improves the precision of the trained first encoder and first decoder. Likewise, compared with a single piece of motion information participating in each iteration, multiple pieces of motion information carry richer motion features, so the second encoder and the second decoder can learn richer motion features, which improves the precision of the trained second encoder and second decoder.
In the disclosure, a video frame to be determined may be input into a first encoder, and an apparent feature corresponding to the video frame to be determined output by the first encoder is obtained.
In the present disclosure, the motion feature corresponding to the video frame to be determined is the motion feature of the motion information corresponding to the video frame to be determined, and this motion information is the motion information between the video frame to be determined and its associated video frame. The associated video frame may be a video frame used for optical flow estimation that precedes the video frame to be determined in the video to which it belongs; for example, the associated video frame may be the video frame immediately preceding the video frame to be determined. The motion information between the two frames can be obtained by performing optical flow estimation on them with a preset optical flow estimation algorithm.
In this disclosure, the motion information corresponding to the video frame to be determined may be input into the second encoder, so as to obtain the motion characteristic corresponding to the video frame to be determined output by the second encoder.
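For instance, the motion information could be extracted as a dense optical flow map; the patent does not name a specific algorithm, and OpenCV's Farneback estimator is assumed here as one plausible "preset optical flow estimation algorithm".

```python
import cv2

def motion_information(prev_frame, frame):
    """Motion information between a frame to be judged and its associated
    (preceding) frame, as a dense optical flow map of shape (H, W, 2)."""
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Farneback parameters below are typical defaults, not values from the patent.
    return cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
```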
Step 102, finding out the prior preset apparent feature with the shortest distance to the apparent feature corresponding to the video frame to be judged from the plurality of preset apparent features, and finding out the prior preset motion feature with the shortest distance to the motion feature corresponding to the video frame to be judged from the plurality of preset motion features.
In the present disclosure, a preset first video frame set for obtaining the plurality of preset apparent features may be acquired in advance, where each video frame in the preset first video frame set is a normal video frame. For example, a plurality of videos not associated with abnormal events are acquired in advance, so that each video frame in these videos is a normal video frame. For each acquired video, a portion of its video frames are extracted, and all video frames extracted from the plurality of videos constitute the preset first video frame set.
The apparent feature corresponding to each video frame in the preset first set of video frames may be obtained using the first encoder. The apparent features corresponding to each video frame in the preset first video frame set can be clustered by adopting a preset clustering algorithm, such as a k-means clustering algorithm, so that a plurality of apparent feature clustering results and a clustering center corresponding to each apparent feature clustering result are obtained.
For each apparent feature clustering result, the apparent feature with the highest similarity to the clustering center corresponding to the apparent feature clustering result in the apparent feature clustering results can be determined as a preset apparent feature.
When searching for the a priori preset apparent feature with the shortest distance to the apparent feature corresponding to the video frame to be judged, the preset apparent feature among the plurality of preset apparent features that has the highest similarity to that apparent feature may be taken as the a priori preset apparent feature.
In the present disclosure, a preset second video frame set for obtaining a plurality of preset motion characteristics may be acquired in advance. And presetting that each video frame in the second video frame set is a normal video frame.
The motion feature corresponding to each video frame in the preset second set of video frames may be obtained by the second encoder. The motion features corresponding to each video frame in the preset second video frame set can be clustered by adopting a preset clustering algorithm to obtain a plurality of motion feature clustering results and a clustering center corresponding to each motion feature clustering result.
For each motion feature clustering result, the motion feature with the highest similarity to the clustering center corresponding to the motion feature clustering result in the motion feature clustering results can be determined as a preset motion feature.
When searching for the a priori preset motion feature with the shortest distance to the motion feature corresponding to the video frame to be judged, the preset motion feature among the plurality of preset motion features that has the highest similarity to that motion feature may be taken as the a priori preset motion feature.
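Offline, the preset features described above might be produced with an ordinary k-means pass over the encoded features of normal frames, e.g. (a sketch; the use of scikit-learn and the cluster count are assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans

def preset_features(encoded, k=16):
    """Cluster encoded features of normal video frames and, for each cluster,
    keep the member feature most similar (closest) to the cluster center.
    `encoded` is an (n_frames, feature_dim) array; k=16 is an assumed cluster
    count, which the patent does not fix."""
    km = KMeans(n_clusters=k, n_init=10).fit(encoded)
    presets = []
    for c in range(k):
        members = encoded[km.labels_ == c]
        dists = np.linalg.norm(members - km.cluster_centers_[c], axis=1)
        presets.append(members[np.argmin(dists)])
    return np.stack(presets)
```

In the embodiment described next, the cluster centers themselves serve as the preset features, in which case `km.cluster_centers_` would be returned directly.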
In some embodiments, each preset apparent feature is a first clustering center, a first clustering center being the cluster center corresponding to an apparent feature clustering result obtained by clustering the apparent features corresponding to the video frames in the preset first video frame set; and each preset motion feature is a second clustering center, a second clustering center being the cluster center corresponding to a motion feature clustering result obtained by clustering the motion features corresponding to the video frames in the preset second video frame set. Finding the a priori preset apparent feature corresponding to the video frame to be judged from the plurality of preset apparent features includes: calculating the Euclidean distance between the apparent feature corresponding to the video frame to be judged and each preset apparent feature; and determining the preset apparent feature with the minimum Euclidean distance to the apparent feature corresponding to the video frame to be judged as the a priori preset apparent feature. Finding the a priori preset motion feature corresponding to the video frame to be judged from the plurality of preset motion features includes: calculating the Euclidean distance between the motion feature corresponding to the video frame to be judged and each preset motion feature; and determining the preset motion feature with the minimum Euclidean distance to the motion feature corresponding to the video frame to be judged as the a priori preset motion feature.
In this disclosure, a cluster center corresponding to an apparent feature clustering result obtained by clustering an apparent feature corresponding to each video frame in a preset first video frame set may be referred to as a first cluster center. The clustering center corresponding to the motion feature clustering result obtained by clustering the motion feature corresponding to each video frame in the preset second video frame set may be referred to as a second clustering center.
In the present disclosure, each first cluster center can be regarded as a preset apparent feature. Each second cluster center can be used as a preset motion characteristic. The preset apparent feature with the minimum Euclidean distance from the apparent features corresponding to the video frame to be judged in the plurality of preset apparent features can be determined as the prior preset apparent feature. The preset motion feature with the minimum Euclidean distance from the motion features corresponding to the video frame to be judged in the plurality of preset motion features can be determined as the prior preset motion feature.
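Under this embodiment the a priori lookup reduces to a nearest-center search by Euclidean distance, e.g.:

```python
import numpy as np

def nearest_preset(feature, presets):
    """Return the preset feature (cluster center) with the smallest Euclidean
    distance to the feature of the video frame to be judged."""
    dists = np.linalg.norm(presets - feature, axis=1)
    return presets[np.argmin(dists)]
```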
In the disclosure, the Euclidean distance between the a priori preset apparent feature and the apparent feature corresponding to the video frame to be judged is the smallest, so the a priori preset apparent feature can express the apparent feature corresponding to the video frame to be judged relatively accurately; likewise, the Euclidean distance between the a priori preset motion feature and the motion feature corresponding to the video frame to be judged is the smallest, so the a priori preset motion feature can express that motion feature relatively accurately. Whether the video frame to be judged is abnormal can then be determined more accurately by using the a priori preset apparent feature and the a priori preset motion feature.
Step 103, inputting the prior preset apparent characteristics into a first decoder to obtain a predicted video frame, and inputting the prior preset motion characteristics into a second decoder to obtain predicted motion information.
In the present disclosure, the a priori preset apparent features may be input into a first decoder, which outputs the predicted video frame. The a priori preset motion characteristics may be input to a second decoder, which outputs predicted motion information.
Step 104, determining a first error corresponding to the video frame to be judged and a second error corresponding to the video frame to be judged, and determining whether the video frame to be judged is an abnormal video frame based on the first error corresponding to the video frame to be judged and the second error corresponding to the video frame to be judged.
In the disclosure, the first error corresponding to the video frame to be determined indicates the degree of difference between the predicted video frame and the video frame to be determined, and the second error indicates the degree of difference between the predicted motion information and the motion information corresponding to the video frame to be determined. The loss between the predicted video frame and the video frame to be determined, such as an L1 loss or an L2 loss, may be calculated with a preset loss function and taken as the first error corresponding to the video frame to be determined. The loss between the predicted motion information and the motion information corresponding to the video frame to be determined may likewise be calculated with a preset loss function and taken as the second error corresponding to the video frame to be determined.
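For instance, taking an L2 (mean squared error) loss as the preset loss function (an assumption; the patent mentions L1 loss as an equally valid choice):

```python
import torch.nn.functional as F

def frame_errors(pred_frame, frame, pred_motion, motion):
    """First and second errors for a video frame to be judged; MSE (L2) is
    assumed here, and F.l1_loss would be an equally valid preset loss."""
    first_error = F.mse_loss(pred_frame, frame).item()
    second_error = F.mse_loss(pred_motion, motion).item()
    return first_error, second_error
```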
In this disclosure, after determining the first error corresponding to the video frame to be determined and the second error corresponding to the video frame to be determined, it may be determined whether the video frame to be determined is an abnormal video frame based on the first error corresponding to the video frame to be determined and the second error corresponding to the video frame to be determined.
For example, the video frame to be determined may be determined to be an abnormal video frame when a first error corresponding to the video frame to be determined is greater than a first error threshold and/or a second error corresponding to the video frame to be determined is greater than a second error threshold.
In the disclosure, a first error corresponding to the video frame to be judged and a second error corresponding to the video frame to be judged are determined using the first encoder, the second encoder, the first decoder and the second decoder, and whether the video frame to be judged is an abnormal video frame is determined based on these two errors. The training set used to pre-train the first encoder, the second encoder, the first decoder and the second decoder includes only one type of video frame, namely normal video frames, so related personnel do not need to label the video frames in the training set. The high cost of having related personnel label a large number of normal video frames and a large number of abnormal video frames is thereby avoided, and whether a video frame is abnormal can be determined at low cost. Meanwhile, the accuracy of determining whether a video frame is abnormal is not affected by negative samples, and the method is applicable to determining abnormal video frames related to any type of abnormal event.
In some embodiments, determining whether the video frame to be determined is an abnormal video frame based on the first error corresponding to the video frame to be determined and the second error corresponding to the video frame to be determined includes: calculating a normalized value of the first error and a normalized value of the second error; calculating a weighted sum of the normalized value of the first error and the normalized value of the second error based on a preset weight of the first error and a preset weight of the second error; determining the weighted sum as the score of the video frame to be judged; and under the condition that the score of the video frame to be judged is greater than the score threshold value, determining that the video frame to be judged is an abnormal video frame.
In the present disclosure, for each of a plurality of video frames, including the video frame to be determined, in the video to which the video frame to be determined belongs, steps 101-104 may be performed to determine whether that video frame is an abnormal video frame. A first error and a second error may be determined for each of the plurality of video frames. The first errors corresponding to the plurality of video frames are normalized to obtain a normalized value of the first error for each video frame, and the second errors are normalized likewise. The weighted sum of the normalized value of the first error corresponding to the video frame to be judged and the normalized value of the second error corresponding to the video frame to be judged may be calculated based on the preset weight of the first error and the preset weight of the second error, and this weighted sum is determined as the score of the video frame to be judged. The score of the video frame to be judged is compared with the score threshold, and the video frame to be judged is determined to be an abnormal video frame when its score is greater than the score threshold.
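A sketch of this scoring step follows; the min-max normalization over the frames of one video, the equal weights and the threshold value are assumptions, since the patent fixes none of them:

```python
import numpy as np

def anomaly_scores(first_errors, second_errors, w1=0.5, w2=0.5):
    """Per-frame scores over all judged frames of one video; min-max
    normalization and equal weights are assumptions."""
    def norm(x):
        x = np.asarray(x, dtype=float)
        span = x.max() - x.min()
        return (x - x.min()) / span if span > 0 else np.zeros_like(x)
    return w1 * norm(first_errors) + w2 * norm(second_errors)

# Usage: frames whose score exceeds the score threshold are abnormal.
scores = anomaly_scores([0.12, 0.10, 0.45], [0.03, 0.02, 0.20])
abnormal = scores > 0.8  # 0.8 is an assumed score threshold
```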
Please refer to fig. 2, which shows a flowchart of obtaining the score of the video frame to be determined.
The video frame to be judged is input into the first encoder, which outputs the apparent feature corresponding to the video frame to be judged. The motion information corresponding to the video frame to be judged is obtained through optical flow extraction and input into the second encoder, which outputs the motion feature corresponding to the video frame to be judged. The a priori preset apparent feature with the shortest distance to the apparent feature corresponding to the video frame to be judged is found among the plurality of preset apparent features, and the a priori preset motion feature with the shortest distance to the motion feature corresponding to the video frame to be judged is found among the plurality of preset motion features. The a priori preset apparent feature is input into the first decoder, which outputs the predicted video frame, and the a priori preset motion feature is input into the second decoder, which outputs the predicted motion information. The loss between the predicted video frame and the video frame to be judged is calculated and taken as the first error corresponding to the video frame to be judged, and the loss between the predicted motion information and the motion information corresponding to the video frame to be judged is calculated and taken as the second error. The normalized values of the first error and of the second error are calculated, their weighted sum is calculated based on the preset weight of the first error and the preset weight of the second error, and the weighted sum is determined as the score of the video frame to be judged.
In the disclosure, the degree of difference between the video frame to be determined and a normal video frame may be quantized into the score of the video frame to be determined, which represents that degree of difference relatively accurately; comparing this score with the score threshold therefore determines relatively accurately whether the video frame to be determined is an abnormal video frame. Normalizing the relevant data and computing with the normalized values can improve calculation precision. Meanwhile, when calculating the score, the degrees to which the first error and the second error correlate with the score representing the difference from a normal video frame are also considered: the preset weight of the first error and the preset weight of the second error participate in the calculation, so the calculated score represents the degree of difference between the video frame to be determined and a normal video frame more accurately.
Fig. 3 is a block diagram illustrating the structure of an information acquisition apparatus according to an exemplary embodiment. Referring to fig. 3, the information acquisition apparatus includes: an acquisition module 301, a search module 302, a decoding module 303 and a determination module 304.
The obtaining module 301 is configured to input a video frame to be determined into a first encoder to obtain an apparent feature corresponding to the video frame to be determined, and input motion information corresponding to the video frame to be determined into a second encoder to obtain a motion feature corresponding to the video frame to be determined, where the first encoder and the second encoder are both trained in advance based on a training set, each video frame in the training set is a normal video frame, and the normal video frame is a video frame unrelated to an abnormal condition;
the searching module 302 is configured to search for a priori preset apparent feature with a shortest distance to an apparent feature corresponding to the video frame to be judged from the plurality of preset apparent features, and search for a priori preset motion feature with a shortest distance to a motion feature corresponding to the video frame to be judged from the plurality of preset motion features;
the decoding module 303 is configured to input the a priori preset apparent features into a first decoder, resulting in a predicted video frame, and input the a priori preset motion features into a second decoder, resulting in predicted motion information, wherein the first decoder and the second decoder are both trained in advance based on the training set;
the determining module 304 is configured to determine a first error corresponding to the video frame to be determined, a second error corresponding to the video frame to be determined, and determine whether the video frame to be determined is an abnormal video frame based on the first error and the second error, where the first error indicates a degree of difference between a predicted video frame and the video frame to be determined, and the second error indicates a degree of difference between predicted motion information and motion information corresponding to the video frame to be determined.
In some embodiments, the determination module 304 is further configured to calculate a normalized value of the first error and a normalized value of the second error; calculating a weighted sum of the normalized value of the first error and the normalized value of the second error based on a preset weight of the first error and a preset weight of the second error; determining the weighted sum as the score of the video frame to be judged; and under the condition that the score of the video frame to be judged is greater than the score threshold value, determining that the video frame to be judged is an abnormal video frame.
In some embodiments, each preset apparent feature is a first clustering center, a first clustering center being the cluster center corresponding to an apparent feature clustering result obtained by clustering the apparent features corresponding to the video frames in the preset first video frame set; and each preset motion feature is a second clustering center, a second clustering center being the cluster center corresponding to a motion feature clustering result obtained by clustering the motion features corresponding to the video frames in the preset second video frame set. The search module 302 is further configured to calculate the Euclidean distance between the apparent feature corresponding to the video frame to be determined and each preset apparent feature; determine the preset apparent feature with the minimum Euclidean distance to that apparent feature as the a priori preset apparent feature; calculate the Euclidean distance between the motion feature corresponding to the video frame to be determined and each preset motion feature; and determine the preset motion feature with the minimum Euclidean distance to that motion feature as the a priori preset motion feature.
In some embodiments, the information acquisition apparatus further includes:
the training module is configured to determine an expected output video frame in the training set and a preset number of video frames before the expected output video frame in the training set before inputting a video frame to be judged into the first encoder to obtain an apparent feature corresponding to the video frame to be judged and inputting motion information corresponding to the video frame to be judged into the second encoder to obtain a motion feature corresponding to the video frame to be judged; inputting the video frames with the preset number into the first encoder to obtain a first encoding result, and inputting the first encoding result into the first decoder to obtain a predicted output video frame corresponding to the expected output video frame; calculating a first loss corresponding to the expected output video frame, a second loss corresponding to the expected output video frame, a third loss corresponding to the expected output video frame, and a fourth loss corresponding to the expected output video frame, wherein the first loss is a loss between the predicted output video frame and the expected output video frame, the second loss is a loss between motion information corresponding to the predicted output video frame and motion information corresponding to the expected output video frame, the motion information corresponding to the predicted output video frame is motion information between the predicted output video frame and a video frame previous to the expected output video frame, the motion information corresponding to the expected output video frame is motion information between the expected output video frame and a video frame previous to the expected output video frame, the third loss is a motion information between an edge gradient of the predicted output video frame and an edge gradient of the expected output video frame, and the third loss is a loss between an edge gradient of the predicted output video frame and an edge gradient of the expected output video frame The fourth loss is a penalty loss between the predicted output video frame and the desired output video frame; updating a network parameter of the first encoder and a network parameter of the first decoder based on the first loss, the second loss, the third loss, and the fourth loss; inputting motion information corresponding to a video frame, in which each of the video frames in the previous preset number of video frames is used for predicting motion information, into the second encoder to obtain a second encoding result, and inputting the second encoding result into the second decoder to obtain predicted motion information corresponding to the video frame expected to be output, wherein the video frame used for predicting motion information is a video frame, in the previous preset number of video frames, except for a first video frame; and calculating a fifth loss, and updating the network parameters of the second encoder and the network parameters of the second decoder based on the fifth loss, wherein the fifth loss is a loss between the predicted motion information corresponding to the expected output video frame and the motion information corresponding to the expected output video frame.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
Fig. 4 is a block diagram illustrating a structure of an electronic device according to an example embodiment. Referring to fig. 4, the electronic device includes a processing component 422, which further includes one or more processors, and memory resources, represented by memory 432, for storing instructions, such as application programs, that are executable by the processing component 422. The application programs stored in memory 432 may include one or more modules that each correspond to a set of instructions. Further, the processing component 422 is configured to execute instructions to perform the above-described methods.
The electronic device may also include a power component 426 configured to perform power management of the electronic device, a wired or wireless network interface 450 configured to connect the electronic device to a network, and an input/output (I/O) interface 458. The electronic device may operate based on an operating system stored in memory 432, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
In an exemplary embodiment, a storage medium comprising instructions, such as a memory comprising instructions, executable by an electronic device to perform the above method is also provided. Alternatively, the computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
In an exemplary embodiment, the present application further provides a computer program product comprising computer readable code which, when run on an electronic device, causes the electronic device to perform the above-mentioned information acquisition method.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.

Claims (10)

1. An information acquisition method, characterized in that the method comprises:
inputting a video frame to be judged into a first encoder to obtain an apparent feature corresponding to the video frame to be judged, and inputting motion information corresponding to the video frame to be judged into a second encoder to obtain a motion feature corresponding to the video frame to be judged, wherein the first encoder and the second encoder are both trained in advance based on a training set, each video frame in the training set is a normal video frame, and the normal video frame is a video frame which is not related to an abnormal condition;
finding out, from a plurality of preset apparent features, a prior preset apparent feature having the shortest distance to the apparent feature corresponding to the video frame to be judged, and finding out, from a plurality of preset motion features, a prior preset motion feature having the shortest distance to the motion feature corresponding to the video frame to be judged;
inputting the prior preset apparent feature into a first decoder to obtain a predicted video frame, and inputting the prior preset motion feature into a second decoder to obtain predicted motion information, wherein the first decoder and the second decoder are both trained on the basis of the training set in advance;
determining a first error corresponding to the video frame to be judged and a second error corresponding to the video frame to be judged, and determining whether the video frame to be judged is an abnormal video frame or not based on the first error and the second error, wherein the first error indicates a difference degree between the predicted video frame and the video frame to be judged, and the second error indicates a difference degree between the predicted motion information and the motion information corresponding to the video frame to be judged.
2. The method of claim 1, wherein determining whether the video frame to be determined is an abnormal video frame based on the first error and the second error comprises:
calculating a normalized value of the first error and a normalized value of the second error;
calculating a weighted sum of the normalized value of the first error and the normalized value of the second error based on a preset weight of the first error and a preset weight of the second error;
determining the weighted sum as the score of the video frame to be judged;
and under the condition that the score of the video frame to be judged is greater than a preset score threshold, determining that the video frame to be judged is an abnormal video frame.
3. The method according to claim 1, wherein the preset apparent feature is a first clustering center, the first clustering center being a clustering center corresponding to an apparent feature clustering result obtained by clustering apparent features corresponding to video frames in a preset first video frame set; and the preset motion feature is a second clustering center, the second clustering center being a clustering center corresponding to a motion feature clustering result obtained by clustering motion features corresponding to video frames in a preset second video frame set;
finding out the prior preset apparent feature corresponding to the video frame to be judged from the plurality of preset apparent features comprises:
calculating the Euclidean distance between the apparent feature corresponding to the video frame to be judged and each preset apparent feature; and
determining the preset apparent feature with the minimum Euclidean distance to the apparent feature corresponding to the video frame to be judged as the prior preset apparent feature; and
finding out the prior preset motion feature corresponding to the video frame to be judged from the plurality of preset motion features comprises:
calculating the Euclidean distance between the motion feature corresponding to the video frame to be judged and each preset motion feature; and
determining the preset motion feature with the minimum Euclidean distance to the motion feature corresponding to the video frame to be judged as the prior preset motion feature.
4. The method according to claim 1, wherein before inputting a video frame to be judged into a first encoder to obtain an apparent feature corresponding to the video frame to be judged, and inputting motion information corresponding to the video frame to be judged into a second encoder to obtain a motion feature corresponding to the video frame to be judged, the method further comprises:
determining an expected output video frame in the training set and a preset number of video frames before the expected output video frame in the training set;
inputting the preset number of video frames into the first encoder to obtain a first encoding result, and inputting the first encoding result into the first decoder to obtain a predicted output video frame corresponding to the expected output video frame;
calculating a first loss corresponding to the expected output video frame, a second loss corresponding to the expected output video frame, a third loss corresponding to the expected output video frame, and a fourth loss corresponding to the expected output video frame, wherein the first loss is a loss between the predicted output video frame and the expected output video frame; the second loss is a loss between motion information corresponding to the predicted output video frame and motion information corresponding to the expected output video frame, the motion information corresponding to the predicted output video frame being motion information between the predicted output video frame and a video frame previous to the expected output video frame, and the motion information corresponding to the expected output video frame being motion information between the expected output video frame and the video frame previous to the expected output video frame; the third loss is a loss between an edge gradient of the predicted output video frame and an edge gradient of the expected output video frame; and the fourth loss is a penalty loss between the predicted output video frame and the expected output video frame;
updating a network parameter of the first encoder and a network parameter of the first decoder based on the first loss, the second loss, the third loss, and the fourth loss;
inputting motion information corresponding to each video frame used for predicting motion information into the second encoder to obtain a second encoding result, and inputting the second encoding result into the second decoder to obtain predicted motion information corresponding to the expected output video frame, wherein the video frames used for predicting motion information are the video frames, among the preset number of video frames, other than the first video frame;
and calculating a fifth loss, and updating the network parameters of the second encoder and the network parameters of the second decoder based on the fifth loss, wherein the fifth loss is a loss between the predicted motion information corresponding to the expected output video frame and the motion information corresponding to the expected output video frame.
5. An information acquisition apparatus, characterized in that the apparatus comprises:
an acquisition module configured to input a video frame to be judged into a first encoder to obtain an apparent feature corresponding to the video frame to be judged, and to input motion information corresponding to the video frame to be judged into a second encoder to obtain a motion feature corresponding to the video frame to be judged, wherein the first encoder and the second encoder are both trained in advance on the basis of a training set, each video frame in the training set is a normal video frame, and a normal video frame is a video frame which is not related to an abnormal condition;
a searching module configured to find out, from a plurality of preset apparent features, a prior preset apparent feature having the shortest distance to the apparent feature corresponding to the video frame to be judged, and to find out, from a plurality of preset motion features, a prior preset motion feature having the shortest distance to the motion feature corresponding to the video frame to be judged;
a decoding module configured to input the prior preset apparent feature into a first decoder to obtain a predicted video frame, and to input the prior preset motion feature into a second decoder to obtain predicted motion information; and
a determining module configured to determine a first error corresponding to the video frame to be judged and a second error corresponding to the video frame to be judged, and to determine, based on the first error and the second error, whether the video frame to be judged is an abnormal video frame, wherein the first error indicates a degree of difference between the predicted video frame and the video frame to be judged, and the second error indicates a degree of difference between the predicted motion information and the motion information corresponding to the video frame to be judged.
6. The apparatus of claim 5, wherein the determining module is further configured to: calculate a normalized value of the first error and a normalized value of the second error; calculate a weighted sum of the normalized value of the first error and the normalized value of the second error based on a preset weight of the first error and a preset weight of the second error; determine the weighted sum as the score of the video frame to be judged; and, under the condition that the score of the video frame to be judged is greater than a preset score threshold, determine that the video frame to be judged is an abnormal video frame.
7. The apparatus according to claim 5, wherein the preset apparent feature is a first clustering center, the first clustering center being a clustering center corresponding to an apparent feature clustering result obtained by clustering apparent features corresponding to video frames in a preset first video frame set; and the preset motion feature is a second clustering center, the second clustering center being a clustering center corresponding to a motion feature clustering result obtained by clustering motion features corresponding to video frames in a preset second video frame set; and the searching module is further configured to: calculate the Euclidean distance between the apparent feature corresponding to the video frame to be judged and each preset apparent feature; determine the preset apparent feature with the minimum Euclidean distance to the apparent feature corresponding to the video frame to be judged as the prior preset apparent feature; calculate the Euclidean distance between the motion feature corresponding to the video frame to be judged and each preset motion feature; and determine the preset motion feature with the minimum Euclidean distance to the motion feature corresponding to the video frame to be judged as the prior preset motion feature.
8. The apparatus of claim 5, further comprising:
the training module is configured to: before a video frame to be judged is input into a first encoder to obtain an apparent feature corresponding to the video frame to be judged and motion information corresponding to the video frame to be judged is input into a second encoder to obtain a motion feature corresponding to the video frame to be judged, determine an expected output video frame in the training set and a preset number of video frames preceding the expected output video frame in the training set; input the preset number of video frames into the first encoder to obtain a first encoding result, and input the first encoding result into the first decoder to obtain a predicted output video frame corresponding to the expected output video frame; calculate a first loss, a second loss, a third loss, and a fourth loss corresponding to the expected output video frame, wherein the first loss is a loss between the predicted output video frame and the expected output video frame; the second loss is a loss between motion information corresponding to the predicted output video frame and motion information corresponding to the expected output video frame, the motion information corresponding to the predicted output video frame being motion information between the predicted output video frame and a video frame previous to the expected output video frame, and the motion information corresponding to the expected output video frame being motion information between the expected output video frame and the video frame previous to the expected output video frame; the third loss is a loss between an edge gradient of the predicted output video frame and an edge gradient of the expected output video frame; and the fourth loss is a penalty loss between the predicted output video frame and the expected output video frame; update network parameters of the first encoder and network parameters of the first decoder based on the first loss, the second loss, the third loss, and the fourth loss; input motion information corresponding to each video frame used for predicting motion information into the second encoder to obtain a second encoding result, and input the second encoding result into the second decoder to obtain predicted motion information corresponding to the expected output video frame, wherein the video frames used for predicting motion information are the video frames, among the preset number of video frames, other than the first video frame; and calculate a fifth loss, and update network parameters of the second encoder and network parameters of the second decoder based on the fifth loss, wherein the fifth loss is a loss between the predicted motion information corresponding to the expected output video frame and the motion information corresponding to the expected output video frame.
9. An electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the method of any one of claims 1 to 4.
10. A computer-readable storage medium having stored therein instructions which, when executed by a processor of an electronic device, enable the electronic device to perform the method of any of claims 1-4.
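Read together, claims 1 and 2 decide abnormality from a score that is a weighted sum of the normalized first and second errors, compared against a score threshold. As a hedged illustration only — the claims fix neither the normalization scheme, the weights, nor the threshold, so the min-max normalization, the 0.5 weights, and the 0.5 threshold below are assumptions, and anomaly_score is a hypothetical name — a minimal sketch:

def anomaly_score(first_error, second_error, first_range, second_range,
                  w1=0.5, w2=0.5):
    # Min-max normalization is an assumption; the claims only say "normalized value".
    f_min, f_max = first_range
    s_min, s_max = second_range
    n1 = (first_error - f_min) / (f_max - f_min + 1e-8)
    n2 = (second_error - s_min) / (s_max - s_min + 1e-8)
    return w1 * n1 + w2 * n2  # weighted sum = score of the video frame to be judged

# Hypothetical error ranges gathered from normal videos, and an assumed threshold.
score = anomaly_score(0.42, 0.35, first_range=(0.1, 0.6), second_range=(0.05, 0.5))
is_abnormal = score > 0.5  # judged abnormal when the score exceeds the threshold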
CN202110593819.3A 2021-05-28 2021-05-28 Information acquisition method, device, electronic equipment and storage medium Active CN113473124B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110593819.3A CN113473124B (en) 2021-05-28 2021-05-28 Information acquisition method, device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110593819.3A CN113473124B (en) 2021-05-28 2021-05-28 Information acquisition method, device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113473124A true CN113473124A (en) 2021-10-01
CN113473124B CN113473124B (en) 2024-02-06

Family

ID=77871719

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110593819.3A Active CN113473124B (en) 2021-05-28 2021-05-28 Information acquisition method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113473124B (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106033548A (en) * 2015-03-13 2016-10-19 中国科学院西安光学精密机械研究所 Crowd abnormity detection method based on improved dictionary learning
CN110719487A (en) * 2018-07-13 2020-01-21 深圳地平线机器人科技有限公司 Video prediction method and device, electronic equipment and vehicle
CN109711344A (en) * 2018-12-27 2019-05-03 东北大学 A kind of intelligentized specific exceptions behavioral value method in front end
CN110210383A (en) * 2019-05-31 2019-09-06 北京工业大学 A kind of basketball video Context event recognition methods of fusional movement mode and key visual information
CN112183153A (en) * 2019-07-01 2021-01-05 中国移动通信集团浙江有限公司 Object behavior detection method and device based on video analysis
CN111325137A (en) * 2020-02-18 2020-06-23 上海东普信息科技有限公司 Violence sorting detection method, device, equipment and storage medium
CN111400547A (en) * 2020-03-05 2020-07-10 西北工业大学 Human-computer cooperation video anomaly detection method
CN111402237A (en) * 2020-03-17 2020-07-10 山东大学 Video image anomaly detection method and system based on space-time cascade self-encoder
CN112016500A (en) * 2020-09-04 2020-12-01 山东大学 Group abnormal behavior identification method and system based on multi-scale time information fusion
CN112418032A (en) * 2020-11-11 2021-02-26 北京城市系统工程研究中心 Human behavior recognition method and device, electronic equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
SUN HAO; HUANG ZHANGCAN: "Research on Highway Incident Detection Based on Deep Convolutional Networks", Journal of Wuhan University of Technology (Information & Management Engineering), no. 06 *
ZHANG JUNYANG; XIE WEIXIN; ZHI KELIN: "Crowd Abnormal Behavior Detection Based on Motion Foreground Effect Map Features", Signal Processing, no. 03 *

Also Published As

Publication number Publication date
CN113473124B (en) 2024-02-06


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant