CN113473124B - Information acquisition method, device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN113473124B
CN113473124B
Authority
CN
China
Prior art keywords
video frame
judged
preset
motion
loss
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110593819.3A
Other languages
Chinese (zh)
Other versions
CN113473124A (en)
Inventor
章浩
郭晓锋
张德兵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd filed Critical Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202110593819.3A priority Critical patent/CN113473124B/en
Publication of CN113473124A publication Critical patent/CN113473124A/en
Application granted granted Critical
Publication of CN113473124B publication Critical patent/CN113473124B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/102 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/124 Quantisation
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/50 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N19/503 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal prediction
    • H04N19/51 Motion estimation or motion compensation
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44008 Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/45 Management operations performed by the client for facilitating the reception of or the interaction with the content or administrating data related to the end-user or to the client device itself, e.g. learning user preferences for recommending movies, resolving scheduling conflicts
    • H04N21/466 Learning process for intelligent management, e.g. learning user preferences for recommending movies
    • H04N21/4662 Learning process for intelligent management characterized by learning algorithms
    • H04N21/4666 Learning process for intelligent management using neural networks, e.g. processing the feedback provided by the user

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The disclosure relates to an information acquisition method and apparatus, an electronic device, and a storage medium. The method includes the following steps: inputting a video frame to be judged into a first encoder to obtain the apparent feature corresponding to the video frame to be judged, and inputting the motion information corresponding to the video frame to be judged into a second encoder to obtain the motion feature corresponding to the video frame to be judged; finding, among a plurality of preset apparent features, the a priori preset apparent feature at the shortest distance from the apparent feature corresponding to the video frame to be judged, and finding, among a plurality of preset motion features, the a priori preset motion feature at the shortest distance from the motion feature corresponding to the video frame to be judged; inputting the a priori preset apparent feature into a first decoder to obtain a predicted video frame, and inputting the a priori preset motion feature into a second decoder to obtain predicted motion information; and determining whether the video frame to be judged is an abnormal video frame based on a first error and a second error corresponding to the video frame to be judged.

Description

Information acquisition method, device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of video, and in particular, to an information acquisition method, an information acquisition device, an electronic device, and a storage medium.
Background
When auditing whether a video is compliant, some of its frames may be examined to determine whether they are abnormal video frames, and thus whether the video is associated with an abnormal event. In the related art, a neural network for identifying abnormal video frames is trained in a supervised manner. To achieve high accuracy, a large number of positive samples (normal video frames) and a large number of negative samples (abnormal video frames) must be collected, and all of them must be annotated by related personnel, so the cost of determining whether a video frame is an abnormal video frame is high. Furthermore, the accuracy of that determination depends on the collected negative samples: for a given type of abnormal event, if few or no abnormal video frames related to that type are collected, the trained neural network identifies abnormal video frames related to that type of abnormal event with low accuracy.
Disclosure of Invention
To overcome the problems in the related art, the present disclosure provides an information acquisition method, apparatus, electronic device, and storage medium, so as to at least solve the related-art problems that determining whether a video frame is an abnormal video frame is costly and that the accuracy of this determination is affected by the collected negative samples. The technical scheme of the present disclosure is as follows:
according to a first aspect of an embodiment of the present disclosure, there is provided an information acquisition method, including:
inputting a video frame to be judged into a first encoder to obtain apparent characteristics corresponding to the video frame to be judged, and inputting motion information corresponding to the video frame to be judged into a second encoder to obtain motion characteristics corresponding to the video frame to be judged, wherein the first encoder and the second encoder are trained in advance based on a training set, each video frame in the training set is a normal video frame, and the normal video frame is a video frame which is not associated with abnormal conditions;
searching a priori preset apparent characteristic with the shortest apparent characteristic distance corresponding to the video frame to be judged from a plurality of preset apparent characteristics, and searching a priori preset motion characteristic with the shortest motion characteristic distance corresponding to the video frame to be judged from a plurality of preset motion characteristics;
Inputting the prior preset apparent features into a first decoder to obtain a predicted video frame, and inputting the prior preset motion features into a second decoder to obtain predicted motion information, wherein the first decoder and the second decoder are trained in advance based on the training set;
determining a first error corresponding to the video frame to be judged, a second error corresponding to the video frame to be judged, and determining whether the video frame to be judged is an abnormal video frame or not based on the first error and the second error, wherein the first error indicates the difference degree between a predicted video frame and the video frame to be judged, and the second error indicates the difference degree between predicted motion information and motion information corresponding to the video frame to be judged.
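The decision step of the first aspect can be sketched as follows. Note that this is purely an illustrative assumption: the patent does not specify how the first and second errors are combined or what threshold is used, so the weighted sum and the numeric defaults below are hypothetical.

```python
# Illustrative sketch of the final decision: combine the frame reconstruction
# error (first error) and the motion-information reconstruction error (second
# error) into one anomaly score and compare it with a threshold. The weighting
# scheme and the default values are assumptions, not taken from the patent.

def anomaly_score(first_error: float, second_error: float,
                  appearance_weight: float = 0.5) -> float:
    """Weighted combination of appearance and motion reconstruction errors."""
    return appearance_weight * first_error + (1.0 - appearance_weight) * second_error

def is_abnormal_frame(first_error: float, second_error: float,
                      threshold: float = 0.5) -> bool:
    """Flag the frame as abnormal when the combined score exceeds the threshold."""
    return anomaly_score(first_error, second_error) > threshold
```

A large reconstruction error on either branch (appearance or motion) pushes the score up, which matches the intuition of the method: features of normal frames are reconstructed well, features of abnormal frames are not.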
According to a second aspect of the embodiments of the present disclosure, there is provided an information acquisition apparatus including:
the acquisition module is configured to input a video frame to be judged into the first encoder to obtain apparent characteristics corresponding to the video frame to be judged, and input motion information corresponding to the video frame to be judged into the second encoder to obtain motion characteristics corresponding to the video frame to be judged, wherein the first encoder and the second encoder are trained in advance based on a training set, each video frame in the training set is a normal video frame, and the normal video frame is a video frame which is not associated with abnormal conditions;
The searching module is configured to search out a priori preset apparent characteristic with the shortest apparent characteristic distance corresponding to the video frame to be judged from a plurality of preset apparent characteristics, and search out a priori preset motion characteristic with the shortest motion characteristic distance corresponding to the video frame to be judged from a plurality of preset motion characteristics;
the decoding module is configured to input the priori preset apparent features into a first decoder to obtain a predicted video frame, and input the priori preset motion features into a second decoder to obtain predicted motion information, wherein the first decoder and the second decoder are trained in advance based on the training set;
the determining module is configured to determine a first error corresponding to the video frame to be judged and a second error corresponding to the video frame to be judged, and to determine whether the video frame to be judged is an abnormal video frame based on the first error and the second error, wherein the first error indicates the degree of difference between the predicted video frame and the video frame to be judged, and the second error indicates the degree of difference between the predicted motion information and the motion information corresponding to the video frame to be judged.
The technical scheme provided by the embodiment of the disclosure can comprise the following beneficial effects:
Whether the video frame to be judged is an abnormal video frame is determined by using the first encoder, the second encoder, the first decoder and the second decoder to obtain the first error and the second error corresponding to the video frame to be judged. The training set used to pre-train the first encoder, the second encoder, the first decoder and the second decoder contains only one type of video frame, namely normal video frames, so the video frames in the training set do not need to be annotated by related personnel. This avoids the high cost of having related personnel annotate a large number of normal video frames and a large number of abnormal video frames, so determining whether a video frame is an abnormal video frame is achieved at low cost. Meanwhile, the accuracy of determining whether a video frame is an abnormal video frame is not affected by negative samples, and the method is suitable for detecting abnormal video frames related to any type of abnormal event.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure.
FIG. 1 is a flowchart of an information acquisition method according to an exemplary embodiment;
FIG. 2 is a schematic flow chart of obtaining a score of a video to be judged;
fig. 3 is a block diagram showing a structure of an information acquisition apparatus according to an exemplary embodiment;
fig. 4 is a block diagram of an electronic device, according to an example embodiment.
Detailed Description
In order to enable those skilled in the art to better understand the technical solutions of the present disclosure, the technical solutions of the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the foregoing figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the disclosure described herein may be capable of operation in sequences other than those illustrated or described herein. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present disclosure as detailed in the accompanying claims.
Fig. 1 is a flow chart illustrating one embodiment of an information acquisition method according to an exemplary embodiment. The method comprises the following steps:
step 101, inputting the video frame to be judged into a first encoder to obtain the apparent characteristic corresponding to the video frame to be judged, and inputting the motion information corresponding to the video frame to be judged into a second encoder to obtain the motion characteristic corresponding to the video frame to be judged.
In the present disclosure, the video frame to be judged does not refer to one particular video frame in the video to which it belongs. When checking whether a video is related to an abnormal event, each of at least some of its video frames may be taken as a video frame to be judged, and steps 101-104 may be performed on each such frame.
A normal video frame is a video frame that is not associated with an abnormal situation. An abnormal situation arises from an abnormal apparent feature and/or an abnormal motion feature. An abnormal video frame is a video frame associated with an abnormal situation; such frames typically occur in videos that violate laws and regulations or that relate to events violating laws and regulations, such as violent events. The apparent feature corresponding to the video frame to be judged may include the color features and the contour features of objects in the frame. The motion feature corresponding to the video frame to be judged may include the motion direction and the motion speed of objects in the frame.
In the present disclosure, the first Encoder, the second Encoder may be an Encoder in an Encoder-Decoder (Encoder-Decoder) type neural network. The first Decoder, the second Decoder may be a Decoder in a neural network of the encoder-Decoder type.
In the present disclosure, the first encoder, the second encoder, the first decoder, and the second decoder are all trained in advance based on a training set, and each video frame in the training set is a normal video frame. The first encoder, the second encoder, the first decoder, the second decoder may be pre-trained in a self-supervised manner based on the training set.
For example, in each training step, a normal video frame may be input into the first encoder to obtain the encoding result output by the first encoder, and that encoding result may be input into the first decoder to obtain a reconstructed video frame corresponding to the normal video frame. The loss between the reconstructed video frame and the normal video frame may be calculated, and the network parameters of the first encoder and the first decoder updated based on this loss. In the same training step, the motion information corresponding to the normal video frame may be input into the second encoder to obtain the encoding result output by the second encoder. This motion information may be the motion information between the normal video frame and its associated video frame in the training set, where the associated video frame may be the previous video frame in the video to which the normal video frame belongs. The encoding result output by the second encoder may be input into the second decoder to obtain reconstructed motion information corresponding to the normal video frame. The loss between the reconstructed motion information and the motion information corresponding to the normal video frame may be calculated, and the network parameters of the second encoder and the second decoder updated based on this loss.
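The structure of one such self-supervised training step can be sketched as below. The identity "networks" are placeholders so the control flow runs end to end; the patent does not specify the encoder/decoder architectures, and a real implementation would backpropagate the two losses to update the network parameters.

```python
import numpy as np

# Structural sketch of one self-supervised training step: the appearance
# branch (first encoder/decoder) reconstructs a normal frame, the motion
# branch (second encoder/decoder) reconstructs its motion information.
# enc1/dec1/enc2/dec2 are placeholder callables, not the patent's networks.

def mse_loss(a: np.ndarray, b: np.ndarray) -> float:
    """Mean squared error, used here as the reconstruction loss."""
    return float(np.mean((a - b) ** 2))

def train_step(frame, motion, enc1, dec1, enc2, dec2):
    # Appearance branch: encode and decode the normal video frame.
    reconstructed_frame = dec1(enc1(frame))
    appearance_loss = mse_loss(reconstructed_frame, frame)
    # Motion branch: encode and decode the frame's motion information
    # (e.g. the optical flow to its associated previous frame).
    reconstructed_motion = dec2(enc2(motion))
    motion_loss = mse_loss(reconstructed_motion, motion)
    # A real implementation would update the encoder/decoder parameters
    # from these two losses here.
    return appearance_loss, motion_loss

identity = lambda x: x              # stand-in for trained networks
frame = np.random.rand(8, 8)        # toy grayscale frame
motion = np.random.rand(8, 8, 2)    # toy two-channel flow field
a_loss, m_loss = train_step(frame, motion, identity, identity, identity, identity)
```

With identity networks both reconstructions are perfect, so both losses are zero; with real networks the losses drive the parameter updates described above.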
In some embodiments, before the video frame to be judged is input into the first encoder to obtain its apparent feature and the corresponding motion information is input into the second encoder to obtain its motion feature, the method further includes: determining an expected output video frame in the training set and a preset number of video frames preceding it in the training set; inputting the preset number of video frames into the first encoder to obtain a first encoding result output by the first encoder, and inputting the first encoding result into the first decoder to obtain a predicted output video frame corresponding to the expected output video frame; and calculating a first loss, a second loss, a third loss and a fourth loss corresponding to the expected output video frame. The first loss is the loss between the predicted output video frame and the expected output video frame. The second loss is the loss between the motion information corresponding to the predicted output video frame and the motion information corresponding to the expected output video frame, where the motion information corresponding to the predicted output video frame is the motion information between the predicted output video frame and the video frame preceding the expected output video frame, and the motion information corresponding to the expected output video frame is the motion information between the expected output video frame and the video frame preceding it. The third loss is the difference between the edge gradient of the predicted output video frame and the edge gradient of the expected output video frame. The fourth loss is the adversarial loss between the predicted output video frame and the expected output video frame. The network parameters of the first encoder and the first decoder are updated based on the first loss, the second loss, the third loss and the fourth loss. The motion information corresponding to each video frame used for predicting motion information among the preset number of video frames is input into the second encoder to obtain a second encoding result, and the second encoding result is input into the second decoder to obtain predicted motion information corresponding to the expected output video frame; the video frames used for predicting motion information are all of the preset number of video frames except the first. A fifth loss corresponding to the expected output video frame is calculated, and the network parameters of the second encoder and the second decoder are updated based on it; the fifth loss is the loss between the predicted motion information corresponding to the expected output video frame and the motion information corresponding to the expected output video frame.
In the present disclosure, each training step may determine an expected output video frame in the training set and the preset number of video frames preceding it in the training set. The expected output video frame and the preceding frames come from the same video, and the preceding frames are a run of consecutive video frames. The last of the preset number of preceding video frames is the video frame immediately before the expected output video frame.
For example, in one training step the determined expected output video frame is the 5th video frame of a video and the preset number is 4; the preceding preset number of video frames are then the first 4 video frames of that video, i.e., the 1st to 4th video frames.
For each video frame used for predicting motion information among the preset number of video frames preceding the expected output video frame, the motion information corresponding to that frame is the motion information between it and the video frame immediately preceding it within the preset number of frames.
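That pairing rule can be stated compactly in code. This is a trivial hypothetical helper (the name `motion_pairs` is not from the patent), shown only to pin down which frame pairs supply motion information to the second encoder.

```python
# For a window of num_frames consecutive frames indexed 0..num_frames-1,
# every frame except the first is "used for predicting motion information",
# and frame i's motion information is the flow between frame i and frame i-1.

def motion_pairs(num_frames: int):
    """Index pairs (i, i-1) whose optical flow feeds the second encoder."""
    return [(i, i - 1) for i in range(1, num_frames)]
```

For the 4-frame example used below, this yields the pairs (2nd, 1st), (3rd, 2nd) and (4th, 3rd) in zero-based form.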
The following illustrates a process of training the first encoder, the first decoder, the second encoder, the second decoder once:
it is assumed that during one training, the determined desired output video frame is the 5 th video frame in one video, which is simply referred to as the 5 th video frame. The preset number is 4, and the previous preset number of video frames of the expected output video frame is the previous 4 video frames of the 5 th video frame.
The first 4 video frames of the desired output video frame may be referred to simply as the first 4 video frames.
In this training step, for the first encoder and the first decoder, the first 4 video frames are input into the first encoder, which outputs a first encoding result; the first encoding result is input into the first decoder, which outputs the predicted output video frame corresponding to the 5th video frame.
The predicted output video frame corresponding to the 5th video frame may be referred to simply as the predicted 5th video frame.
The first loss, second loss, third loss and fourth loss corresponding to the 5th video frame are then calculated.
The first loss is the loss between the predicted 5th video frame and the 5th video frame.
The video frame preceding the 5th video frame is the 4th of the first 4 video frames. The second loss is the loss between the motion information corresponding to the predicted 5th video frame and the motion information corresponding to the 5th video frame; in other words, it is the loss between the optical-flow map corresponding to the predicted 5th video frame and the optical-flow map corresponding to the 5th video frame.
The motion information corresponding to the predicted 5th video frame is the motion information between the predicted 5th video frame and the 4th video frame. The motion information corresponding to the 5th video frame is the motion information between the 5th video frame and the 4th video frame.
The third loss is the difference between the edge gradient of the predicted 5th video frame and the edge gradient of the 5th video frame. The fourth loss is the adversarial loss between the predicted 5th video frame and the 5th video frame.
In this training step, for the second encoder and the second decoder, the motion information corresponding to each video frame used for predicting motion information among the first 4 video frames is input into the second encoder, which outputs a second encoding result; the second encoding result is input into the second decoder, which outputs the predicted motion information corresponding to the 5th video frame.
The video frames used for predicting motion information are the first 4 video frames except the 1st, i.e., the 2nd, 3rd and 4th video frames. The motion information corresponding to the 2nd video frame is the motion information between the 2nd and 1st video frames, that corresponding to the 3rd video frame is the motion information between the 3rd and 2nd video frames, and that corresponding to the 4th video frame is the motion information between the 4th and 3rd video frames.
The fifth loss corresponding to the 5th video frame is calculated, and the network parameters of the second encoder and the second decoder are updated based on it.
The fifth loss is the loss between the predicted motion information corresponding to the 5th video frame and the motion information corresponding to the 5th video frame.
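The losses in the worked example can be sketched as below. The patent names the loss types but not their exact formulas, so the concrete choices here (MSE for intensity, L1 for flow and gradients, a generator-style log loss against a discriminator score in [0, 1]) are common assumptions from frame-prediction literature, not the patent's definitions; the discriminator itself is left as a stand-in scalar.

```python
import numpy as np

# Hedged sketch of the losses in the worked example. The fifth loss (between
# predicted and actual motion information) has the same form as motion_loss.

def intensity_loss(pred, target):          # first loss: frame vs. frame
    return float(np.mean((pred - target) ** 2))

def motion_loss(pred_flow, target_flow):   # second/fifth loss: flow map vs. flow map
    return float(np.mean(np.abs(pred_flow - target_flow)))

def gradient_loss(pred, target):           # third loss: edge-gradient difference
    gy_p, gx_p = np.gradient(pred)
    gy_t, gx_t = np.gradient(target)
    return float(np.mean(np.abs(gy_p - gy_t)) + np.mean(np.abs(gx_p - gx_t)))

def adversarial_loss(discriminator_score): # fourth loss: fool a discriminator
    # discriminator_score in (0, 1]: probability the predicted frame is "real".
    return float(-np.log(discriminator_score + 1e-8))

pred = np.zeros((4, 4))
target = np.zeros((4, 4))
total = (intensity_loss(pred, target) + motion_loss(pred, target)
         + gradient_loss(pred, target) + adversarial_loss(0.9))
```

A weighted sum of these terms would drive the updates of the first encoder and first decoder; the fifth loss alone would drive the second encoder and second decoder.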
In the present disclosure, when the first encoder, the second encoder, the first decoder and the second decoder are pre-trained in a self-supervised manner, multiple video frames participate in each training step of the first encoder and first decoder, and multiple pieces of motion information participate in each training step of the second encoder and second decoder. Compared with a single video frame per step, multiple video frames carry richer apparent features, so the first encoder and first decoder can learn richer apparent features, improving their accuracy after training. Likewise, multiple pieces of motion information carry richer motion features, so the second encoder and second decoder can learn richer motion features, improving their accuracy after training.
In the disclosure, a video frame to be judged may be input into a first encoder, so as to obtain an apparent characteristic corresponding to the video frame to be judged output by the first encoder.
In the present disclosure, the motion feature corresponding to the video frame to be judged is the feature of the motion information corresponding to that frame. The motion information corresponding to the video frame to be judged is the motion information between it and its associated video frame. The associated video frame may be a video frame used for optical flow estimation that precedes the video frame to be judged in the video to which it belongs; for example, the previous video frame. The motion information between the two frames can be obtained by performing optical flow estimation on them with a preset optical flow estimation algorithm.
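The patent leaves the optical flow algorithm open ("a preset optical flow estimation algorithm"); in practice this is often Farneback's method or a learned estimator. As a self-contained stand-in, the toy estimator below recovers a single global integer displacement between two frames by exhaustive search, which is enough to show where the motion information between a frame and its associated previous frame comes from; it is not a real optical-flow implementation.

```python
import numpy as np

def global_motion(prev: np.ndarray, curr: np.ndarray, max_shift: int = 3):
    """Return the (dy, dx) integer shift that best aligns prev with curr.

    A stand-in for dense optical flow: real estimators produce a per-pixel
    flow field, whereas this finds one global displacement by brute force.
    """
    best, best_err = (0, 0), np.inf
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            shifted = np.roll(np.roll(prev, dy, axis=0), dx, axis=1)
            err = np.mean((shifted - curr) ** 2)
            if err < best_err:
                best, best_err = (dy, dx), err
    return best

# Toy frames: a single bright pixel that moves down 2 rows and right 1 column.
prev = np.zeros((16, 16))
prev[4, 4] = 1.0
curr = np.roll(np.roll(prev, 2, axis=0), 1, axis=1)
```

Here `global_motion(prev, curr)` recovers the displacement (2, 1), i.e., the motion direction and speed of the object between the associated frame and the frame to be judged.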
In the disclosure, motion information corresponding to a video frame to be determined may be input to a second encoder, so as to obtain motion characteristics corresponding to the video frame to be determined output by the second encoder.
Step 102, finding out a priori preset apparent characteristic with the shortest apparent characteristic distance corresponding to the video frame to be judged from a plurality of preset apparent characteristics, and finding out a priori preset motion characteristic with the shortest motion characteristic distance corresponding to the video frame to be judged from a plurality of preset motion characteristics.
In the present disclosure, a preset first video frame set used for deriving the plurality of preset apparent features may be obtained in advance. Each video frame in the preset first video frame set is a normal video frame. For example, a plurality of videos not associated with any abnormal event are acquired in advance, and each video frame in these videos is a normal video frame. For each acquired video, a portion of its video frames is extracted. All video frames extracted from the plurality of videos form the preset first video frame set.
The first encoder may be utilized to obtain an apparent characteristic corresponding to each video frame in a preset first set of video frames. And clustering the apparent features corresponding to each video frame in the preset first video frame set by adopting a preset clustering algorithm, such as a k-means clustering algorithm, so as to obtain a plurality of apparent feature clustering results and a clustering center corresponding to each apparent feature clustering result.
For each apparent feature clustering result, the apparent feature within that clustering result having the highest similarity to the corresponding cluster center can be determined as a preset apparent feature.
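The cluster-then-select step above can be sketched as follows. This is a minimal numpy sketch under assumptions: features are fixed-length vectors, similarity is measured by Euclidean distance, and a plain Lloyd k-means stands in for whatever preset clustering algorithm is used; all function names are illustrative.

```python
import numpy as np

def kmeans(features, k, iters=20):
    """Plain Lloyd's k-means (the disclosure only requires a preset
    clustering algorithm such as k-means). Initialization here simply
    takes the first k features; a real implementation would use
    random or k-means++ initialization."""
    centers = features[:k].astype(float)
    for _ in range(iters):
        # assign every feature to its nearest center (Euclidean distance)
        d = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # move each center to the mean of its assigned features
        for j in range(k):
            if np.any(labels == j):
                centers[j] = features[labels == j].mean(axis=0)
    d = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=2)
    return centers, d.argmin(axis=1)

def preset_features_from_clusters(features, centers, labels):
    """For each clustering result, pick the member feature closest to
    its cluster center as the preset feature."""
    presets = []
    for j in range(len(centers)):
        members = features[labels == j]
        if len(members) > 0:
            dists = np.linalg.norm(members - centers[j], axis=1)
            presets.append(members[dists.argmin()])
    return np.stack(presets)
```

The same routine can also produce the preset motion features by clustering the motion features of the preset second video frame set.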
When searching for the a priori preset apparent feature with the shortest distance to the apparent feature corresponding to the video frame to be judged, the preset apparent feature, among the plurality of preset apparent features, with the highest similarity to the apparent feature corresponding to the video frame to be judged can be used as the a priori preset apparent feature.
In the present disclosure, a preset second set of video frames for deriving a plurality of preset motion features may be acquired in advance. Each video frame in the second video frame set is preset to be a normal video frame.
And acquiring the motion characteristic corresponding to each video frame in the preset second video frame set by using the second encoder. And clustering the motion characteristics corresponding to each video frame in the second video frame set by adopting a preset clustering algorithm to obtain a plurality of motion characteristic clustering results and a clustering center corresponding to each motion characteristic clustering result.
For each motion feature clustering result, the motion feature within that clustering result having the highest similarity to the corresponding cluster center can be determined as a preset motion feature.
When searching for the a priori preset motion feature with the shortest distance to the motion feature corresponding to the video frame to be judged, the preset motion feature, among the plurality of preset motion features, with the highest similarity to the motion feature corresponding to the video frame to be judged can be used as the a priori preset motion feature.
In some embodiments, the preset apparent feature is a first cluster center, the first cluster center is a cluster center corresponding to an apparent feature clustering result, the apparent feature clustering result is obtained by clustering apparent features corresponding to video frames in a preset first video frame set, the preset motion feature is a second cluster center, the second cluster center is a cluster center corresponding to a motion feature clustering result, and the motion feature clustering result is obtained by clustering motion features corresponding to video frames in a preset second video frame set. Searching the prior preset apparent feature corresponding to the video frame to be judged from the plurality of preset apparent features comprises the following steps: calculating the Euclidean distance between the apparent feature corresponding to the video frame to be judged and each preset apparent feature; and determining the preset apparent feature with the minimum Euclidean distance to the apparent feature corresponding to the video frame to be judged as the prior preset apparent feature. Searching the prior preset motion feature corresponding to the video frame to be judged from the plurality of preset motion features comprises the following steps: calculating the Euclidean distance between the motion feature corresponding to the video frame to be judged and each preset motion feature; and determining the preset motion feature with the minimum Euclidean distance to the motion feature corresponding to the video frame to be judged as the prior preset motion feature.
In the present disclosure, a cluster center corresponding to an apparent feature clustering result obtained by clustering an apparent feature corresponding to each video frame in a preset first video frame set may be referred to as a first cluster center. The clustering center corresponding to the motion feature clustering result obtained by clustering the motion feature corresponding to each video frame in the preset second video frame set may be referred to as a second clustering center.
In the present disclosure, each first cluster center may be respectively used as a preset apparent feature, and each second cluster center may be respectively used as a preset motion feature. The preset apparent feature, among the plurality of preset apparent features, with the minimum Euclidean distance to the apparent feature corresponding to the video frame to be judged can be determined as the a priori preset apparent feature. The preset motion feature, among the plurality of preset motion features, with the minimum Euclidean distance to the motion feature corresponding to the video frame to be judged can be determined as the a priori preset motion feature.
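The a priori feature lookup in this embodiment reduces to a nearest-neighbor search under the Euclidean distance. A minimal sketch, assuming features are numpy vectors (the function name is illustrative):

```python
import numpy as np

def find_a_priori_feature(feature, preset_features):
    """Return the preset feature with the minimum Euclidean distance to
    the query feature, together with that distance."""
    dists = np.linalg.norm(preset_features - feature, axis=1)
    idx = dists.argmin()
    return preset_features[idx], dists[idx]
```

The same routine serves both lookups: pass the apparent feature with the preset apparent features, or the motion feature with the preset motion features.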
In the present disclosure, since the Euclidean distance between the a priori preset apparent feature and the apparent feature corresponding to the video frame to be judged is minimal, the a priori preset apparent feature can be used to re-express the apparent feature corresponding to the video frame to be judged more accurately; likewise, since the Euclidean distance between the a priori preset motion feature and the motion feature corresponding to the video frame to be judged is minimal, the a priori preset motion feature can be used to re-express the motion feature corresponding to the video frame to be judged more accurately. Whether the video frame to be judged is an abnormal video frame can therefore be determined more accurately by utilizing the a priori preset apparent feature and the a priori preset motion feature.
Step 103, inputting the priori preset apparent features into the first decoder to obtain a predicted video frame, and inputting the priori preset motion features into the second decoder to obtain predicted motion information.
In the present disclosure, a priori preset apparent characteristics may be input into a first decoder, which outputs a predicted video frame. The a priori preset motion characteristics may be input into a second decoder, which outputs the predicted motion information.
Step 104, determining a first error corresponding to the video frame to be judged and a second error corresponding to the video frame to be judged, and determining whether the video frame to be judged is an abnormal video frame based on the first error corresponding to the video frame to be judged and the second error corresponding to the video frame to be judged.
In the present disclosure, the first error corresponding to the video frame to be judged indicates the degree of difference between the predicted video frame and the video frame to be judged, and the second error corresponding to the video frame to be judged indicates the degree of difference between the predicted motion information and the motion information corresponding to the video frame to be judged. The loss between the predicted video frame and the video frame to be judged, such as an L1 loss or an L2 loss, can be calculated using a preset loss function, and this loss is used as the first error corresponding to the video frame to be judged. Similarly, the loss between the predicted motion information and the motion information corresponding to the video frame to be judged can be calculated using a preset loss function, and this loss is used as the second error corresponding to the video frame to be judged.
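The first and second errors can be computed with either loss mentioned above. A sketch assuming frames and motion fields are numpy arrays; whether L1 or L2 is used for each error is a design choice the disclosure leaves open:

```python
import numpy as np

def l1_loss(pred, target):
    # mean absolute difference between prediction and target
    return float(np.abs(pred - target).mean())

def l2_loss(pred, target):
    # mean squared difference between prediction and target
    return float(((pred - target) ** 2).mean())

# first error:  l1_loss(predicted_frame, frame_to_be_judged)
# second error: l1_loss(predicted_motion, motion_of_frame_to_be_judged)
```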
In the present disclosure, after determining the first error corresponding to the video frame to be determined and the second error corresponding to the video frame to be determined, whether the video frame to be determined is an abnormal video frame may be determined based on the first error corresponding to the video frame to be determined and the second error corresponding to the video frame to be determined.
For example, it may be determined that the video frame to be determined is an abnormal video frame when the first error corresponding to the video frame to be determined is greater than the first error threshold and/or the second error corresponding to the video frame to be determined is greater than the second error threshold.
In the present disclosure, the first encoder, the second encoder, the first decoder, and the second decoder are utilized to determine the first error and the second error corresponding to the video frame to be judged, and whether the video frame to be judged is an abnormal video frame is determined based on these two errors. The training set utilized when the first encoder, the second encoder, the first decoder, and the second decoder are trained in advance comprises only one type of video frame, namely normal video frames, so the video frames in the training set do not need to be labeled by related personnel. This avoids the high cost incurred when related personnel must label a large number of normal video frames and a large number of abnormal video frames, and therefore determining whether a video frame is an abnormal video frame is realized at low cost. Meanwhile, the accuracy of determining whether a video frame is abnormal is not influenced by negative samples, and the method is suitable for determining abnormal video frames related to any type of abnormal event.
In some embodiments, determining whether the video frame to be determined is an abnormal video frame based on the first error corresponding to the video frame to be determined and the second error corresponding to the video frame to be determined includes: calculating a normalized value of the first error and a normalized value of the second error; calculating a weighted sum of the normalized value of the first error and the normalized value of the second error based on the preset weight of the first error and the preset weight of the second error; determining the weighted sum as the score of the video frame to be judged; and under the condition that the score of the video frame to be judged is larger than the score threshold value, determining that the video frame to be judged is an abnormal video frame.
In the present disclosure, for each of a plurality of video frames, including the video frame to be judged, in the video to which the video frame to be judged belongs, steps 101 to 104 are performed to determine whether that video frame is an abnormal video frame. A first error and a second error corresponding to each of the plurality of video frames may be determined separately. The first errors corresponding to the plurality of video frames are normalized to obtain a normalized value of the first error for each video frame, and the second errors are normalized likewise. The weighted sum of the normalized value of the first error and the normalized value of the second error corresponding to the video frame to be judged can then be calculated based on the preset weight of the first error and the preset weight of the second error, and this weighted sum is determined as the score of the video frame to be judged. The score of the video frame to be judged is compared with a score threshold, and when the score is greater than the score threshold, the video frame to be judged is determined to be an abnormal video frame.
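The scoring procedure above can be sketched as follows. Assumptions: min-max normalization over the video's frames stands in for the unspecified "normalization processing", and the weights and threshold are illustrative values, not ones fixed by the disclosure.

```python
import numpy as np

def frame_scores(first_errors, second_errors, w1=0.5, w2=0.5, eps=1e-8):
    """Normalize each error sequence over the frames of one video, then
    score each frame as the weighted sum of its two normalized errors."""
    e1 = np.asarray(first_errors, dtype=float)
    e2 = np.asarray(second_errors, dtype=float)
    n1 = (e1 - e1.min()) / (e1.max() - e1.min() + eps)
    n2 = (e2 - e2.min()) / (e2.max() - e2.min() + eps)
    return w1 * n1 + w2 * n2

scores = frame_scores([0.1, 0.2, 0.9], [0.05, 0.10, 0.80])
abnormal = scores > 0.8  # frames whose score exceeds the score threshold
```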
Referring to fig. 2, a flowchart of obtaining the score of the video frame to be judged is shown.
And inputting the video frame to be judged into a first encoder, and outputting apparent characteristics corresponding to the video frame to be judged by the first encoder. And obtaining the motion information corresponding to the video frame to be judged through optical flow extraction. And inputting the motion information corresponding to the video frame to be judged into a second encoder, and outputting the motion characteristics corresponding to the video frame to be judged by the second encoder. The prior preset apparent characteristics with the shortest apparent characteristic distance corresponding to the video frame to be judged are searched out from the preset apparent characteristics, and the prior preset motion characteristics with the shortest motion characteristic distance corresponding to the video frame to be judged are searched out from the preset motion characteristics. The prior preset apparent features are input into a first decoder, the first decoder outputs predicted video frames, the prior preset motion features are input into a second decoder, and the second decoder outputs predicted motion information. And calculating the loss between the predicted video frame and the video frame to be judged, and taking the loss between the predicted video frame and the video frame to be judged as a first error corresponding to the video frame to be judged. And calculating the loss between the predicted motion information and the motion information corresponding to the video frame to be judged, and taking the loss between the predicted motion information and the motion information corresponding to the video frame to be judged as a second error corresponding to the video frame to be judged. 
Calculating the normalized value of the first error and the normalized value of the second error, calculating the weighted sum of the normalized value of the first error and the normalized value of the second error based on the preset weight of the first error and the preset weight of the second error, and determining the weighted sum as the score of the video frame to be judged.
In the present disclosure, the difference between the video frame to be judged and normal video frames is measured by the score of the video frame to be judged. This score can represent the degree of difference relatively accurately, and comparing it with the score threshold determines whether the video frame to be judged is an abnormal video frame, so abnormal video frames can be determined more accurately. When calculating the score, the normalized value of the first error and the normalized value of the second error corresponding to the video frame to be judged are used, which further improves the accuracy of the score. Meanwhile, the degree to which the score correlates with the first error and with the second error is also considered: the preset weight of the first error and the preset weight of the second error participate in the calculation, so that the calculated score can more accurately represent the degree of difference between the video frame to be judged and normal video frames.
Fig. 3 is a block diagram showing a structure of an information acquisition apparatus according to an exemplary embodiment. Referring to fig. 3, the information acquisition apparatus includes: an acquisition module 301, a search module 302, a decoding module 303, and a determination module 304.
The obtaining module 301 is configured to input a video frame to be judged into a first encoder to obtain an apparent feature corresponding to the video frame to be judged, and input motion information corresponding to the video frame to be judged into a second encoder to obtain a motion feature corresponding to the video frame to be judged, wherein the first encoder and the second encoder are trained in advance based on a training set, each video frame in the training set is a normal video frame, and the normal video frame is a video frame which is not associated with an abnormal situation;
the searching module 302 is configured to search out a priori preset apparent feature with the shortest apparent feature distance corresponding to the video frame to be judged from a plurality of preset apparent features, and search out a priori preset motion feature with the shortest motion feature distance corresponding to the video frame to be judged from a plurality of preset motion features;
the decoding module 303 is configured to input a priori preset apparent features into a first decoder to obtain a predicted video frame, and input a priori preset motion features into a second decoder to obtain predicted motion information, wherein the first decoder and the second decoder are trained in advance based on the training set;
The determining module 304 is configured to determine a first error corresponding to the video frame to be determined and a second error corresponding to the video frame to be determined, and determine whether the video frame to be determined is an abnormal video frame based on the first error and the second error, where the first error indicates a degree of difference between the predicted video frame and the video frame to be determined, and the second error indicates a degree of difference between the predicted motion information and the motion information corresponding to the video frame to be determined.
In some embodiments, the determination module 304 is further configured to calculate a normalized value of the first error and a normalized value of the second error; calculating a weighted sum of the normalized value of the first error and the normalized value of the second error based on the preset weight of the first error and the preset weight of the second error; determining the weighted sum as the score of the video frame to be judged; and under the condition that the score of the video frame to be judged is larger than a score threshold value, determining that the video frame to be judged is an abnormal video frame.
In some embodiments, the preset apparent feature is a first cluster center, the first cluster center is a cluster center corresponding to an apparent feature clustering result, the apparent feature clustering result is obtained by clustering apparent features corresponding to video frames in a preset first video frame set, the preset motion feature is a second cluster center, the second cluster center is a cluster center corresponding to a motion feature clustering result, and the motion feature clustering result is obtained by clustering motion features corresponding to video frames in a preset second video frame set. The searching module 302 is further configured to calculate the Euclidean distance between the apparent feature corresponding to the video frame to be judged and each preset apparent feature; determine the preset apparent feature with the minimum Euclidean distance to the apparent feature corresponding to the video frame to be judged as the prior preset apparent feature; calculate the Euclidean distance between the motion feature corresponding to the video frame to be judged and each preset motion feature; and determine the preset motion feature with the minimum Euclidean distance to the motion feature corresponding to the video frame to be judged as the prior preset motion feature.
In some embodiments, the information acquisition apparatus further comprises:
the training module is configured to, before the video frame to be judged is input into the first encoder to obtain the apparent feature corresponding to the video frame to be judged and the motion information corresponding to the video frame to be judged is input into the second encoder to obtain the motion feature corresponding to the video frame to be judged: determine an expected output video frame in the training set and a preset number of video frames preceding the expected output video frame; input the preset number of video frames into the first encoder to obtain a first encoding result, and input the first encoding result into the first decoder to obtain a predicted output video frame corresponding to the expected output video frame; calculate a first loss, a second loss, a third loss, and a fourth loss corresponding to the expected output video frame, wherein the first loss is the loss between the predicted output video frame and the expected output video frame; the second loss is the loss between the motion information corresponding to the predicted output video frame and the motion information corresponding to the expected output video frame, the motion information corresponding to the predicted output video frame being the motion information between the predicted output video frame and the video frame preceding the expected output video frame, and the motion information corresponding to the expected output video frame being the motion information between the expected output video frame and the video frame preceding the expected output video frame; the third loss is the difference between the edge gradient of the predicted output video frame and the edge gradient of the expected output video frame; and the fourth loss is an adversarial loss between the predicted output video frame and the expected output video frame; update the network parameters of the first encoder and the network parameters of the first decoder based on the first loss, the second loss, the third loss, and the fourth loss; input the motion information corresponding to each video frame used for predicting motion information among the preceding preset number of video frames into the second encoder to obtain a second encoding result, and input the second encoding result into the second decoder to obtain predicted motion information corresponding to the expected output video frame, wherein the video frames used for predicting motion information are the video frames other than the first video frame among the preceding preset number of video frames; and calculate a fifth loss and update the network parameters of the second encoder and the second decoder based on the fifth loss, wherein the fifth loss is the loss between the predicted motion information corresponding to the expected output video frame and the motion information corresponding to the expected output video frame.
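The first, second, and third losses named above are straightforward to express. A numpy sketch under assumptions: L1 for the frame loss, L2 for the motion loss, forward differences as the edge gradient, the adversarial fourth loss supplied externally by a discriminator that is not shown, and unit loss weights; all names are illustrative.

```python
import numpy as np

def edge_gradients(frame):
    # forward differences along height and width approximate edge gradients
    return np.abs(np.diff(frame, axis=0)), np.abs(np.diff(frame, axis=1))

def gradient_loss(pred, target):
    # third loss: difference between the edge gradients of the
    # predicted output frame and the expected output frame
    pgy, pgx = edge_gradients(pred)
    tgy, tgx = edge_gradients(target)
    return float(np.abs(pgy - tgy).mean() + np.abs(pgx - tgx).mean())

def first_branch_loss(pred, target, pred_motion, target_motion, adv_loss=0.0):
    l1 = float(np.abs(pred - target).mean())                 # first loss
    l2 = float(((pred_motion - target_motion) ** 2).mean())  # second loss
    l3 = gradient_loss(pred, target)                         # third loss
    return l1 + l2 + l3 + adv_loss                           # fourth loss: adversarial
```

Note that the gradient loss ignores constant brightness offsets between frames, which is exactly why it is combined with the pixel-wise first loss rather than replacing it.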
The specific manner in which the various modules perform their operations in the apparatus of the above embodiment has been described in detail in the embodiments of the method, and will not be elaborated here.
Fig. 4 is a block diagram of an electronic device, according to an example embodiment. Referring to fig. 4, the electronic device includes a processing component 422 that further includes one or more processors, and memory resources represented by memory 432, for storing instructions, such as application programs, executable by the processing component 422. The application program stored in memory 432 may include one or more modules each corresponding to a set of instructions. Further, the processing component 422 is configured to execute instructions to perform the above-described methods.
The electronic device may also include a power component 426 configured to perform power management of the electronic device, a wired or wireless network interface 450 configured to connect the electronic device to a network, and an input/output (I/O) interface 458. The electronic device may operate based on an operating system stored in the memory 432, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
In an exemplary embodiment, a storage medium is also provided, e.g., a memory, comprising instructions executable by an electronic device to perform the above-described method. Alternatively, the computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
In an exemplary embodiment, the present application also provides a computer program product comprising computer readable code which, when run on an electronic device, causes the electronic device to perform the above-described information acquisition method.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the invention is limited only by the appended claims.

Claims (11)

1. An information acquisition method, the method comprising:
inputting a video frame to be judged into a first encoder to obtain apparent characteristics corresponding to the video frame to be judged, and inputting motion information corresponding to the video frame to be judged into a second encoder to obtain motion characteristics corresponding to the video frame to be judged, wherein the first encoder and the second encoder are trained in advance based on a training set, each video frame in the training set is a normal video frame, and the normal video frame is a video frame which is not associated with abnormal conditions;
searching a priori preset apparent characteristic with the shortest apparent characteristic distance corresponding to the video frame to be judged from a plurality of preset apparent characteristics, and searching a priori preset motion characteristic with the shortest motion characteristic distance corresponding to the video frame to be judged from a plurality of preset motion characteristics;
inputting the prior preset apparent features into a first decoder to obtain a predicted video frame, and inputting the prior preset motion features into a second decoder to obtain predicted motion information, wherein the first decoder and the second decoder are trained in advance based on the training set;
determining a first error corresponding to the video frame to be judged and a second error corresponding to the video frame to be judged, and determining whether the video frame to be judged is an abnormal video frame based on the first error and the second error, wherein the first error indicates the degree of difference between the predicted video frame and the video frame to be judged, and the second error indicates the degree of difference between the predicted motion information and the motion information corresponding to the video frame to be judged.
2. The method of claim 1, wherein determining whether the video frame to be determined is an anomalous video frame based on the first error and the second error comprises:
calculating a normalized value of the first error and a normalized value of the second error;
calculating a weighted sum of the normalized value of the first error and the normalized value of the second error based on the preset weight of the first error and the preset weight of the second error;
determining the weighted sum as the score of the video frame to be judged;
and under the condition that the score of the video frame to be judged is larger than a score threshold value, determining that the video frame to be judged is an abnormal video frame.
3. The method according to claim 1, wherein the preset apparent feature is a first cluster center, the first cluster center is a cluster center corresponding to an apparent feature cluster result, the apparent feature cluster result is obtained by clustering apparent features corresponding to video frames in a preset first video frame set, the preset motion feature is a second cluster center, the second cluster center is a cluster center corresponding to a motion feature cluster result, and the motion feature cluster result is obtained by clustering motion features corresponding to video frames in a preset second video frame set;
the searching the prior preset apparent characteristics corresponding to the video frame to be judged from a plurality of preset apparent characteristics comprises the following steps:
calculating Euclidean distance between the apparent features corresponding to the video frames to be judged and each preset apparent feature;
determining the preset apparent feature with the minimum Euclidean distance to the apparent feature corresponding to the video frame to be judged as the prior preset apparent feature; and
the searching the prior preset motion characteristics corresponding to the video frame to be judged from a plurality of preset motion characteristics comprises the following steps:
calculating the Euclidean distance between the motion characteristic corresponding to the video frame to be judged and each preset motion characteristic;
and determining the preset motion characteristic with the minimum Euclidean distance to the motion characteristic corresponding to the video frame to be judged as the priori preset motion characteristic.
4. The method of claim 1, wherein before inputting the video frame to be determined into the first encoder to obtain the apparent feature corresponding to the video frame to be determined, and inputting the motion information corresponding to the video frame to be determined into the second encoder to obtain the motion feature corresponding to the video frame to be determined, the method further comprises:
determining an expected output video frame in the training set and a preset number of video frames preceding the expected output video frame in the training set;
inputting the preceding preset number of video frames into the first encoder to obtain a first encoding result, and inputting the first encoding result into the first decoder to obtain a predicted output video frame corresponding to the expected output video frame;
calculating a first loss corresponding to the expected output video frame, a second loss corresponding to the expected output video frame, a third loss corresponding to the expected output video frame, and a fourth loss corresponding to the expected output video frame, wherein the first loss is a loss between the predicted output video frame and the expected output video frame, the second loss is a loss between motion information corresponding to the predicted output video frame and motion information corresponding to the expected output video frame, the motion information corresponding to the predicted output video frame is motion information between the predicted output video frame and a video frame preceding the expected output video frame, the motion information corresponding to the expected output video frame is motion information between the expected output video frame and a video frame preceding the expected output video frame, the third loss is a difference between an edge gradient of the predicted output video frame and an edge gradient of the expected output video frame, and the fourth loss is an adversarial loss between the predicted output video frame and the expected output video frame;
updating network parameters of the first encoder and network parameters of the first decoder based on the first loss, the second loss, the third loss, and the fourth loss;
inputting motion information corresponding to each video frame used for predicting motion information among the preceding preset number of video frames into the second encoder to obtain a second encoding result, and inputting the second encoding result into the second decoder to obtain predicted motion information corresponding to the expected output video frame, wherein the video frames used for predicting motion information are the video frames other than the first video frame among the preceding preset number of video frames;
and calculating a fifth loss, and updating network parameters of the second encoder and the second decoder based on the fifth loss, wherein the fifth loss is a loss between the predicted motion information corresponding to the expected output video frame and the motion information corresponding to the expected output video frame.
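Two of the four training losses in this claim lend themselves to a compact sketch: the frame reconstruction loss (first loss) and the edge-gradient loss (third loss). The sketch below is illustrative only and not part of the claims; the function names are invented, finite differences are assumed as the edge gradient, and the motion loss and adversarial loss are omitted because they would require an optical-flow estimator and a discriminator network:

```python
import numpy as np

def intensity_loss(pred: np.ndarray, target: np.ndarray) -> float:
    """First loss: pixel-wise mean-squared error between the predicted
    output video frame and the expected output video frame."""
    return float(np.mean((pred - target) ** 2))

def gradient_loss(pred: np.ndarray, target: np.ndarray) -> float:
    """Third loss: difference between the edge gradients of the two frames,
    with horizontal/vertical finite differences standing in for the gradient."""
    dpx = np.abs(np.diff(pred, axis=1))    # horizontal gradient of prediction
    dpy = np.abs(np.diff(pred, axis=0))    # vertical gradient of prediction
    dtx = np.abs(np.diff(target, axis=1))  # horizontal gradient of target
    dty = np.abs(np.diff(target, axis=0))  # vertical gradient of target
    return float(np.mean(np.abs(dpx - dtx)) + np.mean(np.abs(dpy - dty)))

# Toy 4x4 grayscale frames; identical frames give zero total loss
pred = np.ones((4, 4))
target = np.ones((4, 4))
total = intensity_loss(pred, target) + gradient_loss(pred, target)  # 0.0
```

In training, these terms would be combined (typically as a weighted sum, together with the motion and adversarial terms) into the objective used to update the first encoder and first decoder.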
5. An information acquisition apparatus, characterized in that the apparatus comprises:
an acquisition module configured to input a video frame to be judged into a first encoder to obtain an apparent feature corresponding to the video frame to be judged, and input motion information corresponding to the video frame to be judged into a second encoder to obtain a motion feature corresponding to the video frame to be judged, wherein the first encoder and the second encoder are trained in advance based on a training set, each video frame in the training set is a normal video frame, and a normal video frame is a video frame not associated with an abnormal condition;
a searching module configured to search, from among a plurality of preset apparent features, for an a priori preset apparent feature having the shortest distance to the apparent feature corresponding to the video frame to be judged, and to search, from among a plurality of preset motion features, for an a priori preset motion feature having the shortest distance to the motion feature corresponding to the video frame to be judged;
a decoding module configured to input the a priori preset apparent feature into a first decoder to obtain a predicted video frame, and input the a priori preset motion feature into a second decoder to obtain predicted motion information; and
a determining module configured to determine a first error corresponding to the video frame to be judged and a second error corresponding to the video frame to be judged, and to determine, based on the first error and the second error, whether the video frame to be judged is an abnormal video frame, wherein the first error indicates a degree of difference between the predicted video frame and the video frame to be judged, and the second error indicates a degree of difference between the predicted motion information and the motion information corresponding to the video frame to be judged.
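The four modules of this apparatus can be sketched end to end with stand-in encoders and decoders. Everything below is hypothetical and only illustrates the data flow: the toy feature extractors, shapes, and data are invented, not the claimed neural networks:

```python
import numpy as np

# Stubs standing in for the trained networks; the real modules encode and
# decode with neural networks trained only on normal video frames.
def encode_appearance(frame): return frame.mean(axis=0)    # toy apparent feature
def encode_motion(flow): return flow.mean(axis=0)          # toy motion feature
def decode_appearance(feat): return np.tile(feat, (4, 1))  # toy predicted frame
def decode_motion(feat): return np.tile(feat, (4, 1))      # toy predicted motion

def first_and_second_errors(frame, flow, app_centers, mot_centers):
    """Mirror of the module chain: encode, look up the nearest preset
    feature, decode, then measure the two reconstruction errors."""
    app = encode_appearance(frame)
    mot = encode_motion(flow)
    prior_app = app_centers[np.argmin(np.linalg.norm(app_centers - app, axis=1))]
    prior_mot = mot_centers[np.argmin(np.linalg.norm(mot_centers - mot, axis=1))]
    pred_frame = decode_appearance(prior_app)
    pred_flow = decode_motion(prior_mot)
    e1 = float(np.mean((pred_frame - frame) ** 2))  # first error (appearance)
    e2 = float(np.mean((pred_flow - flow) ** 2))    # second error (motion)
    return e1, e2

frame = np.zeros((4, 3))  # toy video frame to be judged
flow = np.zeros((4, 3))   # toy motion information (e.g. optical flow)
app_centers = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]])
mot_centers = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]])
e1, e2 = first_and_second_errors(frame, flow, app_centers, mot_centers)
# a frame matching a normal-pattern center reconstructs with zero error
```

The intuition is that frames resembling the normal patterns captured by the preset features reconstruct well (small errors), while abnormal frames do not.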
6. The apparatus according to claim 5, wherein the determining module is further configured to calculate a normalized value of the first error and a normalized value of the second error; calculate a weighted sum of the normalized value of the first error and the normalized value of the second error based on a preset weight of the first error and a preset weight of the second error; determine the weighted sum as a score of the video frame to be judged; and determine that the video frame to be judged is an abnormal video frame in a case where the score of the video frame to be judged is greater than a score threshold.
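A sketch of the scoring rule in this claim, assuming min-max normalization (the claim does not fix the normalization method) and equal preset weights; the normalization ranges and the threshold are invented for illustration:

```python
def anomaly_score(e1: float, e2: float,
                  e1_min: float, e1_max: float,
                  e2_min: float, e2_max: float,
                  w1: float = 0.5, w2: float = 0.5) -> float:
    """Normalize the two errors and combine them with preset weights.
    Assumes non-degenerate ranges (max > min), e.g. estimated on held-out
    normal video; min-max normalization is one plausible choice."""
    n1 = (e1 - e1_min) / (e1_max - e1_min)  # normalized first error
    n2 = (e2 - e2_min) / (e2_max - e2_min)  # normalized second error
    return w1 * n1 + w2 * n2

score = anomaly_score(0.8, 0.6, 0.0, 1.0, 0.0, 1.0)  # 0.5*0.8 + 0.5*0.6 = 0.7
is_abnormal = score > 0.5  # the score threshold is a tunable assumption
```

A frame is flagged as abnormal exactly when its weighted, normalized reconstruction error exceeds the chosen threshold.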
7. The apparatus according to claim 5, wherein the preset apparent feature is a first cluster center, the first cluster center is a cluster center corresponding to an apparent feature clustering result, the apparent feature clustering result is obtained by clustering apparent features corresponding to video frames in a preset first video frame set, the preset motion feature is a second cluster center, the second cluster center is a cluster center corresponding to a motion feature clustering result, and the motion feature clustering result is obtained by clustering motion features corresponding to video frames in a preset second video frame set; and the searching module is further configured to calculate a Euclidean distance between the apparent feature corresponding to the video frame to be judged and each preset apparent feature; determine the preset apparent feature with the minimum Euclidean distance to the apparent feature corresponding to the video frame to be judged as the a priori preset apparent feature; calculate a Euclidean distance between the motion feature corresponding to the video frame to be judged and each preset motion feature; and determine the preset motion feature with the minimum Euclidean distance to the motion feature corresponding to the video frame to be judged as the a priori preset motion feature.
8. The apparatus of claim 5, wherein the apparatus further comprises:
a training module configured to: before the video frame to be judged is input into the first encoder to obtain the apparent feature corresponding to the video frame to be judged and the motion information corresponding to the video frame to be judged is input into the second encoder to obtain the motion feature corresponding to the video frame to be judged, determine an expected output video frame in the training set and a preset number of video frames preceding the expected output video frame; input the preceding preset number of video frames into the first encoder to obtain a first encoding result, and input the first encoding result into the first decoder to obtain a predicted output video frame corresponding to the expected output video frame; calculate a first loss corresponding to the expected output video frame, a second loss corresponding to the expected output video frame, a third loss corresponding to the expected output video frame, and a fourth loss corresponding to the expected output video frame, wherein the first loss is a loss between the predicted output video frame and the expected output video frame, the second loss is a loss between motion information corresponding to the predicted output video frame and motion information corresponding to the expected output video frame, the motion information corresponding to the predicted output video frame is motion information between the predicted output video frame and a video frame preceding the expected output video frame, the motion information corresponding to the expected output video frame is motion information between the expected output video frame and a video frame preceding the expected output video frame, the third loss is a difference between an edge gradient of the predicted output video frame and an edge gradient of the expected output video frame, and the fourth loss is an adversarial loss between the predicted output video frame and the expected output video frame; update network parameters of the first encoder and network parameters of the first decoder based on the first loss, the second loss, the third loss, and the fourth loss; input motion information corresponding to each video frame used for predicting motion information among the preceding preset number of video frames into the second encoder to obtain a second encoding result, and input the second encoding result into the second decoder to obtain predicted motion information corresponding to the expected output video frame, wherein the video frames used for predicting motion information are the video frames other than the first video frame among the preceding preset number of video frames; and calculate a fifth loss and update network parameters of the second encoder and the second decoder based on the fifth loss, wherein the fifth loss is a loss between the predicted motion information corresponding to the expected output video frame and the motion information corresponding to the expected output video frame.
9. An electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the method of any one of claims 1 to 4.
10. A computer-readable storage medium, wherein, when instructions stored therein are executed by a processor of an electronic device, the electronic device is caused to perform the method of any one of claims 1 to 4.
11. A computer program product comprising computer readable code which, when run on an electronic device, causes the electronic device to perform the method of any one of claims 1 to 4.
CN202110593819.3A 2021-05-28 2021-05-28 Information acquisition method, device, electronic equipment and storage medium Active CN113473124B (en)


Publications (2)

Publication Number Publication Date
CN113473124A CN113473124A (en) 2021-10-01
CN113473124B 2024-02-06

Family

ID=77871719


Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106033548A (en) * 2015-03-13 2016-10-19 中国科学院西安光学精密机械研究所 Crowd abnormity detection method based on improved dictionary learning
CN109711344A (en) * 2018-12-27 2019-05-03 东北大学 A kind of intelligentized specific exceptions behavioral value method in front end
CN110210383A (en) * 2019-05-31 2019-09-06 北京工业大学 A kind of basketball video Context event recognition methods of fusional movement mode and key visual information
CN110719487A (en) * 2018-07-13 2020-01-21 深圳地平线机器人科技有限公司 Video prediction method and device, electronic equipment and vehicle
CN111325137A (en) * 2020-02-18 2020-06-23 上海东普信息科技有限公司 Violence sorting detection method, device, equipment and storage medium
CN111400547A (en) * 2020-03-05 2020-07-10 西北工业大学 Human-computer cooperation video anomaly detection method
CN111402237A (en) * 2020-03-17 2020-07-10 山东大学 Video image anomaly detection method and system based on space-time cascade self-encoder
CN112016500A (en) * 2020-09-04 2020-12-01 山东大学 Group abnormal behavior identification method and system based on multi-scale time information fusion
CN112183153A (en) * 2019-07-01 2021-01-05 中国移动通信集团浙江有限公司 Object behavior detection method and device based on video analysis
CN112418032A (en) * 2020-11-11 2021-02-26 北京城市系统工程研究中心 Human behavior recognition method and device, electronic equipment and storage medium


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Research on Expressway Incident Detection Based on Deep Convolutional Networks; Sun Hao; Huang Zhangcan; Journal of Wuhan University of Technology (Information & Management Engineering Edition) (06); full text *
Crowd Abnormal Behavior Detection Based on Motion Foreground Effect Map Features; Zhang Junyang; Xie Weixin; Zhi Kelin; Signal Processing (03); full text *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant