WO2023142550A1 - Abnormal event detection method and apparatus, computer device, storage medium, computer program, and computer program product - Google Patents

Abnormal event detection method and apparatus, computer device, storage medium, computer program, and computer program product

Info

Publication number
WO2023142550A1
Authority
WO
WIPO (PCT)
Prior art keywords
feature, image, features, scale, convolution
Prior art date
Application number
PCT/CN2022/127087
Other languages
English (en)
French (fr)
Inventor
李国球
蔡官熊
曾星宇
赵瑞
Original Assignee
上海商汤智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 上海商汤智能科技有限公司
Publication of WO2023142550A1


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/25: Fusion techniques
    • G06F 18/253: Fusion techniques of extracted features

Definitions

  • The embodiments of the present disclosure are based on the Chinese patent application No. 202210103096.9, filed on January 27, 2022 and entitled "Abnormal event detection method and apparatus, computer device, storage medium", and claim priority to that Chinese patent application, the entire content of which is incorporated herein by reference.
  • The present disclosure relates to the field of computer vision, and in particular to an abnormal event detection method and apparatus, a computer device, a storage medium, a computer program, and a computer program product.
  • Video anomaly detection methods aim to capture abnormal events in videos and determine the time intervals in which they occur.
  • Abnormal events are behaviors that are unexpected and occur rarely. How to improve the accuracy of abnormal event detection has long attracted much attention.
  • the present disclosure provides an abnormal event detection method and device, computer equipment, storage media, computer programs, and computer program products.
  • An embodiment of the present disclosure provides an abnormal event detection method, including: acquiring at least two image sequences, where each image sequence includes at least one frame of image; dividing each image sequence at at least two scales to obtain image block sets composed of image blocks at the same position in all image frames at the same scale; determining correlation features between the image sequences based on the image block sets of each image sequence; and determining, according to the correlation features between the image sequences, a target image sequence in which an abnormal event exists among the at least two image sequences.
  • An embodiment of the present disclosure provides an abnormal event detection device, including: an acquisition module configured to acquire at least two image sequences, where each image sequence includes at least one frame of image; a division module configured to divide each image sequence at at least two scales to obtain image block sets composed of image blocks at the same position in all image frames at the same scale; a first determining module configured to determine correlation features between the image sequences based on the image block sets of each image sequence; and a second determining module configured to determine, according to the correlation features between the image sequences, a target image sequence in which an abnormal event exists among the at least two image sequences.
  • An embodiment of the present disclosure provides a computer device, including: a processor; and a memory for storing instructions executable by the processor, where the processor is configured to execute the abnormal event detection method described in the first aspect above.
  • An embodiment of the present disclosure provides a storage medium, where, when instructions in the storage medium are executed by a processor of a device, the device is enabled to execute the abnormal event detection method described in the first aspect above.
  • An embodiment of the present disclosure provides a computer program, which includes computer-readable code; when the computer-readable code is read and executed by a computer, some or all of the steps of the method in any embodiment of the present disclosure are implemented.
  • An embodiment of the present disclosure provides a computer program product, which includes a non-transitory computer-readable storage medium storing a computer program; when the computer program is read and executed by a computer, some or all of the steps of the method in any embodiment of the present disclosure are implemented.
  • Considering that some abnormal events occur in a very small region of an image frame while others may span the entire picture, the present disclosure performs multi-scale division on each frame of image in each image sequence, which can improve the scale robustness of abnormal event detection.
  • In addition, the present disclosure determines the correlation features between image sequences based on the image block sets of each image sequence, so that the abnormal event detection device can combine the correlation between image sequences on a multi-scale basis and improve the detection accuracy of abnormal events.
  • FIG. 1 is a first flowchart of an abnormal event detection method according to an embodiment of the present disclosure;
  • FIG. 2 is an example diagram of scale division according to an embodiment of the present disclosure;
  • FIG. 3 is a second flowchart of an abnormal event detection method according to an embodiment of the present disclosure;
  • FIG. 4 is an example diagram of the principle of obtaining the first feature based on the first splicing feature in an embodiment of the present disclosure;
  • FIG. 5 is an example diagram of the principle of feature fusion in an embodiment of the present disclosure;
  • FIG. 6 is a third flowchart of an abnormal event detection method in an embodiment of the present disclosure;
  • FIG. 7 is a fourth flowchart of an abnormal event detection method in an embodiment of the present disclosure;
  • FIG. 8A is a schematic diagram of an abnormal event detection method according to an embodiment of the present disclosure;
  • FIG. 8B is a schematic diagram of the processing process of some modules in FIG. 8A according to an embodiment of the present disclosure;
  • FIG. 9 is a diagram of an abnormal event detection device according to an embodiment of the present disclosure;
  • FIG. 10 is a schematic diagram of a hardware entity of a computer device in an embodiment of the present disclosure.
  • The abnormal event detection method provided by the embodiments of the present disclosure may be executed by an abnormal event detection device.
  • The abnormal event detection method may be executed by a terminal device, a server, or another electronic device, where the terminal device may be a user equipment (User Equipment, UE), a mobile device, a user terminal, a terminal, a cellular phone, a cordless phone, a personal digital assistant (PDA), a handheld device, a computing device, a vehicle-mounted device, a wearable device, etc.
  • In some possible implementations, the method for detecting an abnormal event may be implemented by a processor invoking computer-readable instructions stored in a memory.
  • In the embodiments of the present disclosure, the abnormal event detection device may include an image acquisition component, so that continuous frame images of a scene are captured by the image acquisition component and divided to obtain at least two image sequences.
  • For example, the image acquisition component is a camera that collects video at a fixed location; the abnormal event detection device including the camera can divide the video into at least two image sequences in the time dimension, where one image sequence may be called a video segment, and the image frames included in different video segments may not overlap.
  • Alternatively, the abnormal event detection device may not include an image acquisition component and may instead receive at least two already-divided image sequences; or multiple videos of the same scene may be collected by independently arranged cameras at different angles and then transmitted to the abnormal event detection device, in which case each video received by the abnormal event detection device may be called an image sequence.
  • an image sequence may be a sequence within a time window, that is, image frames in the image sequence are temporally adjacent.
  • the acquisition method of the image sequence and the content of at least one frame of image included in the image sequence can be determined according to actual needs and application scenarios, and are not limited in the embodiment of the present disclosure.
  • FIG. 1 is a first flowchart of an abnormal event detection method according to an embodiment of the present disclosure. As shown in FIG. 1, the abnormal event detection method includes the following steps:
  • S11: Acquire at least two image sequences, where each image sequence includes at least one frame of image;
  • S12: Divide each image sequence at at least two scales to obtain image block sets composed of image blocks at the same position in all image frames at the same scale;
  • S13: Determine correlation features between the image sequences based on the image block sets of each image sequence;
  • S14: Determine, according to the correlation features between the image sequences, a target image sequence in which an abnormal event exists among the at least two image sequences.
  • In the embodiments of the present disclosure, after the abnormal event detection device acquires at least two image sequences, dividing each image sequence at at least two scales means dividing each frame of image included in the image sequence at at least two scales.
  • After the multi-scale division, the image blocks at the same position in all image frames at the same scale form an image block set.
  • For example, the abnormal event detection device divides a video V into T non-overlapping image sequences; for each image sequence, each frame of image is divided using R sets of sliding windows of different sizes. As shown in FIG. 2, an image sequence is divided at three scales (R = 3), and the number of image blocks per frame after division is 1 (identified by L21), 6 (identified by L22), and 15 (identified by L23), respectively. In the embodiments of the present disclosure, the image blocks at the same position in the different image frames of an image sequence are taken as a whole to form one image block set, i.e., one small cube shown in FIG. 2.
  • The illustrated first scale thus includes one cube, corresponding to one image block set at that scale; the second scale includes 6 cubes, corresponding to 6 image block sets; and the third scale includes 15 cubes, corresponding to 15 image block sets.
  • In the embodiments of the present disclosure, the image block sets at the same scale can be expressed as a collection of N_r sets (the formula itself is given as an image in the source), where N_r is the number of image block sets corresponding to the scale. As shown in FIG. 2, N_r is 1 for the first scale, 6 for the second scale, and 15 for the third scale.
  • It should be noted that, when each frame of image in an image sequence is divided at a given scale, the image blocks corresponding to the same scale have the same size.
  • In addition, when a sliding window is used to split each frame of image into non-overlapping image blocks, the number of image blocks per frame at the corresponding scale may be the ratio of the frame size to the sliding window size, rounded down; that is, when the frame size is not evenly divisible by the window size, no extra image blocks are produced by padding (e.g., with "0"s or "1"s), and the content of every image block belongs to the original frame. A minimal sketch of this division follows.
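As an illustration only (not part of the disclosure), the following is a minimal sketch of the multi-scale, non-overlapping division, assuming frames are grayscale numpy arrays; the window sizes are illustrative placeholders. Per scale, it forms the image block sets that group the blocks at the same position across all frames.

```python
# Minimal sketch of multi-scale, non-overlapping patch division, assuming
# frames have shape (T_frames, H, W). Window sizes here are illustrative.
import numpy as np

def divide_into_patch_sets(frames: np.ndarray, window_sizes):
    """Return {window_size: array of shape (n_sets, T_frames, win, win)}.

    Each entry along n_sets is one "image block set": the blocks at the
    same spatial position taken from *all* frames of the sequence.
    """
    t, h, w = frames.shape
    patch_sets = {}
    for win in window_sizes:
        rows, cols = h // win, w // win  # ratio rounded down; no padding
        sets = []
        for i in range(rows):
            for j in range(cols):
                block = frames[:, i * win:(i + 1) * win, j * win:(j + 1) * win]
                sets.append(block)  # same position across all frames
        patch_sets[win] = np.stack(sets)  # (rows*cols, t, win, win)
    return patch_sets

# Toy usage: a 3-frame sequence of 60x60 frames divided at three scales.
frames = np.random.rand(3, 60, 60).astype(np.float32)
sets = divide_into_patch_sets(frames, window_sizes=(60, 30, 20))
print({k: v.shape[0] for k, v in sets.items()})  # {60: 1, 30: 4, 20: 9}
```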
  • In step S13, after the abnormal event detection device obtains the image block sets of each image sequence, it can obtain features that characterize each image sequence, and then obtain the correlation features between the image sequences based on those features.
  • In one embodiment, when obtaining features characterizing an image sequence from its image block sets, the features of the image blocks in the multiple image block sets of different scales may, for example, be directly spliced as the features of the image sequence. Following the example above, each image sequence includes 3 frames of images, and each frame is divided at 3 scales.
  • Assuming each image block corresponds to one feature, the number of features of the image sequence is the number of image blocks per frame after multi-scale division times the number of image frames, i.e., (1 + 6 + 15) × 3, a total of 66 features.
  • In this embodiment, if the at least two image sequences are obtained by dividing the same video in the time dimension, the correlation features between the image sequences obtained from the features of each image sequence may be called temporal correlation features.
  • In another embodiment, when obtaining features characterizing an image sequence from its image block sets, the correlation features between different image block sets of the same scale may be determined first, and the features of the image sequence then obtained from those correlation features; or, for each frame of image, the correlation features between the image blocks may be obtained first, and the features of the image sequence then obtained on that basis.
  • Since image blocks and image block sets have position attributes, such correlation can be characterized as spatial correlation.
  • In this case, if the at least two image sequences are obtained by dividing the same video in the time dimension, the correlation features between the image sequences obtained from the features of each image sequence may be called spatio-temporal correlation features.
  • If the at least two image sequences are image sequences of the same scene captured from different angles, the correlation features between the at least two image sequences can be understood as spatial correlation features.
  • If the correlation features between different image block sets of the same scale, or between multiple image blocks of one frame of image, are obtained first, and the correlation features of each image sequence are then obtained on that basis, the correlation features of the image sequences can be understood as features including both local spatial correlation and global spatial correlation.
  • The local spatial correlation is associated with the position attributes of the image blocks, and the global spatial correlation is associated with the acquisition-angle attribute of the image sequence.
  • In the embodiments of the present disclosure, the correlation features between image sequences are used to represent the correlation between the image sequences; for example, they may include the features of each image sequence weighted with different weights, where the distribution of weights reflects the relationships between different image sequences.
  • The correlation features of the image sequences may also include, for any image sequence, features into which some features of the other image sequences are fused, i.e., the correlation between image sequences is reflected through feature fusion. It should be noted that the present disclosure does not limit the manner of obtaining the correlation features.
  • For example, Φ_ST includes the features corresponding to the T image sequences, where the feature corresponding to each image sequence has been correlated with the features of the other image sequences.
  • In step S14, after the abnormal event detection device obtains the correlation features between the image sequences, it can use them, for example with traditional feature recognition methods or a trained model, to identify a target image sequence in which an abnormal event exists among the at least two image sequences.
  • As described above, the present disclosure performs multi-scale division on each frame of image in each image sequence, which can improve the scale robustness of abnormal event detection.
  • In addition, the present disclosure determines the correlation features between image sequences based on the image block sets of each image sequence, such as the aforementioned temporal or spatial correlation obtained through weight distribution, or correlation in both time and space, so that the abnormal event detection device can combine the correlation between image sequences on a multi-scale basis and improve the detection accuracy of abnormal events.
  • FIG. 3 is a second flowchart of an abnormal event detection method according to an embodiment of the present disclosure. As shown in FIG. 3, step S13 in FIG. 1 may include the following steps:
  • In step S13a, for each image sequence, a first feature corresponding to each scale is obtained based on the image block sets at that scale; after the image block sets corresponding to a scale are determined, the first feature including the correlation between the image block sets of the same scale can be obtained, e.g., the correlation features between the small cube blocks of the same scale shown in FIG. 2.
  • It can be understood that, since the image blocks in the image block sets have position attributes, each image block set also has a position attribute, so the obtained first feature is a feature including the spatial correlation between the image block sets.
  • With one first feature per scale, the abnormal event detection device obtains a total of R groups of first features.
  • In step S13b, the first features corresponding to the scales of the same image sequence are fused to obtain the second feature of each image sequence. If there are T image sequences and the second feature is denoted Φ′_t, the abnormal event detection device obtains T groups of Φ′_t.
  • In step S13c, the correlation features between the image sequences are determined based on the second feature of each image sequence. Since the first feature includes the spatial correlation features between image block sets, if the at least two image sequences are image sequences of different time periods of the same video, the correlation features between the image sequences obtained in this step may be spatio-temporal correlation features. In addition, similar to the foregoing analysis, if the at least two image sequences are image sequences of the same scene from different angles, the correlation features between the image sequences may also be features including local spatial correlation and global spatial correlation.
  • In the embodiments of the present disclosure, the image block set composed of image blocks at the same position in all frame images of the image sequence is used as a processing unit to obtain the first feature, without attending to each individual image block of each frame; the amount of calculation can therefore be relatively reduced when the correlation features between the image sequences are further obtained based on the first features. Moreover, the obtained correlation features between the image block sets include multi-dimensional correlation features, which can improve the accuracy of abnormal event detection.
  • In some embodiments, obtaining the first feature corresponding to a scale based on the image block sets at that scale includes:
  • taking each image block set as a whole to obtain the features of the image block set, and then splicing the features of the image block sets of the same scale to obtain the first splicing feature corresponding to the scale.
  • For example, if the feature corresponding to each image block set is D-dimensional, the dimension of the first splicing feature is the number of image block sets of the same scale times D, i.e., N_r × D.
  • In some embodiments, performing feature extraction on each image block set at the same scale to obtain the features corresponding to the image block set includes:
  • performing feature extraction on each image block set at the same scale to obtain, for each image block set, features that include the timing information between the image blocks in that set.
  • As described above, the image frames in an image sequence are temporally adjacent, i.e., there is timing information between the image frames, so there is also timing information between the image blocks in an image block set.
  • Therefore, features including the timing information between the image blocks in the image block set can be obtained.
  • For example, the present disclosure may use a preset I3D feature encoder to perform feature extraction on each image block set at the same scale, so as to obtain features including the timing information between the image blocks in the set. It can be understood that, since the I3D feature encoder has a deep network structure and uses 3-dimensional convolution kernels, and an image block set contains timing information, the 3-dimensional convolution kernels can capture that timing information, making feature extraction more complete; a sketch of this idea follows.
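The disclosure names a preset I3D feature encoder; the sketch below is a hedged stand-in that uses a single 3D convolution plus pooling to show how a 3-dimensional kernel spans the time axis of an image block set. It is not the I3D architecture, and the feature dimension D = 128 is an assumption.

```python
# Hedged stand-in for the preset I3D encoder: one Conv3d + pooling that
# turns an image block set (all frames at one position) into a D-dim
# feature. I3D itself is a deep Inception-style 3D network; this only
# illustrates how a 3D kernel mixes the time axis with space.
import torch
import torch.nn as nn

D = 128  # assumed feature dimension per image block set

encoder = nn.Sequential(
    nn.Conv3d(in_channels=1, out_channels=D, kernel_size=(3, 7, 7),
              padding=(1, 3, 3)),          # 3D kernel spans time and space
    nn.ReLU(),
    nn.AdaptiveAvgPool3d(1),               # pool time/height/width away
    nn.Flatten(),                          # -> (batch, D)
)

# One block set: 3 frames of a 30x30 patch, as a single-channel clip.
block_set = torch.randn(1, 1, 3, 30, 30)   # (batch, channel, T, H, W)
feature = encoder(block_set)
print(feature.shape)                       # torch.Size([1, 128])
```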
  • Based on the first splicing feature corresponding to a scale, the association relationship between the image block sets of the same scale represented by the first splicing feature can be constructed, so as to obtain the first feature corresponding to the scale.
  • The dimension of the first feature corresponding to the scale is the same as the dimension of the first splicing feature, i.e., it can also be N_r × D; the difference is that the first feature includes the correlation between the image block sets of the same scale.
  • For example, the present disclosure uses a self-attention mechanism and convolution processing to construct the association relationship between the image block sets of the same scale represented by the first splicing feature.
  • In this way, the obtained first feature can have a good enhancement effect; for example, it selectively highlights the parts of interest (i.e., the parts where anomalies may exist) in each image block set of the same scale, so as to further improve the detection of abnormal events.
  • In some embodiments, using the self-attention mechanism and convolution processing to construct the association relationship between the image block sets of the same scale represented by the first splicing feature, to obtain the first feature corresponding to the scale, includes:
  • determining a weight matrix based on the self-attention mechanism and the first splicing feature, where the weight matrix includes weight values representing the probability of an abnormality in each image block set of the same scale;
  • obtaining weighted features based on the weight matrix and the first splicing feature, and performing convolution processing on the first splicing feature to obtain convolutional features;
  • obtaining the first features based on the weighted features, the convolutional features, and the first splicing features.
  • In the embodiments of the present disclosure, the weight matrix is first determined based on the self-attention mechanism.
  • A weight value in the weight matrix represents the probability of an abnormality in an image block set of the same scale; a larger weight value means a greater probability that an abnormality exists in that image block set.
  • Convolution processing is also performed on the first splicing feature; for example, non-dilated convolution or dilated convolution is used to process the first splicing feature.
  • Because the first splicing feature includes the features of each image block set of the same scale, the convolution operation of the convolution kernel can also associate the features of multiple image block sets.
  • In some embodiments, performing convolution processing on the first splicing feature to obtain the convolutional features includes:
  • using at least two dilated convolution kernels to separately convolve the first splicing feature to obtain the convolution result corresponding to each dilated convolution kernel, where the at least two dilated convolution kernels have different dilation rates;
  • splicing the convolution results corresponding to the dilated convolution kernels to obtain the convolutional features.
  • For example, the first splicing feature is processed by dilated convolution with three dilated convolution kernels, each being a one-dimensional convolution kernel, with dilation rates of 1, 2, and 4, respectively. If the dimension of the first splicing feature is N_r × D, then after processing with the three dilated convolution kernels, the convolution result corresponding to each kernel can have dimension N_r × D/4, and the convolutional feature obtained by splicing the three convolution results is N_r × 3D/4.
  • The results after convolution can be denoted {DC1, DC2, DC3}, where DC1, DC2, and DC3 are the convolution results corresponding to the three dilated convolution kernels.
  • It should be noted that the present disclosure is not limited to the above three one-dimensional dilated convolution kernels; because the final weighted feature, the convolutional feature, and the first splicing feature need to cooperate to form the first feature, the number, size, and dilation rates of the dilated convolution kernels can be set according to actual needs. An illustrative sketch of this branch follows.
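A minimal sketch of the dilated-convolution branch, assuming kernel size 3 (not stated in the source) together with the stated dilation rates 1, 2, and 4 and the N_r × D to N_r × 3D/4 dimensions:

```python
# Three 1D kernels with dilation rates 1, 2 and 4 each map the first
# splicing feature (N_r x D) to N_r x D/4; concatenation gives N_r x 3D/4.
# Kernel size 3 is an assumption.
import torch
import torch.nn as nn

N_r, D = 6, 128
x = torch.randn(1, D, N_r)  # first splicing feature as (batch, channels, N_r)

branches = nn.ModuleList([
    nn.Conv1d(D, D // 4, kernel_size=3, dilation=d, padding=d)
    for d in (1, 2, 4)      # padding = dilation keeps the length N_r
])

dc1, dc2, dc3 = (conv(x) for conv in branches)   # each (1, D/4, N_r)
conv_feature = torch.cat([dc1, dc2, dc3], dim=1) # (1, 3D/4, N_r)
print(conv_feature.shape)                        # torch.Size([1, 96, 6])
```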
  • In some embodiments, determining the weight matrix based on the self-attention mechanism and the first splicing feature includes:
  • using the self-attention mechanism to determine the weight matrix from the product of the first convolution result and the transpose of the second convolution result.
  • In the embodiments of the present disclosure, dimensionality reduction is first performed on the first splicing feature to reduce the amount of subsequent calculation.
  • For example, dimensionality reduction may be performed through a one-dimensional convolution kernel.
  • The first splicing feature after dimensionality reduction has dimension N_r × D/4.
  • It should be noted that the present disclosure is not limited to reducing the feature dimension of each image block set to 1/4 of the original feature dimension.
  • The self-attention mechanism here predicts the covariance between any image block set and the other image block sets of the same scale: each image block set is regarded as a random variable, and each weight in the obtained weight matrix is the correlation of one image block set with all image block sets.
  • The preset first convolution kernel and the preset second convolution kernel can both be one-dimensional convolution kernels; after they convolve the dimension-reduced first splicing feature, the obtained first convolution result and second convolution result may both be one-dimensional vectors.
  • The attention map obtained by applying the normalized exponential function (softmax) of the self-attention mechanism to the product of the first convolution result and the transpose of the second convolution result is the weight matrix, which is essentially a covariance matrix.
  • For example, if the dimension of the first convolution result is N_r × D/4 and the dimension of the transposed second convolution result is D/4 × N_r, the dimension of the weight matrix is N_r × N_r.
  • In some embodiments, obtaining the weighted features based on the weight matrix and the first splicing feature includes:
  • determining the weighted feature as the sum of (a) the result of convolving the weighting matrix with the preset fourth convolution kernel and (b) the dimension-reduced first splicing feature.
  • The preset third convolution kernel and the preset fourth convolution kernel may also be one-dimensional convolution kernels.
  • The dimension-reduced first splicing feature is convolved with the preset third convolution kernel to obtain a third convolution result, which is multiplied by the weight matrix.
  • Each item of the resulting weighting matrix is a weighted sum over the image block sets in the dimension-reduced first splicing feature, with the weights taken from the weight matrix.
  • For example, the dimension of the third convolution result may be N_r × D/4, the dimension of the weighting matrix may be N_r × D/4, and the dimension of the weighted feature may be N_r × D/4.
  • The weighting matrix is convolved with the preset fourth convolution kernel and summed with the dimension-reduced first splicing feature, i.e., a residual connection is performed; the weighted feature thus obtained represents each image block set more strongly.
  • In the corresponding formulas (1) and (2) (given as images in the source), W_θ is the preset first convolution kernel, W_g is the preset third convolution kernel, and W_z is the preset fourth convolution kernel; the softmax output is the weight matrix, its product with the third convolution result is the weighting matrix, and the final result is the weighted feature. A sketch of one reading of these formulas follows.
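A hedged sketch of one reading of formulas (1) and (2), with 1×1 one-dimensional convolutions standing in for W_θ, the second (key) kernel, W_g, and W_z; the kernel sizes and exact projection shapes are assumptions:

```python
# Assumed reading of formulas (1)-(2): the weight matrix is softmax over
# the product of two learned projections of the dimension-reduced first
# splicing feature, and a residual connection adds that feature back.
import torch
import torch.nn as nn
import torch.nn.functional as F

N_r, D = 6, 128
d = D // 4
x = torch.randn(1, d, N_r)          # dimension-reduced first splicing feature

w_theta = nn.Conv1d(d, d, kernel_size=1)  # preset first convolution kernel
w_phi   = nn.Conv1d(d, d, kernel_size=1)  # preset second convolution kernel
w_g     = nn.Conv1d(d, d, kernel_size=1)  # preset third convolution kernel
w_z     = nn.Conv1d(d, d, kernel_size=1)  # preset fourth convolution kernel

q = w_theta(x).transpose(1, 2)            # first conv result, (1, N_r, d)
k = w_phi(x).transpose(1, 2)              # second conv result, (1, N_r, d)
M = F.softmax(q @ k.transpose(1, 2), dim=-1)   # weight matrix, (1, N_r, N_r)

g = w_g(x).transpose(1, 2)                # third conv result, (1, N_r, d)
weighting = M @ g                         # weighting matrix, (1, N_r, d)
weighted = w_z(weighting.transpose(1, 2)) + x  # residual connection
print(M.shape, weighted.shape)            # (1, 6, 6) and (1, 32, 6)
```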
  • In some embodiments, obtaining the first feature based on the weighted feature, the convolutional feature, and the first splicing feature includes: splicing the weighted feature with the convolutional feature and then adding the first splicing feature to obtain the first feature, which can be represented by formula (3) (the formula itself is given as an image in the source).
  • FIG. 4 is an example diagram of the principle of obtaining the first feature based on the first splicing feature in an embodiment of the present disclosure.
  • The branch identified by L41 on the right is the process of determining the weight matrix M based on the self-attention mechanism and the first splicing feature, and then obtaining the weighted feature based on the weight matrix M and the first splicing feature.
  • The branch identified by L42 on the left is the process of performing dilated convolution on the first splicing feature to obtain the convolutional feature of dimension N_r × 3D/4. The weighted feature is spliced with the convolutional feature of dimension N_r × 3D/4 and then added to the first splicing feature, yielding the first feature shown in FIG. 4; a short sketch of this combination follows.
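A short sketch of the combination step (formula (3) / FIG. 4), with random tensors standing in for the outputs of the two branches sketched above:

```python
# Assumed assembly of formula (3): concatenate the attention-branch output
# (N_r x D/4) with the dilated-branch output (N_r x 3D/4), then add back
# the first splicing feature (N_r x D).
import torch

N_r, D = 6, 128
x_full = torch.randn(1, D, N_r)           # first splicing feature
f_att  = torch.randn(1, D // 4, N_r)      # attention-branch output (stand-in)
f_dc   = torch.randn(1, 3 * D // 4, N_r)  # dilated-branch output (stand-in)

first_feature = torch.cat([f_att, f_dc], dim=1) + x_full  # (1, D, N_r)
print(first_feature.shape)                # torch.Size([1, 128, 6])
```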
  • In some embodiments, fusing the first features corresponding to each scale of the same image sequence to obtain the second feature of each image sequence includes:
  • reconstructing the first features of the same scale according to the positional relationship of each image block set to obtain the reconstructed feature corresponding to the scale; convolving the reconstructed feature with the preset fifth convolution kernel and converting it into a one-dimensional feature vector through a fully connected layer; and accumulating the one-dimensional feature vectors of each scale to obtain the second feature of each image sequence.
  • Since the first feature corresponding to a scale is obtained after splicing the image block sets of the same scale, and has the same dimension as the first splicing feature, the first feature can be understood as the result of horizontally splicing the correlation features of the image block sets at that scale.
  • Because the image blocks in the image block sets have position attributes, the present disclosure can reconstruct the first feature according to the positional relationship of the image blocks in each image block set, to obtain the reconstructed feature corresponding to the scale.
  • It can be understood that the reconstructed feature is a three-dimensional tensor, in which each element represents an image block set and the feature dimension is D.
  • The reconstructed feature is converted into a one-dimensional feature vector through the preset fifth convolution kernel and a fully connected layer, where the preset fifth convolution kernel can be a two-dimensional convolution kernel used to perform dimension-reducing convolution on the reconstructed feature.
  • The feature after the two-dimensional convolution is transformed by the fully connected layer into a one-dimensional feature vector whose feature dimension can be D. It can be understood that the one-dimensional feature vector is a feature representing the image block sets of one scale.
  • Since the second feature of the image sequence is obtained by accumulating the one-dimensional feature vectors of each scale, it can be understood that the second feature of the image sequence is a fusion of multi-scale features; a sketch of this fusion follows.
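A hedged sketch of the multi-scale fusion, using the grid shapes of FIG. 2; the average pooling before the fully connected layer is an assumption made here so that one shared layer can serve grids of different sizes:

```python
# Per scale: reshape the first feature (N_r x D) onto the rows x cols patch
# grid, reduce with a 2D convolution (the "preset fifth convolution kernel")
# and a fully connected layer to one D-dim vector, then sum over scales.
import torch
import torch.nn as nn

D = 128
grids = {1: (1, 1), 6: (2, 3), 15: (3, 5)}    # N_r -> (rows, cols), cf. FIG. 2

conv2d = nn.Conv2d(D, D // 4, kernel_size=1)  # stand-in fifth conv kernel
pool = nn.AdaptiveAvgPool2d(1)                # assumption: lets one FC fit all
fc = nn.Linear(D // 4, D)                     # fully connected layer -> D dims

second_feature = torch.zeros(1, D)
for n_r, (rows, cols) in grids.items():
    first_feature = torch.randn(1, D, n_r)          # stand-in per-scale input
    grid = first_feature.view(1, D, rows, cols)     # reconstructed feature
    reduced = pool(conv2d(grid)).flatten(start_dim=1)  # (1, D/4)
    second_feature = second_feature + fc(reduced)   # accumulate over scales
print(second_feature.shape)                         # torch.Size([1, 128])
```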
  • FIG. 5 is an example diagram of the principle of feature fusion in an embodiment of the present disclosure, illustrated with the first feature corresponding to one scale. As shown in FIG. 5, the dotted-line box L51a shows the first feature corresponding to a scale, which includes the correlation between image block sets of the same scale.
  • The cube L52a in the illustration represents the reconstructed feature obtained after reconstructing the first feature according to the positional relationship of the image blocks in each image block set. After passing through the two-dimensional convolutional layer L53a and the fully connected layer L54a, the reconstructed feature is converted into a one-dimensional feature vector.
  • Each first feature corresponds to one reconstructed feature, and each reconstructed feature is transformed into a one-dimensional feature vector through a two-dimensional convolutional layer and a fully connected layer; the vectors are then accumulated to obtain L50, i.e., the second feature corresponding to the image sequence.
  • the preset fifth convolution kernel of the present disclosure may be included in the two-dimensional convolution layer.
  • L53a, L53b, and L53c shown in FIG. 5 may be the same two-dimensional convolutional layer, and L54a, L54b, and L54c may also be the same fully connected layer, which is not limited by this embodiment of the present disclosure.
  • In this way, the abnormal event detection device can have a local-to-global perception of the image frames in the image sequence, thereby improving robustness when detecting abnormal events of different scales.
  • the determining the correlation feature between each of the image sequences based on the second feature of each of the image sequences includes:
  • In the embodiments of the present disclosure, the correlation features between the image sequences may be determined in the same manner as the correlation features between image block sets of the same scale, i.e., in the manner used to obtain the first feature corresponding to a scale.
  • For example, the second features of the image sequences can be spliced, e.g., horizontally, to obtain the second splicing feature; then, following the principle of FIG. 4, a weight matrix of the image sequences is determined based on the self-attention mechanism and the second splicing feature, where the weight matrix includes weight values representing the probability of an abnormality in each image sequence. Subsequently, the weighted features corresponding to all image sequences are obtained based on the weight matrix of the image sequences and the second splicing feature.
  • Before this, dimensionality reduction may first be performed on the second splicing feature, for example by one-dimensional convolution.
  • Convolution processing is also performed on the second splicing feature to obtain the convolutional features corresponding to all image sequences; the correlation features between the image sequences are then determined based on the weighted features corresponding to all image sequences, the convolutional features corresponding to all image sequences, and the second splicing feature.
  • For T image sequences there are T groups of second features Φ′_t, and the second features of the image sequences are spliced to obtain the second splicing feature.
  • In the corresponding formulas, W_θ, W_g, and W_z are as described for formulas (1) and (2); the softmax part yields the weight matrix of the image sequences; its product with the corresponding convolution result is the weighting matrix for all image sequences, from which the weighted features of all image sequences are obtained; the dilated-convolution output is the convolutional feature corresponding to all image sequences; and Φ_ST is used to represent the correlation features between the image sequences.
  • The dimension of Φ_ST may be the number of image sequences times the feature dimension of each image sequence, i.e., T × D.
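Since the video-level relation reuses the attention pattern of formulas (1) and (2), with the T second features in place of the N_r image block sets, a compact sketch (omitting the dimensionality reduction and the dilated branch for brevity) might look like:

```python
# Same self-attention pattern applied across the T image sequences to
# produce a T x D correlation feature Phi_ST from the second splicing
# feature; all kernel choices are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

T, D = 8, 128
phi = torch.randn(1, D, T)                 # second splicing feature

w_theta, w_key, w_g, w_z = (nn.Conv1d(D, D, kernel_size=1) for _ in range(4))

q = w_theta(phi).transpose(1, 2)           # (1, T, D)
k = w_key(phi).transpose(1, 2)             # (1, T, D)
M = F.softmax(q @ k.transpose(1, 2), dim=-1)   # (1, T, T) sequence weights
phi_st = w_z((M @ w_g(phi).transpose(1, 2)).transpose(1, 2)) + phi
print(phi_st.shape)                        # torch.Size([1, 128, 8]), i.e. T x D
```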
  • FIG. 6 is a third flowchart of an abnormal event detection method in an embodiment of the present disclosure. As shown in FIG. 6, step S14 in FIG. 1 may include the following steps:
  • S14a: Detect the correlation features between the image sequences based on a preset abnormality prediction model to obtain a prediction result for each image sequence, where the preset abnormality prediction model is obtained by weakly supervised training;
  • S14b: Determine the target image sequence in which the abnormal event exists according to the prediction result of each image sequence.
  • In the embodiments of the present disclosure, a target image sequence with an abnormal event can be determined from the at least two image sequences by using a traditional feature recognition method or a trained model, according to the correlation features between the image sequences.
  • For example, a preset anomaly detection model obtained through weakly supervised training is used.
  • During training, the loss function is used to estimate the degree of inconsistency between the predicted value of the model and the real value; usually, the smaller the value of the loss function, the better the robustness of the model.
  • The parameters of the model can be adjusted through the constraint of the loss function to train a better model.
  • For the training samples, features are obtained according to the descriptions in the foregoing figures, and an initial model is trained under the constraint of the loss function to obtain a better model for detection.
  • the initial model is, for example, a convolutional neural network (Convolutional Neural Networks, CNN) model, a deep neural network (Deep Neural Networks, DNN) model, etc., which are not limited here.
  • In some embodiments, the method also includes:
  • for the positive samples and negative samples in the training sample set, respectively selecting the K sample image sequences with the largest feature gradients and calculating the average feature gradient, where K is a positive integer greater than 1;
  • constructing a loss function based on the average feature gradient corresponding to the positive samples and the average feature gradient corresponding to the negative samples;
  • training, based on the loss function, to obtain the preset abnormality prediction model.
  • the training sample set includes positive samples and negative samples, wherein a positive sample refers to a sample that does not have abnormal events in the image sequence included in the sample, and a negative sample refers to a sample that contains an abnormal event in the image sequence included in the sample.
  • a sample can be a video, and the video is divided into different image sequences.
  • a video corresponds to a label, but the image sequence has no label.
  • That is, each video can be compared to a "package" and each image sequence to an "instance": the "package" has a label, but the "instance" has no label.
  • In the embodiments of the present disclosure, for the positive and negative samples, the K sample image sequences with the largest feature gradients are respectively selected to calculate average feature gradients, and the loss function is then constructed based on the average feature gradient corresponding to the positive samples and that corresponding to the negative samples.
  • For example, the ranking loss is calculated according to formula (8) (the formula itself is given as an image in the source), where:
  • g(Φ′_ST+) is the average feature gradient of the top-K image sequences in the normal video;
  • g(Φ′_ST−) is the average feature gradient of the top-K image sequences in the abnormal video;
  • s represents the predicted abnormal score, and y represents the label corresponding to the video, where the label value of an abnormal video is 1 and the label value of a normal video is 0;
  • λ_fm, λ_1, and λ_2 are factors used to balance the various losses; the λ_1 term represents a sparsity constraint and the λ_2 term represents a temporal smoothing constraint. A hypothetical sketch of such an objective follows.
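The exact forms of formula (8) and its constraints are given only as images in the source, so the following is one plausible, explicitly hypothetical reading: the "feature gradient" is taken as the feature difference between consecutive segments, the ranking term is a hinge on the top-K average gradients of abnormal versus normal videos, and the λ_1 / λ_2 terms penalize dense and jittery scores. None of these exact forms is confirmed by the source text.

```python
# Hypothetical reconstruction of the weakly supervised objective; every
# concrete choice below (gradient definition, hinge margin, lambdas) is an
# assumption, not the disclosure's formula (8).
import torch

def top_k_avg_feature_gradient(features: torch.Tensor, k: int) -> torch.Tensor:
    """features: (T, D) segment features of one video."""
    grads = (features[1:] - features[:-1]).norm(dim=1)  # assumed "gradient"
    return grads.topk(min(k, grads.numel())).values.mean()

def weak_supervision_loss(feats_pos, feats_neg, s_neg, k=3,
                          margin=1.0, lam_fm=1.0, lam1=8e-3, lam2=8e-4):
    g_pos = top_k_avg_feature_gradient(feats_pos, k)   # normal video (y=0)
    g_neg = top_k_avg_feature_gradient(feats_neg, k)   # abnormal video (y=1)
    rank = torch.clamp(margin - g_neg + g_pos, min=0)  # hinge ranking term
    sparsity = s_neg.abs().sum()                       # scores mostly zero
    smooth = ((s_neg[1:] - s_neg[:-1]) ** 2).sum()     # scores vary slowly
    return lam_fm * rank + lam1 * sparsity + lam2 * smooth

loss = weak_supervision_loss(torch.randn(8, 128), torch.randn(8, 128),
                             torch.rand(8))
print(loss.item())
```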
  • In this way, a loss function can be constructed according to the above steps to train the preset anomaly detection model. After the correlation features Φ_ST between the image sequences are input into the preset anomaly detection model, the prediction result of each image sequence can be obtained; for example, the prediction result is a prediction score.
  • Each prediction score is compared with a preset score threshold; for example, an image sequence whose prediction score is greater than the preset score threshold is determined as a target image sequence in which an abnormal event exists.
  • The present disclosure uses the abnormal event detection model obtained by the weakly supervised training method to process the correlation features of the image sequences to determine the target image sequence with abnormal events.
  • In this way, the generalization ability of the preset abnormal event detection model is better; in addition, compared with a model obtained through unsupervised training, since the weakly supervised training method is guided by training labels, the accuracy of abnormal event detection is better.
  • FIG. 7 is a fourth flowchart of an abnormal event detection method in an embodiment of the present disclosure. As shown in FIG. 7, step S11 in FIG. 1 may include the following steps: acquiring the video to be detected; determining the difference values between adjacent frame images in the video to be detected; and, among adjacent frame images, determining the image frame earlier in time as the last frame of one image sequence and the image frame later in time as the first frame of the adjacent image sequence.
  • Here, the at least two image sequences come from the same video, i.e., the video to be detected.
  • The disclosure uses clustering on the difference values between adjacent frame images in the video to be detected, and groups image frames with similar content into one image sequence (a minimal sketch of this splitting is given below); in this way, the content of the image sequences is not repeated, the difference between different image sequences is increased, and the accuracy of anomaly localization can be improved.
  • For example, the difference value may be determined by taking the difference between two adjacent frame images, but the present disclosure is not limited to this method.
  • It should be noted that the manner in which the abnormal event detection device acquires the at least two image sequences is not limited to the manner in this embodiment; for example, the video may also be divided into image sequences of equal duration based on time, which will not be detailed here.
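A hedged sketch of the frame-difference splitting, with a simple threshold standing in for the clustering step described above:

```python
# Split a video into image sequences where adjacent frames differ strongly.
# The mean absolute pixel difference and the fixed threshold are stand-ins
# for the disclosure's clustering on difference values.
import numpy as np

def split_by_frame_difference(frames: np.ndarray, threshold: float):
    """frames: (T, H, W). Returns a list of (start, end) index ranges."""
    diffs = np.abs(np.diff(frames.astype(np.float32), axis=0)).mean(axis=(1, 2))
    sequences, start = [], 0
    for i, d in enumerate(diffs):
        if d > threshold:          # earlier frame ends one sequence,
            sequences.append((start, i + 1))
            start = i + 1          # later frame starts the next one
    sequences.append((start, len(frames)))
    return sequences

video = np.random.rand(10, 60, 60)
video[5:] += 5.0                   # synthetic content change at frame 5
print(split_by_frame_difference(video, threshold=1.0))  # [(0, 5), (5, 10)]
```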
  • FIG. 8A is a schematic diagram of an abnormal event detection method shown in an embodiment of the present disclosure
  • FIG. 8B is a schematic diagram of a processing process of some modules in FIG. 8A shown in an embodiment of the present disclosure.
  • The video segment identified by L81 in FIG. 8A is an image sequence; three image sequences are shown in total.
  • Multi-scale division is performed on each image sequence, and each patch obtained is an image block set as described in this disclosure.
  • Patch spatial relationship modeling can then be performed by the module identified by L84.
  • Each scale obtained can include multiple image block sets. The image block sets corresponding to a scale pass through the pre-trained feature encoder identified by L83 to obtain the first splicing feature corresponding to that scale; the first splicing feature is then modeled by the patch spatial relationship module identified by L84 to obtain the correlation between the image block sets of the same scale, i.e., the first feature corresponding to the scale, as shown in FIG. 8B.
  • The patch aggregation module identified by L85 can be used to fuse the first features of the different scales of the same image sequence to obtain the second feature corresponding to the image sequence, i.e., one of the T feature segments shown at L86 in FIG. 8A.
  • The second features of all image sequences, i.e., the T feature segments shown at L86, are input into the video temporal relationship module identified by L87 to obtain the spatio-temporally modeled features, i.e., the correlation features between the image sequences described in this disclosure.
  • The correlation features are input into the pre-trained classifier identified by L88 to obtain the prediction score of each image sequence, and based on the prediction score of each image sequence it can be determined whether an abnormal event exists in the image sequence.
  • The pre-trained classifier can be obtained by the weakly supervised training method: the loss function of the model is constructed from the video-level labels of the training samples and the prediction scores of the training samples, and the model parameters are fixed when the loss meets the convergence condition, yielding the trained classifier.
  • FIG. 9 is a diagram of an abnormal event detection device according to an embodiment of the present disclosure.
  • an abnormal event detection device 900 includes:
  • the acquisition module 901 is configured to acquire at least two image sequences; wherein, each of the image sequences includes at least one frame of image;
  • the division module 902 is configured to divide each image sequence into at least two scales to obtain an image block set composed of image blocks at the same position in all image frames under the same scale;
  • the first determination module 903 is configured to determine the correlation feature between each of the image sequences based on the image block set of each of the image sequences;
  • the second determination module 904 is configured to determine a target image sequence in which an abnormal event exists in the at least two image sequences according to the correlation characteristics between the image sequences.
  • In some embodiments, the first determining module 903 is configured to, for each image sequence, obtain a first feature corresponding to each scale based on the image block sets at the same scale, where the first feature includes the correlation between the image block sets of the same scale; fuse the first features corresponding to each scale of the same image sequence to obtain the second feature of each image sequence; and determine the correlation features between the image sequences based on the second feature of each image sequence.
  • In some embodiments, the first determination module 903 is configured to perform feature extraction on each image block set at the same scale to obtain the features corresponding to the image block sets; splice the features of the image block sets of the same scale to obtain the first splicing feature corresponding to the scale; and, based on the first splicing feature corresponding to the scale, construct the correlation between the image block sets of the same scale represented by the first splicing feature by using the self-attention mechanism and convolution processing, to obtain the first feature corresponding to the scale.
  • In some embodiments, the first determination module 903 is configured to determine a weight matrix based on the self-attention mechanism and the first splicing feature, where the weight matrix includes weight values representing the probability of an abnormality in each image block set of the same scale; obtain the weighted features based on the weight matrix and the first splicing feature; perform convolution processing on the first splicing feature to obtain the convolutional features; and obtain the first features based on the weighted features, the convolutional features, and the first splicing features.
  • In some embodiments, the first determining module 903 is configured to perform dimensionality reduction on the first splicing feature to obtain the dimension-reduced first splicing feature; convolve the dimension-reduced first splicing feature with the preset first convolution kernel to obtain a first convolution result; convolve the dimension-reduced first splicing feature with the preset second convolution kernel to obtain a second convolution result; and use the self-attention mechanism to determine the weight matrix from the product of the first convolution result and the transpose of the second convolution result.
  • In some embodiments, the first determination module 903 is configured to convolve the dimension-reduced first splicing feature with a preset third convolution kernel to obtain a third convolution result; multiply the weight matrix by the third convolution result to obtain a weighting matrix; and determine, as the weighted features, the sum of the result of convolving the weighting matrix with the preset fourth convolution kernel and the dimension-reduced first splicing feature.
  • In some embodiments, the first determination module 903 is configured to use at least two dilated convolution kernels to separately convolve the first splicing feature to obtain the convolution result corresponding to each dilated convolution kernel, where the at least two dilated convolution kernels have different dilation rates; and splice the convolution results corresponding to the dilated convolution kernels to obtain the convolutional features.
  • In some embodiments, the first determination module 903 is configured to splice the weighted feature with the convolutional feature and then add the first splicing feature to obtain the first feature.
  • In some embodiments, the first determination module 903 is configured to perform feature extraction on each image block set at the same scale to obtain, for each image block set, features that include the timing information between the image blocks in that set.
  • In some embodiments, the first determination module 903 is configured to reconstruct the first features of the same scale according to the positional relationship of each image block set to obtain the reconstructed feature corresponding to the scale; convolve the reconstructed feature corresponding to the scale with the preset fifth convolution kernel and then convert it into a one-dimensional feature vector through a fully connected layer; and accumulate the one-dimensional feature vectors of each scale to obtain the second feature of each image sequence.
  • In some embodiments, the first determination module 903 is configured to splice the second features of the image sequences to obtain a second splicing feature, and, based on the second splicing feature, construct the correlation between the different image sequences represented by the second splicing feature using the self-attention mechanism and convolution processing, to determine the correlation features between the image sequences.
  • In some embodiments, the second determining module 904 is configured to detect the correlation features between the image sequences based on a preset abnormality prediction model to obtain the prediction result of each image sequence, where the preset abnormality prediction model is obtained by the weakly supervised training method; and determine, according to the prediction results of the image sequences, the target image sequence in which the abnormal event exists.
  • In some embodiments, the device further includes: a calculation module 905 configured to, for the positive samples and negative samples in the training sample set, respectively select the K sample image sequences with the largest feature gradients and calculate the average feature gradient, where K is a positive integer greater than 1; a construction module 906 configured to construct a loss function based on the average feature gradient corresponding to the positive samples and the average feature gradient corresponding to the negative samples; and a training module 907 configured to train, based on the loss function, the preset abnormality prediction model.
  • In some embodiments, the acquisition module 901 is configured to acquire the video to be detected; determine the difference values between adjacent frame images in the video to be detected; and, among adjacent frame images, determine the image frame earlier in time as the last frame of one image sequence and the image frame later in time as the first frame of the adjacent image sequence.
  • an embodiment of the present disclosure provides a computer device, including a memory and a processor, the memory stores a computer program that can run on the processor, and the processor implements the steps in the above method when executing the program.
  • an embodiment of the present disclosure provides a computer-readable storage medium on which a computer program is stored, and when the computer program is executed by a processor, the steps in the above method are implemented.
  • the computer readable storage medium may be transitory or non-transitory.
  • An embodiment of the present disclosure provides a computer program product, which includes a non-transitory computer-readable storage medium storing a computer program; when the computer program is read and executed by a computer, some or all of the steps of the above method can be implemented.
  • the computer program product can be realized by hardware, software or a combination thereof.
  • In one optional embodiment, the computer program product is embodied as a computer storage medium; in another optional embodiment, the computer program product is embodied as a software product, such as a software development kit (Software Development Kit, SDK).
  • FIG. 10 is a schematic diagram of a hardware entity of a computer device in an embodiment of the present disclosure.
  • the hardware entity of the computer device 1000 includes: a processor 1001, a communication interface 1002, and a memory 1003, wherein:
  • Processor 1001 generally controls the overall operation of computer device 1000 .
  • the communication interface 1002 enables the computer device to communicate with other terminals or servers through the network.
  • The memory 1003 is configured to store instructions and applications executable by the processor 1001, and can also cache data to be processed or already processed by the processor 1001 and by modules in the computer device 1000 (e.g., image data, audio data, voice communication data, and video communication data); it can be implemented by flash memory (FLASH) or random access memory (Random Access Memory, RAM). Data can be transmitted among the processor 1001, the communication interface 1002, and the memory 1003 through the bus 1004.
  • the disclosed devices and methods may be implemented in other ways.
  • The device embodiments described above are merely schematic; for example, the division of the units is only a logical function division.
  • The coupling, direct coupling, or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection of devices or units may be in electrical, mechanical, or other forms.
  • The units described above as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • Each functional unit in the embodiments of the present disclosure may be integrated into one processing unit, or each unit may serve as a single unit, or two or more units may be integrated into one unit; the integrated unit can be realized in the form of hardware or in the form of hardware plus software functional units.
  • If the above integrated units of the present disclosure are implemented in the form of software function modules and sold or used as independent products, they may also be stored in a computer-readable storage medium. Based on this understanding, the computer software product is stored in a storage medium and includes several instructions to make a computer device (which may be a personal computer, a server, or a network device, etc.) execute all or part of the methods described in the embodiments of the present disclosure.
  • a computer readable storage medium may be a tangible device that can hold and store instructions for use by an instruction execution device, and may be a volatile storage medium or a nonvolatile storage medium.
  • A computer-readable storage medium may be, but is not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. Examples (a non-exhaustive list) of computer-readable storage media include: portable computer disks, hard drives, random access memory (RAM), read-only memory (Read Only Memory, ROM), erasable programmable read-only memory, memory sticks, floppy disks, mechanically encoded devices such as punched cards or raised structures in grooves with instructions stored thereon, and any suitable combination of the foregoing.
  • Computer-readable storage media as used herein is not to be interpreted as transient signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., pulses of light through fiber optic cables), or transmitted electrical signals.

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure relates to an abnormal event detection method and apparatus, a computer device, a storage medium, a computer program, and a computer program product. The method includes: acquiring at least two image sequences, where each image sequence includes at least one frame of image; dividing each image sequence at at least two scales to obtain image block sets composed of image blocks at the same position in all image frames at the same scale; determining correlation features between the image sequences based on the image block sets of each image sequence; and determining, according to the correlation features between the image sequences, a target image sequence in which an abnormal event exists among the at least two image sequences. This method can improve the accuracy of abnormal event detection.

Description

Abnormal event detection method and apparatus, computer device, storage medium, computer program, and computer program product
Cross-Reference to Related Applications
The embodiments of the present disclosure are based on the Chinese patent application No. 202210103096.9, filed on January 27, 2022 and entitled "Abnormal event detection method and apparatus, computer device, storage medium", and claim priority to that Chinese patent application, the entire content of which is incorporated herein by reference.
Technical Field
The present disclosure relates to the field of computer vision, and in particular to an abnormal event detection method and apparatus, a computer device, a storage medium, a computer program, and a computer program product.
Background
Video anomaly detection methods aim to capture abnormal events in videos and determine the time intervals in which they occur; abnormal events are behaviors that are unexpected and occur rarely. How to improve the accuracy of abnormal event detection has long attracted much attention.
Summary
The present disclosure provides an abnormal event detection method and apparatus, a computer device, a storage medium, a computer program, and a computer program product.
An embodiment of the present disclosure provides an abnormal event detection method, including: acquiring at least two image sequences, where each image sequence includes at least one frame of image; dividing each image sequence at at least two scales to obtain image block sets composed of image blocks at the same position in all image frames at the same scale; determining correlation features between the image sequences based on the image block sets of each image sequence; and determining, according to the correlation features between the image sequences, a target image sequence in which an abnormal event exists among the at least two image sequences.
An embodiment of the present disclosure provides an abnormal event detection apparatus, including: an acquisition module configured to acquire at least two image sequences, where each image sequence includes at least one frame of image; a division module configured to divide each image sequence at at least two scales to obtain image block sets composed of image blocks at the same position in all image frames at the same scale; a first determining module configured to determine correlation features between the image sequences based on the image block sets of each image sequence; and a second determining module configured to determine, according to the correlation features between the image sequences, a target image sequence in which an abnormal event exists among the at least two image sequences.
An embodiment of the present disclosure provides a computer device, including: a processor; and a memory for storing instructions executable by the processor, where the processor is configured to execute the abnormal event detection method described in the first aspect above.
An embodiment of the present disclosure provides a storage medium, where, when instructions in the storage medium are executed by a processor of a device, the device is enabled to execute the abnormal event detection method described in the first aspect above.
An embodiment of the present disclosure provides a computer program, which includes computer-readable code; when the computer-readable code is read and executed by a computer, some or all of the steps of the method in any embodiment of the present disclosure are implemented.
An embodiment of the present disclosure provides a computer program product, which includes a non-transitory computer-readable storage medium storing a computer program; when the computer program is read and executed by a computer, some or all of the steps of the method in any embodiment of the present disclosure are implemented.
The technical solutions provided by the embodiments of the present disclosure may include the following beneficial effects:
In the embodiments of the present disclosure, it is considered that some abnormal events occur in a very small region of an image frame, while others may span the entire picture; therefore, neither treating the image frame as a whole nor dividing it into regions at a single scale can cope with all kinds of abnormal events. The present disclosure therefore divides every frame of each image sequence at multiple scales, which improves the scale robustness of abnormal event detection. In addition, the present disclosure determines correlation features between the image sequences based on the image block sets of each image sequence, so that the abnormal event detection apparatus can combine the correlations between image sequences on a multi-scale basis, improving the detection accuracy for abnormal events.
It should be understood that the above general description and the following detailed description are exemplary and explanatory only and do not limit the present disclosure.
Brief Description of the Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the present disclosure.
FIG. 1 is a first flowchart of an abnormal event detection method according to an embodiment of the present disclosure;
FIG. 2 is an example diagram of scale division according to an embodiment of the present disclosure;
FIG. 3 is a second flowchart of an abnormal event detection method according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of the principle of obtaining a first feature based on a first concatenated feature in an embodiment of the present disclosure;
FIG. 5 is a schematic diagram of the principle of feature fusion in an embodiment of the present disclosure;
FIG. 6 is a third flowchart of an abnormal event detection method in an embodiment of the present disclosure;
FIG. 7 is a fourth flowchart of an abnormal event detection method in an embodiment of the present disclosure;
FIG. 8A is a schematic diagram of the principle of an abnormal event detection method according to an embodiment of the present disclosure;
FIG. 8B is a schematic diagram of the processing of some modules in FIG. 8A according to an embodiment of the present disclosure;
FIG. 9 is a diagram of an abnormal event detection apparatus according to an embodiment of the present disclosure;
FIG. 10 is a schematic diagram of a hardware entity of a computer device in an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments will be described in detail here, examples of which are shown in the accompanying drawings. Where the following description refers to the drawings, the same numbers in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present disclosure; rather, they are merely examples of apparatuses and methods consistent with some aspects of the present disclosure as detailed in the appended claims.
The abnormal event detection method provided by the embodiments of the present disclosure may be executed by an abnormal event detection apparatus; for example, the method may be executed by a terminal device, a server, or another electronic device, where the terminal device may be user equipment (UE), a mobile device, a user terminal, a terminal, a cellular phone, a cordless phone, a personal digital assistant (PDA), a handheld device, a computing device, a vehicle-mounted device, a wearable device, or the like. In some possible implementations, the abnormal event detection method may be implemented by a processor invoking computer-readable instructions stored in a memory.
In the embodiments of the present disclosure, the abnormal event detection apparatus may include an image acquisition component, so that consecutive frames of a scene are captured by the image acquisition component and divided into at least two image sequences. For example, if the image acquisition component is a camera that captures video at a fixed position, the abnormal event detection apparatus containing the camera may divide the video into at least two image sequences along the time dimension; one image sequence may be called one video clip, and the image frames contained in different video clips may not overlap. Alternatively, the abnormal event detection apparatus may not include an image acquisition component and may instead receive at least two already divided image sequences; or multiple videos of the same scene may be captured by independently installed cameras at different angles and transmitted to the abnormal event detection apparatus, in which case each received video may be called one image sequence. In the embodiments of the present disclosure, one image sequence may be a sequence within one time window, i.e., the image frames in the sequence are temporally adjacent.
It should be noted that, in the embodiments of the present disclosure, the way the image sequences are acquired, and the content of the at least one frame of image included in each image sequence, may be determined according to actual needs and application scenarios, and are not limited by the embodiments of the present disclosure.
FIG. 1 is a first flowchart of an abnormal event detection method according to an embodiment of the present disclosure. As shown in FIG. 1, the abnormal event detection method includes the following steps:
S11: acquiring at least two image sequences, wherein each image sequence includes at least one frame of image;
S12: dividing each image sequence at no fewer than two scales to obtain image block sets, each composed of the image blocks at the same position in all image frames at the same scale;
S13: determining correlation features between the image sequences based on the image block sets of each image sequence;
S14: determining, according to the correlation features between the image sequences, a target image sequence in which an abnormal event exists from among the at least two image sequences.
In the embodiments of the present disclosure, after the abnormal event detection apparatus acquires at least two image sequences, dividing each image sequence at no fewer than two scales means dividing every frame of image included in the image sequence at no fewer than two scales. After an image sequence is divided at multiple scales, the image blocks at the same position in all image frames at the same scale form one image block set.
Exemplarily, the abnormal event detection apparatus divides a video V into T non-overlapping image sequences, which may be denoted {v_t}, t = 1, ..., T, and divides every frame of each image sequence with R groups of sliding-window sizes, which may be denoted {(h_r, w_r)}, r = 1, ..., R. FIG. 2 is an example diagram of scale division according to an embodiment of the present disclosure. As shown in FIG. 2, an image sequence is divided at three scales (R = 3), and after division each frame contains 1 image block (marked L21), 6 image blocks (marked L22), and 15 image blocks (marked L23), respectively. In the embodiments of the present disclosure, the image blocks at the same position in the different image frames of the image sequence are taken as a whole to form one image block set, i.e., one small cube shown in FIG. 2. The first scale in the figure contains one cube, i.e., one image block set at that scale; the second scale contains 6 cubes, i.e., 6 image block sets at that scale; the third scale contains 15 cubes, i.e., 15 image block sets at that scale.
In the embodiments of the present disclosure, the image block sets at the same scale may be denoted {p^r_n}, n = 1, ..., N_r, where N_r is the number of image block sets corresponding to scale r. As shown in FIG. 2, N_r is 1 for the first scale, 6 for the second scale, and 15 for the third scale.
It should be noted that, in the embodiments of the present disclosure, when every frame of an image sequence is divided at a given scale, the image blocks corresponding to the same scale have the same size. Moreover, when a sliding window is used to split every frame into non-overlapping image blocks, the number of image blocks per frame at a given scale may be the result of rounding down the ratio of the frame size to the sliding-window size; that is, when the frame size is not evenly divisible by the sliding-window size, the present disclosure does not obtain additional image blocks by, for example, padding with "0"s or "1"s, and the content of every image block belongs to the content of the frame before division.
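For illustration only, the division step above can be sketched as follows. This is a minimal sketch, assuming the frames of one image sequence are given as a NumPy array of shape (T_frames, H, W, C); the function and variable names are illustrative and not part of the disclosure.

```python
import numpy as np

def split_into_patch_sets(frames: np.ndarray, window_sizes):
    """Split every frame with R sliding-window sizes into non-overlapping
    image blocks; blocks at the same position across all frames form one
    image block set ("cube"). Returns: scale index -> array of shape
    (N_r, T_frames, win_h, win_w, C)."""
    t, h, w, c = frames.shape
    patch_sets = {}
    for r, (win_h, win_w) in enumerate(window_sizes):
        n_h, n_w = h // win_h, w // win_w      # floor division: no padding
        sets = []
        for i in range(n_h):
            for j in range(n_w):
                block = frames[:, i * win_h:(i + 1) * win_h,
                               j * win_w:(j + 1) * win_w, :]
                sets.append(block)             # one small cube in FIG. 2
        patch_sets[r] = np.stack(sets)         # N_r image block sets at scale r
    return patch_sets
```

For example, choosing the window sizes so that a frame splits into 1 x 1, 2 x 3 and 3 x 5 grids reproduces the 1, 6 and 15 image blocks per frame of FIG. 2.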
In step S13, after the abnormal event detection apparatus obtains the image block sets of each image sequence, it can obtain a feature that characterizes each image sequence, and then obtain the correlation features between the image sequences based on the features characterizing the image sequences.
In one embodiment, when obtaining a feature that can characterize an image sequence based on its image block sets, the features of the image blocks in the multiple image block sets at different scales may, for example, be directly concatenated as the feature of the image sequence. Following the above example, each image sequence includes 3 frames in total, each frame includes image block sets at 3 scales, and assuming each image block corresponds to one feature, the number of features of the image sequence is: the number of image blocks per frame after multi-scale division times the number of frames, i.e., (1 + 6 + 15) x 3 = 66 features. In this embodiment, if the at least two image sequences are obtained by dividing the same video along the time dimension, the correlation features between image sequences obtained based on the features of the image sequences may be called temporal correlation features.
In another embodiment, when obtaining a feature that can characterize an image sequence based on its image block sets, the correlation features between the different image block sets at the same scale may first be determined, and the feature of the image sequence may then be obtained based on those correlation features. Alternatively, for each frame, the correlation features between image blocks may first be obtained, and the feature of the image sequence is then obtained based on them.
It can be understood that, since image blocks carry a position attribute, both the correlation features between different image block sets at the same scale and the correlation features between multiple image blocks within one frame carry a spatial attribute, and such correlation features can be characterized as spatial correlations. In the embodiments of the present disclosure, if the at least two image sequences are obtained by dividing the same video along the time dimension, the correlation features between image sequences obtained based on the features of the image sequences may be called spatio-temporal correlation features.
Of course, if in the embodiments of the present disclosure the at least two image sequences are image sequences of the same scene from different angles, the correlation features between the at least two image sequences may be understood as spatial correlation features. Moreover, if the correlation features between different image block sets at the same scale, or between multiple image blocks of one frame, are obtained first, and the correlation features of the image sequences are then obtained on that basis, the correlation features of the image sequences may be understood as features including both local spatial correlation and global spatial correlation, where the local spatial correlation is associated with the position attribute of the image blocks and the global spatial correlation is associated with the acquisition-angle attribute of the image sequences.
It should be noted that the correlation features between image sequences are used to characterize the relationships between image sequences; for example, they may include features obtained by weighting the features of the image sequences with different weights, the relationships between different image sequences being reflected by the weight assignment. In addition, the correlation features of the image sequences may also include, for the feature of any image sequence, a fusion of partial features of the other image sequences, i.e., the relationships between image sequences are reflected by feature fusion. The present disclosure does not restrict how the correlation features are obtained.
In the embodiments of the present disclosure, if there are T image sequences and the correlation features between the image sequences are denoted φ^ST, then φ^ST includes the features corresponding to the T image sequences, except that the feature corresponding to each image sequence has been correlation-processed based on the features of the other image sequences.
In step S14, after the abnormal event detection apparatus obtains the correlation features between the image sequences, it can determine, based on the correlation features, the target image sequence in which an abnormal event exists from among the at least two image sequences, for example by using a traditional feature recognition method or a trained model.
It can be understood that, in the embodiments of the present disclosure, considering that some abnormal events occur in a very small region of an image frame while others may span the entire picture, neither treating the image frame as a whole nor dividing it into regions at a single scale can cope with all kinds of abnormal events. The present disclosure therefore divides every frame of each image sequence at multiple scales, which improves the scale robustness of abnormal event detection. In addition, the present disclosure determines the correlation features between image sequences based on the image block sets of each image sequence, for example obtaining temporal, spatial, or spatio-temporal correlations through the aforementioned weight assignment, so that the abnormal event detection apparatus can combine the correlations between image sequences on a multi-scale basis, improving the detection accuracy for abnormal events.
FIG. 3 is a second flowchart of an abnormal event detection method according to an embodiment of the present disclosure. As shown in FIG. 3, step S13 in FIG. 1 may include the following steps:
S13a: for each image sequence, obtaining a first feature corresponding to a scale based on the image block sets at the same scale, wherein the first feature includes the correlations between the image block sets at the same scale;
S13b: fusing the first features corresponding to the scales in the same image sequence to obtain a second feature of each image sequence;
S13c: determining the correlation features between the image sequences based on the second features of the image sequences.
In step S13a, after the image block sets corresponding to a scale are determined, a first feature including the correlations between the image block sets at that scale can be obtained, e.g., the correlation features between the small cubes at each scale shown in FIG. 2. It can be understood that, since the image blocks in an image block set carry a position attribute, each image block set also carries a position attribute, so the obtained first feature is a feature including the spatial correlations between the image block sets.
Exemplarily, if the abnormal event detection apparatus performs R groups of scale divisions and the first feature is denoted φ^r, the apparatus obtains R groups of scale-specific first features φ^r in total.
In step S13b, the first features corresponding to the scales in the same image sequence are fused to obtain the second feature of each image sequence. If there are T image sequences and the second feature is denoted φ′_t, the abnormal event detection apparatus obtains T groups of φ′_t.
In step S13c, the correlation features between the image sequences are determined based on the second features of the image sequences. Since the first feature includes the spatial correlations between image block sets, if the at least two image sequences are sequences of the same video in different time periods, the correlation features between image sequences obtained in this step may be spatio-temporal correlation features. In addition, as analyzed above, if the at least two image sequences are sequences of the same scene from different angles, the correlation features between image sequences may also be features including both local spatial correlation and global spatial correlation.
It can be understood that, in the embodiments of the present disclosure, the first feature is obtained by taking as the processing unit the image block set composed of the image blocks at the same position in all frames of the image sequence, rather than focusing on one image block per frame; this relatively reduces the amount of computation when the correlation features between image sequences are further obtained based on the first features. Moreover, the obtained correlation features between image block sets include multi-dimensional correlations, which improves the accuracy of abnormal event detection.
In one embodiment, obtaining the first feature corresponding to a scale based on the image block sets at the same scale includes:
performing feature extraction on each image block set at the same scale to obtain the feature corresponding to each image block set;
concatenating the features of the image block sets at the same scale to obtain a first concatenated feature corresponding to the scale; and
based on the first concatenated feature corresponding to the scale, constructing the relationships between the image block sets at the same scale characterized by the first concatenated feature using a self-attention mechanism and convolution processing, to obtain the first feature corresponding to the scale.
In this embodiment, the feature of an image block set is obtained with the image block set as a whole, and the features of the image block sets at the same scale are then concatenated to obtain the first concatenated feature corresponding to the scale.
When concatenating the features of the image block sets at the same scale, the image block sets may be concatenated horizontally as wholes. Exemplarily, if after feature extraction the feature of each image block set has D dimensions and the first concatenated feature is denoted χ^r, the dimension of χ^r is: the number of image block sets at the same scale times D, i.e., N_r x D.
In one embodiment, performing feature extraction on each image block set at the same scale to obtain the feature corresponding to each image block set includes:
performing feature extraction on each image block set at the same scale to obtain, for each image block set, a feature that includes the temporal information between the image blocks in the image block set.
As mentioned above, the image frames in an image sequence are temporally adjacent, i.e., there is temporal information between the image frames in the sequence, and therefore also between the image blocks in an image block set. In this embodiment, when performing feature extraction on an image block set, a feature including the temporal information between the image blocks in the set can be obtained.
Exemplarily, the present disclosure may use a preset I3D feature encoder to perform feature extraction on each image block set at the same scale, so as to obtain features that include the temporal information between the image blocks in each set. It can be understood that, since the network structure of the I3D feature encoder is deep and uses 3-D convolution kernels, and the image block sets contain temporal information, the 3-D convolution kernels can incorporate the temporal information of the image block sets, making the feature extraction more complete.
In the embodiments of the present disclosure, after the first concatenated feature corresponding to a scale is obtained, the relationships between the image block sets at the same scale characterized by the first concatenated feature can be constructed to obtain the first feature corresponding to the scale.
It should be noted that, in the embodiments of the present disclosure, as mentioned above, the first feature corresponding to a scale may be denoted φ^r; the obtained first feature has the same dimension as the first concatenated feature, except that the first feature includes the correlations between the image block sets at the same scale, so the dimension of φ^r may also be N_r x D.
It can be understood that the present disclosure constructs the relationships between image block sets at the same scale characterized by the first concatenated feature through a self-attention mechanism and convolution processing. Based on machine vision theory, this gives the obtained first feature a good enhancement effect, for example selectively highlighting the parts of interest (i.e., the parts where anomalies may exist) in the image block sets at the same scale, which further improves the detection of abnormal events.
In one embodiment, based on the first concatenated feature corresponding to the scale, constructing the relationships between the image block sets at the same scale characterized by the first concatenated feature using a self-attention mechanism and convolution processing, to obtain the first feature corresponding to the scale, includes:
determining a weight matrix based on the self-attention mechanism and the first concatenated feature, wherein the weight matrix includes weight values characterizing the probability that each image block set at the same scale is abnormal;
obtaining a weighted feature based on the weight matrix and the first concatenated feature;
performing convolution processing on the first concatenated feature to obtain a convolved feature; and
obtaining the first feature based on the weighted feature, the convolved feature, and the first concatenated feature.
In this embodiment, the weight matrix is first determined based on the self-attention mechanism; the weight values in the weight matrix characterize the probability that each image block set at the same scale is abnormal, and a larger weight value indicates a higher probability that the image block set is abnormal.
In this embodiment, convolution processing is also performed on the first concatenated feature, for example using non-dilated or dilated convolution. Since the first concatenated feature includes the features of all image block sets at the same scale, the convolution operation of the convolution kernels can also associate the features of multiple image block sets.
In one embodiment, performing convolution processing on the first concatenated feature to obtain the convolved feature includes:
convolving the first concatenated feature with at least two dilated convolution kernels respectively to obtain the convolution result corresponding to each dilated convolution kernel, wherein at least two of the dilated convolution kernels have different dilation rates; and
concatenating the convolution results corresponding to the dilated convolution kernels to obtain the convolved feature.
In this embodiment, the first concatenated feature is processed by dilated convolution; for example, the at least two dilated convolution kernels include three kernels, each of which is a one-dimensional kernel, with dilation rates of 1, 2 and 4 respectively. If the dimension of the first concatenated feature χ^r is N_r x D, then after processing with the three dilated kernels, the dimension of the convolution result of each kernel may be N_r x D/4, and the convolved feature obtained by concatenating the results of the kernels has dimension N_r x 3D/4.
In the embodiments of the present disclosure, the convolved results may be denoted χ^r_*, * ∈ {DC1, DC2, DC3}, where DC1, DC2 and DC3 correspond to the convolution results of the three dilated convolution kernels.
Of course, the present disclosure is not limited to the above three one-dimensional dilated convolution kernels; since the weighted feature, the convolved feature and the first concatenated feature must ultimately work together to form the first feature, the number, sizes and dilation rates of the dilated convolution kernels may be set according to actual needs.
It can be understood that dilated convolution enlarges the receptive field, and when multiple dilated kernels with different dilation rates are stacked, the different receptive fields bring multi-scale information; therefore, the convolved feature obtained by convolving with multiple dilated kernels and concatenating the results enhances the first concatenated feature.
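As an illustration of this dilated-convolution processing, the following sketch assumes PyTorch and three one-dimensional kernels of size 3 with dilation rates 1, 2 and 4, each mapping D channels to D/4 channels, as in the example above; the class name is illustrative and not from the disclosure.

```python
import torch
import torch.nn as nn

class DilatedBranch(nn.Module):
    def __init__(self, dim: int, rates=(1, 2, 4)):
        super().__init__()
        # one 1-D dilated kernel per rate, each D -> D/4 channels
        self.convs = nn.ModuleList([
            nn.Conv1d(dim, dim // 4, kernel_size=3, padding=r, dilation=r)
            for r in rates                      # padding=r keeps length N_r
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, N_r, D) -> Conv1d expects (batch, D, N_r)
        x = x.transpose(1, 2)
        out = torch.cat([conv(x) for conv in self.convs], dim=1)
        return out.transpose(1, 2)              # (batch, N_r, 3*D/4)
```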
In one embodiment, determining the weight matrix based on the self-attention mechanism and the first concatenated feature includes:
performing dimension reduction on the first concatenated feature to obtain a dimension-reduced first concatenated feature;
convolving the dimension-reduced first concatenated feature with a preset first convolution kernel to obtain a first convolution result;
convolving the dimension-reduced first concatenated feature with a preset second convolution kernel to obtain a second convolution result; and
determining the weight matrix by applying the self-attention mechanism to the result of multiplying the first convolution result by the transpose of the second convolution result.
In the embodiments of the present disclosure, dimension reduction is first performed on the first concatenated feature to reduce the subsequent computation, for example through a one-dimensional convolution matrix. Exemplarily, the dimension-reduced first concatenated feature may be denoted χ̄^r, with dimension N_r x D/4. Of course, the present disclosure is not limited to reducing the feature dimension of each image block set to 1/4 of the original dimension.
In the embodiments of the present disclosure, the self-attention mechanism is based on predicting the covariance between any image block set and the other image block sets at the same scale; each image block set is treated as a random variable, and the weight values in the obtained weight matrix are the correlations of each image block set with all image block sets.
In this embodiment, the preset first convolution kernel and the preset second convolution kernel may both be one-dimensional kernels; the first and second convolution results obtained by convolving the dimension-reduced first concatenated feature with them may both be one-dimensional vectors. The attention map obtained by applying the normalized exponential function (softmax) of the self-attention mechanism to the product of the first convolution result and the transpose of the second convolution result is the weight matrix, which is essentially a covariance matrix.
Exemplarily, if the dimension of the first convolution result is N_r x D/4 and that of the transposed second convolution result is D/4 x N_r, the dimension of the weight matrix is N_r x N_r.
In one embodiment, obtaining the weighted feature based on the weight matrix and the first concatenated feature includes:
convolving the dimension-reduced first concatenated feature with a preset third convolution kernel to obtain a third convolution result;
multiplying the weight matrix by the third convolution result to obtain a weighted matrix; and
determining, as the weighted feature, the sum of the result of convolving the weighted matrix with a preset fourth convolution kernel and the dimension-reduced first concatenated feature.
In this embodiment, the preset third convolution kernel and the preset fourth convolution kernel may also be one-dimensional kernels. The third convolution result obtained by convolving the dimension-reduced first concatenated feature with the preset third kernel is multiplied by the weight matrix; every entry of the resulting weighted matrix is a weighted sum over the image block sets in the dimension-reduced first concatenated feature, the weights being the covariances between the image block sets at the same scale included in the dimension-reduced first concatenated feature.
Exemplarily, the dimension of the third convolution result may be N_r x D/4, the dimension of the weighted matrix may be N_r x D/4, and the dimension of the weighted feature may be N_r x D/4.
In the embodiments of the present disclosure, the result of convolving the weighted matrix with the preset fourth convolution kernel is added to the dimension-reduced first concatenated feature, i.e., a residual connection is made; the resulting weighted feature has a stronger ability to characterize the image block sets.
In the present disclosure, the above process of obtaining the weight matrix and the weighted feature can be expressed by the following formulas (1) and (2):

$$M = \mathrm{softmax}\left( (W_{\theta}\,\bar{\chi}^{r})\,(W_{\phi}\,\bar{\chi}^{r})^{\top} \right) \tag{1}$$

$$\chi^{r}_{SA} = W_{z}\left( M\,(W_{g}\,\bar{\chi}^{r}) \right) + \bar{\chi}^{r} \tag{2}$$

In formulas (1) and (2), W_θ is the preset first convolution kernel, W_φ is the preset second convolution kernel, W_g is the preset third convolution kernel, W_z is the preset fourth convolution kernel, and χ̄^r is the dimension-reduced first concatenated feature. The softmax part yields the weight matrix M, M(W_g χ̄^r) is the weighted matrix, and χ^r_SA is the weighted feature.
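A minimal sketch of formulas (1) and (2), assuming PyTorch and modeling the preset kernels W_θ, W_φ, W_g and W_z as one-dimensional convolutions; the class and layer names are illustrative assumptions, and D is assumed divisible by 4.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchSelfAttention(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        d = dim // 4
        self.reduce = nn.Conv1d(dim, d, kernel_size=1)   # dimension reduction
        self.w_theta = nn.Conv1d(d, d, kernel_size=1)    # preset first kernel
        self.w_phi = nn.Conv1d(d, d, kernel_size=1)      # preset second kernel
        self.w_g = nn.Conv1d(d, d, kernel_size=1)        # preset third kernel
        self.w_z = nn.Conv1d(d, d, kernel_size=1)        # preset fourth kernel

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: first concatenated feature, (batch, N_r, D)
        x = self.reduce(x.transpose(1, 2))               # chi_bar: (b, D/4, N_r)
        theta = self.w_theta(x)                          # (b, D/4, N_r)
        phi = self.w_phi(x)                              # (b, D/4, N_r)
        m = F.softmax(theta.transpose(1, 2) @ phi, dim=-1)  # (b, N_r, N_r), formula (1)
        g = self.w_g(x).transpose(1, 2)                  # (b, N_r, D/4)
        weighted = m @ g                                 # weighted matrix, (b, N_r, D/4)
        out = self.w_z(weighted.transpose(1, 2)) + x     # residual connection, formula (2)
        return out.transpose(1, 2)                       # weighted feature, (b, N_r, D/4)
```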
In one embodiment, obtaining the first feature based on the weighted feature, the convolved feature, and the first concatenated feature includes:
concatenating the weighted feature with the convolved feature and adding the result to the first concatenated feature to obtain the first feature.
In this embodiment, the first feature can be expressed by the following formula (3):

$$\phi^{r} = \left[\chi^{r}_{SA};\ \chi^{r}_{DC1};\ \chi^{r}_{DC2};\ \chi^{r}_{DC3}\right] + \chi^{r} \tag{3}$$

where χ^r_SA is the weighted feature, χ^r_DC1, χ^r_DC2 and χ^r_DC3 are the convolved results, χ^r is the first concatenated feature, and φ^r is the first feature, with dimension N_r x D.
FIG. 4 is a schematic diagram of the principle of obtaining the first feature based on the first concatenated feature in an embodiment of the present disclosure. As shown in FIG. 4, the branch marked L41 on the right is the process of determining the weight matrix M based on the self-attention mechanism and the first concatenated feature χ^r, and then obtaining the weighted feature χ^r_SA based on the weight matrix M and the first concatenated feature χ^r; the branch marked L42 on the left is the process of processing the first concatenated feature χ^r by dilated convolution to obtain the convolved feature of dimension N_r x 3D/4. After the weighted feature χ^r_SA is concatenated with the convolved feature of dimension N_r x 3D/4 and added to the first concatenated feature χ^r, the first feature φ^r shown in FIG. 4 is obtained. The above process may refer to the foregoing description.
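Combining the two sketches above, formula (3) can be illustrated as follows; this is again an illustrative sketch under the same assumptions, not the original implementation, and it reuses the PatchSelfAttention and DilatedBranch classes defined earlier.

```python
import torch
import torch.nn as nn

class PatchRelationBlock(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.attn = PatchSelfAttention(dim)   # right branch (L41) in FIG. 4
        self.dilated = DilatedBranch(dim)     # left branch (L42) in FIG. 4

    def forward(self, chi: torch.Tensor) -> torch.Tensor:
        # chi: first concatenated feature, (batch, N_r, D)
        # D/4 attention channels + 3*D/4 dilated channels -> D channels
        fused = torch.cat([self.attn(chi), self.dilated(chi)], dim=-1)
        return fused + chi                    # first feature phi^r, (batch, N_r, D)
```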
In one embodiment, fusing the first features corresponding to the scales in the same image sequence to obtain the second feature of each image sequence includes:
reconstructing the first feature of the same scale according to the positional relationships of the image block sets to obtain a reconstructed feature corresponding to the scale;
convolving the reconstructed feature corresponding to the scale with a preset fifth convolution kernel and converting the result into a one-dimensional feature vector through a fully connected layer; and
accumulating the one-dimensional feature vectors of the scales to obtain the second feature of each image sequence.
In this embodiment, since the first feature corresponding to a scale is obtained based on concatenating the image block sets at that scale, and has the same dimension as the first concatenated feature, the first feature can be understood as the result of horizontally concatenating the correlation features of the image block sets at the same scale. Since the image blocks included in the image block sets carry position attributes, the present disclosure can reconstruct the first feature according to the positional relationships of the image blocks in the image block sets to obtain the reconstructed feature corresponding to the scale. It can be understood that this reconstructed feature is a three-dimensional tensor; each of its elements characterizes one image block set, with a feature dimension of D.
After the reconstructed feature is obtained based on the positional relationships of the image blocks in the image block sets, it is converted into a one-dimensional feature vector through the preset fifth convolution kernel and a fully connected layer. The preset fifth convolution kernel may be a two-dimensional kernel used for dimension-reducing convolution of the reconstructed feature; the feature after the two-dimensional convolution is converted by the fully connected layer into a one-dimensional feature vector, whose feature dimension may be D. It can be understood that this one-dimensional feature vector is the feature characterizing the image block sets at the same scale.
Since the second feature of the image sequence is obtained by accumulating the one-dimensional feature vectors of all scales, it can be understood that the second feature of the image sequence fuses multi-scale features.
FIG. 5 is a schematic diagram of the principle of feature fusion in an embodiment of the present disclosure, explained with the first feature of one scale as an example. As shown in FIG. 5, the dashed box L51a shows the first feature corresponding to one scale; this first feature includes the correlations between the image block sets at that scale. The cube L52a in the figure represents the reconstructed feature obtained by reconstructing that first feature according to the positional relationships of the image blocks in the image block sets. The reconstructed feature is converted into a one-dimensional feature vector through the two-dimensional convolutional layer L53a and the fully connected layer L54a. As shown in FIG. 5, each first feature corresponds to one reconstructed feature; after the reconstructed features are converted into one-dimensional feature vectors through the two-dimensional convolutional layer and the fully connected layer, the vectors are accumulated to obtain L50, i.e., the second feature corresponding to the image sequence. The two-dimensional convolutional layer may include the preset fifth convolution kernel of the present disclosure. It should be noted that L53a, L53b and L53c shown in FIG. 5 may be the same two-dimensional convolutional layer, and L54a, L54b and L54c may be the same fully connected layer; the embodiments of the present disclosure place no restriction on this.
It can be understood that, by fusing the features of the image block sets at all scales, the present disclosure enables the abnormal event detection apparatus to have a local-to-global perception of the image frames in the image sequence, thereby improving robustness to abnormal events of different scales.
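A sketch of this multi-scale aggregation follows, assuming each scale r tiles a frame into an n_h x n_w grid (so N_r = n_h * n_w) and modeling the preset fifth convolution kernel as a 3 x 3 two-dimensional convolution; the pooling step before the fully connected layer is an added assumption, since the disclosure does not specify how the convolved tensor is flattened.

```python
import torch
import torch.nn as nn

class PatchAggregation(nn.Module):
    def __init__(self, dim: int, grid_shapes):
        super().__init__()
        self.grid_shapes = grid_shapes        # [(n_h, n_w), ...], one per scale
        self.conv = nn.Conv2d(dim, dim, kernel_size=3, padding=1)
        self.pool = nn.AdaptiveAvgPool2d(1)   # assumed flattening step
        self.fc = nn.Linear(dim, dim)         # fully connected layer -> D dims

    def forward(self, first_feats):
        # first_feats: list of R first features, each (batch, N_r, D)
        second = 0
        for feat, (n_h, n_w) in zip(first_feats, self.grid_shapes):
            b, _, d = feat.shape
            # reconstruct by the positional relationships of the image blocks
            grid = feat.transpose(1, 2).reshape(b, d, n_h, n_w)
            vec = self.pool(self.conv(grid)).flatten(1)   # (batch, D)
            second = second + self.fc(vec)    # accumulate over scales
        return second                          # second feature, (batch, D)
```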
In one embodiment, determining the correlation features between the image sequences based on the second features of the image sequences includes:
concatenating the second features of the image sequences to obtain a second concatenated feature; and
based on the second concatenated feature, constructing the relationships between the different image sequences characterized by the second concatenated feature to determine the correlation features between the image sequences.
In the embodiments of the present disclosure, the correlation features between the image sequences may be determined in the same way as the correlation features between image block sets at the same scale, i.e., in the same way as the first feature corresponding to a scale, are obtained.
In this embodiment, the second features of the image sequences may be concatenated, e.g., horizontally, to obtain the second concatenated feature; then, following the principle of FIG. 4 above, a weight matrix of the image sequences is determined based on the self-attention mechanism and the second concatenated feature, the weight matrix of the image sequences including weight values characterizing the probability that each image sequence is abnormal. Subsequently, the weighted features corresponding to all image sequences are obtained based on the weight matrix of the image sequences and the second concatenated feature. When processing with the self-attention mechanism, the second concatenated feature may first be dimension-reduced, e.g., with a one-dimensional convolution. In addition, convolution processing is performed on the second concatenated feature to obtain the convolved features corresponding to all image sequences, and the correlation features between the image sequences are further determined from the weighted features of all image sequences, the convolved features of all image sequences, and the second concatenated feature.
Exemplarily, if the second feature of an image sequence is denoted φ′_t, there are T groups of φ′_t for T image sequences, and the second concatenated feature obtained by concatenating the second features of the image sequences may be denoted Φ = [φ′_1; ...; φ′_T].
The above process can be expressed by the following formulas (4)-(6):

$$M' = \mathrm{softmax}\left( (W_{\theta}\,\bar{\Phi})\,(W_{\phi}\,\bar{\Phi})^{\top} \right) \tag{4}$$

$$\Phi_{SA} = W_{z}\left( M'\,(W_{g}\,\bar{\Phi}) \right) + \bar{\Phi} \tag{5}$$

$$\phi^{ST} = \left[\Phi_{SA};\ \Phi_{DC1};\ \Phi_{DC2};\ \Phi_{DC3}\right] + \Phi \tag{6}$$

where Φ̄ is the feature obtained by dimension-reducing the second concatenated feature; W_θ, W_φ, W_g and W_z may refer to the descriptions in formulas (1) and (2); the softmax part yields the weight matrix M′ of the image sequences; M′(W_g Φ̄) is the weighted matrix corresponding to all image sequences, and Φ_SA is the weighted feature of all image sequences; Φ_*, * ∈ {DC1, DC2, DC3}, are the convolved features corresponding to all image sequences; and φ^ST denotes the correlation features between the image sequences.
It should be noted that, in the embodiments of the present disclosure, the dimension of φ^ST may be the number of image sequences times the feature dimension of each image sequence, i.e., T x D.
FIG. 6 is a third flowchart of an abnormal event detection method in an embodiment of the present disclosure. As shown in FIG. 6, step S14 in FIG. 1 may include the following steps:
S14a: detecting the correlation features between the image sequences based on a preset anomaly prediction model to obtain a prediction result of each image sequence, wherein the preset anomaly prediction model is a model obtained by training with a weakly supervised training method;
S14b: determining, according to the prediction results of the image sequences, the target image sequence in which the abnormal event exists.
As mentioned above, the target image sequence in which an abnormal event exists may be determined from among the at least two image sequences according to the correlation features between the image sequences, using a traditional feature recognition method or a trained model. In this embodiment, a pre-trained anomaly detection model obtained by weakly supervised training is used.
For weakly supervised training, a loss function needs to be constructed. The loss function estimates the degree of inconsistency between the model's predicted values and the ground-truth values; generally, the smaller the loss value, the better the robustness of the model. During training, the model parameters can be adjusted under the constraint of the loss function, so as to train a better model.
In the embodiments of the present disclosure, the features of the training samples are obtained as described above with reference to FIGS. 1 to 5; the loss function is then constructed based on the obtained sample features and the sample labels, and the model parameters are continuously revised to obtain a model with better detection performance. In the embodiments of the present disclosure, the initial model is, for example, a convolutional neural network (CNN) model, a deep neural network (DNN) model, or the like, which is not limited here.
In one embodiment, the method further includes:
for the positive samples and the negative samples in a training sample set, respectively selecting the K sample image sequences with larger feature gradients to compute an average feature gradient, wherein K is a positive integer greater than 1;
constructing a loss function according to the average feature gradient corresponding to the positive samples and the average feature gradient corresponding to the negative samples; and
training based on the loss function to obtain the preset anomaly prediction model.
In the embodiments of the present disclosure, the training sample set includes positive samples and negative samples, where a positive sample is a sample whose image sequences contain no abnormal event and a negative sample is a sample whose image sequences contain an abnormal event. One sample may be one video, and a video is divided into different image sequences; a video corresponds to one label, but the image sequences have no labels. In the embodiments of the present disclosure, each video can be likened to a "bag" and an image sequence to an "instance": the "bag" is labeled, but the "instances" are not.
In the embodiments of the present disclosure, for the positive and negative samples, the K sample image sequences with larger feature gradients are respectively selected to compute average feature gradients, and the loss function is then constructed based on the average feature gradients corresponding to the positive samples and to the negative samples.
Assuming that the sample features obtained by the foregoing method for the T image sequences of one video in the training samples are {φ″_t}, t = 1, ..., T, the loss function is constructed as follows:
A. Select the top-K image sequences with the largest feature gradients from all image sequences and compute the average feature gradient according to the following formula (7):

$$g(\phi) = \frac{1}{K} \sum_{t \in \Omega_{K}} \left\lVert \phi''_{t} \right\rVert_{2} \tag{7}$$

where Ω_K denotes the K selected image sequences and ||φ″_t||_2 is the 2-norm of the feature; in the present disclosure, the feature gradient is obtained by computing the 2-norm of the feature.
B. Based on the abnormal video φ^ST+ and the normal video φ^ST- identified by the video labels, compute the ranking loss according to the following formula (8):

$$L_{fm} = \max\left(0,\ 1 - g(\phi^{ST+}) + g(\phi^{ST-})\right) \tag{8}$$

where g(φ^ST+) is the average feature gradient of the top-K image sequences in the abnormal video, and g(φ^ST-) is the average feature gradient of the top-K image sequences in the normal video.
C. Input the features of the top-K image sequences of each video into the original model to predict anomaly scores, obtaining {s_t} (one prediction score per image sequence); based on the predicted anomaly scores and the label of the video, compute the cross-entropy loss as shown in formula (9):

$$L_{ce} = -\left( y \log s + (1 - y)\log(1 - s) \right) \tag{9}$$

where s denotes a predicted anomaly score and y denotes the label of the video; for example, the label value of an abnormal video is 1 and that of a normal video is 0.
D. Introduce a sparsity constraint and a temporal smoothness constraint, and determine the total loss function as the following formula (10):

$$L = L_{ce} + \lambda_{fm} L_{fm} + \lambda_{1} \sum_{t}\left\lvert s_{t} \right\rvert + \lambda_{2} \sum_{t}\left( s_{t} - s_{t-1} \right)^{2} \tag{10}$$

where λ_fm, λ_1 and λ_2 are factors used to balance the loss terms, the term weighted by λ_1 denotes the sparsity constraint on the predicted scores, and the term weighted by λ_2 denotes the temporal smoothness constraint.
The present disclosure can construct the loss function based on the above steps and thereby obtain the preset anomaly detection model. After the correlation features φ^ST between the image sequences are input into the preset anomaly detection model, the prediction result of each image sequence is obtained; for example, the prediction result is a prediction score, and the present disclosure compares each prediction score with a preset score threshold, e.g., determining the image sequences whose prediction scores are greater than the preset score threshold as the target image sequences in which an abnormal event exists.
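The loss construction of formulas (7)-(10) can be sketched for one abnormal/normal video pair as follows. This is a hedged sketch: averaging the cross-entropy over the K per-sequence scores and the hyper-parameter values are assumptions, not taken from the disclosure.

```python
import torch

def top_k_feature_gradient(feats: torch.Tensor, k: int) -> torch.Tensor:
    """Formula (7): mean of the K largest per-sequence feature 2-norms."""
    grads = feats.norm(p=2, dim=1)            # ||phi''_t||_2 per sequence
    return grads.topk(k).values.mean()

def total_loss(feats_abn, feats_norm, scores_abn, y,
               k=3, lam_fm=1.0, lam1=8e-4, lam2=8e-4):
    # feats_*: per-sequence features (T, D); scores_abn: predicted scores (T,)
    # y: video-level label as a float tensor (1.0 abnormal, 0.0 normal)
    # Formula (8): ranking loss on top-K average feature gradients.
    l_rank = torch.clamp(1 - top_k_feature_gradient(feats_abn, k)
                         + top_k_feature_gradient(feats_norm, k), min=0)
    # Formula (9): cross entropy between top-K predicted scores and the label.
    s = scores_abn.topk(k).values
    l_ce = -(y * torch.log(s + 1e-8)
             + (1 - y) * torch.log(1 - s + 1e-8)).mean()
    # Formula (10): add sparsity and temporal-smoothness constraints.
    l_sparse = scores_abn.abs().sum()
    l_smooth = ((scores_abn[1:] - scores_abn[:-1]) ** 2).sum()
    return l_ce + lam_fm * l_rank + lam1 * l_sparse + lam2 * l_smooth
```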
It can be understood that, compared with traditional methods, processing the correlation features of the image sequences with an abnormal event detection model obtained by a weakly supervised training method to determine the target image sequence in which an abnormal event exists gives the preset abnormal event detection model better generalization ability; moreover, compared with a model trained by an unsupervised method, the weakly supervised training is guided by training labels, so the accuracy of abnormal event detection is better.
FIG. 7 is a fourth flowchart of an abnormal event detection method in an embodiment of the present disclosure. As shown in FIG. 7, step S11 in FIG. 1 may include the following steps:
S11a: acquiring a video to be detected;
S11b: determining difference values between adjacent frame images in the video to be detected;
S11c: among the adjacent frame images whose difference value is greater than a preset difference threshold, determining the temporally earlier image frame as the last frame of one image sequence, and the temporally later image frame as the first frame of the image sequence adjacent to that one image sequence.
In this embodiment, the at least two image sequences come from the same video, i.e., the video to be detected. When dividing the video to be detected into image sequences, the present disclosure detects, in a clustering manner, the difference values between adjacent frame images in the video, and takes frames with similar content as one image sequence. In this way, the content of different image sequences does not repeat, the differences between different image sequences are increased, and the accuracy of anomaly localization can thus be improved.
It should be noted that, when determining the difference values between adjacent frame images in the video to be detected, the present disclosure may, for example, compute the frame difference of two adjacent frames, although the present disclosure is not limited to this manner. Moreover, the way the abnormal event detection apparatus acquires the at least two image sequences is not limited to this embodiment; for example, the video may also be divided into image sequences of equal duration based on time, which is not detailed here.
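A minimal sketch of steps S11a-S11c, assuming the difference value is the mean absolute pixel difference of adjacent grayscale frames and using an illustrative threshold value:

```python
import numpy as np

def split_video(frames: np.ndarray, diff_threshold: float = 30.0):
    """frames: (T_frames, H, W) grayscale video. Returns a list of image
    sequences; a cut is placed wherever adjacent frames differ strongly."""
    sequences, start = [], 0
    for t in range(1, len(frames)):
        diff = np.abs(frames[t].astype(np.float32)
                      - frames[t - 1].astype(np.float32)).mean()
        if diff > diff_threshold:
            sequences.append(frames[start:t])   # frame t-1 ends one sequence
            start = t                           # frame t starts the next one
    sequences.append(frames[start:])
    return sequences
```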
FIG. 8A is a schematic diagram of the principle of an abnormal event detection method according to an embodiment of the present disclosure, and FIG. 8B is a schematic diagram of the processing of some modules in FIG. 8A. The video clips marked L81 in FIG. 8A are the image sequences; three image sequences are shown in total. After each image sequence is input into the multi-scale patch generator L82, each obtained patch is one image block set as referred to in the present disclosure. After the patches are input into the pre-trained feature encoder L83 to extract features, patch spatial relation modeling can be performed based on the module marked L84. As shown in FIG. 8B, for one image sequence, after the image sequence is input into the multi-scale patch generator L82 (R groups of scales in total), the scale-specific patch set {p^r_n} obtained may include multiple image block sets. The scale-specific patch set is passed through the pre-trained feature encoder marked L83 to obtain the first concatenated feature χ^r corresponding to the scale; the scale-specific first concatenated feature is then passed through the patch spatial relation modeling marked L84 to obtain the correlations between the image block sets at the same scale, i.e., the first feature φ^r corresponding to the scale, as shown in FIG. 8B. For the first features after patch spatial relation modeling at each scale, the patch aggregation module marked L85 can fuse the first features φ^r of the different scales of the same image sequence, obtaining the second feature corresponding to the image sequence, i.e., one of the T feature segments shown at L86 in FIG. 8A. Subsequently, the second features of all image sequences, i.e., the T feature segments shown at L86, are passed through the video temporal relation module marked L87 to obtain the spatio-temporally modeled features, i.e., the correlation features between the image sequences referred to in the present disclosure. Finally, these correlation features are input into the pre-trained classifier L88 to obtain the prediction score of each image sequence; based on the prediction score of each image sequence, it can be determined whether the image sequence contains an abnormal event. The pre-trained classifier may be obtained by a weakly supervised training method: the loss function of the model is constructed from the video-level labels of the training samples and the predicted scores of the training samples, and the model parameters are fixed when the loss satisfies a convergence condition, yielding the trained classifier.
FIG. 9 is a diagram of an abnormal event detection apparatus according to an embodiment of the present disclosure. Referring to FIG. 9, the abnormal event detection apparatus 900 includes:
an acquisition module 901 configured to acquire at least two image sequences, wherein each image sequence includes at least one frame of image;
a division module 902 configured to divide each image sequence at no fewer than two scales to obtain image block sets, each composed of the image blocks at the same position in all image frames at the same scale;
a first determination module 903 configured to determine correlation features between the image sequences based on the image block sets of each image sequence;
a second determination module 904 configured to determine, according to the correlation features between the image sequences, a target image sequence in which an abnormal event exists from among the at least two image sequences.
In some embodiments, the first determination module 903 is configured to: for each image sequence, obtain a first feature corresponding to a scale based on the image block sets at the same scale, wherein the first feature includes the correlations between the image block sets at the same scale; fuse the first features corresponding to the scales in the same image sequence to obtain a second feature of each image sequence; and determine the correlation features between the image sequences based on the second features of the image sequences.
In some embodiments, the first determination module 903 is configured to: perform feature extraction on each image block set at the same scale to obtain the feature corresponding to each image block set; concatenate the features of the image block sets at the same scale to obtain a first concatenated feature corresponding to the scale; and, based on the first concatenated feature corresponding to the scale, construct the relationships between the image block sets at the same scale characterized by the first concatenated feature using a self-attention mechanism and convolution processing, to obtain the first feature corresponding to the scale.
In some embodiments, the first determination module 903 is configured to: determine a weight matrix based on the self-attention mechanism and the first concatenated feature, wherein the weight matrix includes weight values characterizing the probability that each image block set at the same scale is abnormal; obtain a weighted feature based on the weight matrix and the first concatenated feature; perform convolution processing on the first concatenated feature to obtain a convolved feature; and obtain the first feature based on the weighted feature, the convolved feature, and the first concatenated feature.
In some embodiments, the first determination module 903 is configured to: perform dimension reduction on the first concatenated feature to obtain a dimension-reduced first concatenated feature; convolve the dimension-reduced first concatenated feature with a preset first convolution kernel to obtain a first convolution result; convolve the dimension-reduced first concatenated feature with a preset second convolution kernel to obtain a second convolution result; and determine the weight matrix by applying the self-attention mechanism to the result of multiplying the first convolution result by the transpose of the second convolution result.
In some embodiments, the first determination module 903 is configured to: convolve the dimension-reduced first concatenated feature with a preset third convolution kernel to obtain a third convolution result; multiply the weight matrix by the third convolution result to obtain a weighted matrix; and determine, as the weighted feature, the sum of the result of convolving the weighted matrix with a preset fourth convolution kernel and the dimension-reduced first concatenated feature.
In some embodiments, the first determination module 903 is configured to: convolve the first concatenated feature with at least two dilated convolution kernels respectively to obtain the convolution result corresponding to each dilated convolution kernel, wherein at least two of the dilated convolution kernels have different dilation rates; and concatenate the convolution results corresponding to the dilated convolution kernels to obtain the convolved feature.
In some embodiments, the first determination module 903 is configured to concatenate the weighted feature with the convolved feature and add the result to the first concatenated feature to obtain the first feature.
In some embodiments, the first determination module 903 is configured to perform feature extraction on each image block set at the same scale to obtain, for each image block set, a feature that includes the temporal information between the image blocks in the image block set.
In some embodiments, the first determination module 903 is configured to: reconstruct the first feature of the same scale according to the positional relationships of the image block sets to obtain a reconstructed feature corresponding to the scale; convolve the reconstructed feature corresponding to the scale with a preset fifth convolution kernel and convert the result into a one-dimensional feature vector through a fully connected layer; and accumulate the one-dimensional feature vectors of the scales to obtain the second feature of each image sequence.
In some embodiments, the first determination module 903 is configured to: concatenate the second features of the image sequences to obtain a second concatenated feature; and, based on the second concatenated feature, construct the relationships between the different image sequences characterized by the second concatenated feature based on a self-attention mechanism and convolution processing, to determine the correlation features between the image sequences.
In some embodiments, the second determination module 904 is configured to: detect the correlation features between the image sequences based on a preset anomaly prediction model to obtain a prediction result of each image sequence, wherein the preset anomaly prediction model is a model obtained by training with a weakly supervised training method; and determine, according to the prediction results of the image sequences, the target image sequence in which the abnormal event exists.
In some embodiments, the apparatus further includes: a computation module 905 configured to, for the positive samples and the negative samples in a training sample set, respectively select the K sample image sequences with larger feature gradients to compute an average feature gradient, wherein K is a positive integer greater than 1; a construction module 906 configured to construct a loss function according to the average feature gradient corresponding to the positive samples and the average feature gradient corresponding to the negative samples; and a training module 907 configured to train based on the loss function to obtain the preset anomaly prediction model.
In some embodiments, the acquisition module 901 is configured to: acquire a video to be detected; determine difference values between adjacent frame images in the video to be detected; and, among the adjacent frame images whose difference value is greater than a preset difference threshold, determine the temporally earlier image frame as the last frame of one image sequence and the temporally later image frame as the first frame of the image sequence adjacent to that one image sequence.
The description of the above apparatus embodiments is similar to that of the above method embodiments and has similar beneficial effects. For technical details not disclosed in the apparatus embodiments of the present disclosure, please refer to the description of the method embodiments of the present disclosure.
Correspondingly, an embodiment of the present disclosure provides a computer device including a memory and a processor, the memory storing a computer program executable on the processor, wherein the processor implements the steps of the above method when executing the program.
Correspondingly, an embodiment of the present disclosure provides a computer-readable storage medium on which a computer program is stored, the computer program implementing the steps of the above method when executed by a processor. The computer-readable storage medium may be transitory or non-transitory.
Correspondingly, an embodiment of the present disclosure provides a computer program product including a non-transitory computer-readable storage medium storing a computer program which, when read and executed by a computer, implements some or all of the steps of the above method. The computer program product may be implemented by hardware, software, or a combination thereof. In an optional embodiment, the computer program product is embodied as a computer storage medium; in another optional embodiment, the computer program product is embodied as a software product, such as a software development kit (SDK).
It should be pointed out here that the descriptions of the above storage medium, computer program product and device embodiments are similar to the description of the above method embodiments and have similar beneficial effects. For technical details not disclosed in the storage medium, computer program product and device embodiments of the present disclosure, please refer to the description of the method embodiments of the present disclosure.
It should be noted that FIG. 10 is a schematic diagram of a hardware entity of a computer device in an embodiment of the present disclosure. As shown in FIG. 10, the hardware entity of the computer device 1000 includes: a processor 1001, a communication interface 1002 and a memory 1003, wherein:
the processor 1001 generally controls the overall operation of the computer device 1000;
the communication interface 1002 enables the computer device to communicate with other terminals or servers over a network;
the memory 1003 is configured to store instructions and applications executable by the processor 1001, and may also cache data to be processed or already processed by the processor 1001 and the modules of the computer device 1000 (e.g., image data, audio data, voice communication data, and video communication data); it may be implemented by flash memory (FLASH) or random access memory (RAM). Data may be transferred between the processor 1001, the communication interface 1002 and the memory 1003 via a bus 1004.
It should be understood that references throughout the specification to "one embodiment" or "an embodiment" mean that a particular feature, structure or characteristic related to the embodiment is included in at least one embodiment of the present disclosure; thus, occurrences of "in one embodiment" or "in an embodiment" throughout the specification do not necessarily refer to the same embodiment. Furthermore, these particular features, structures or characteristics may be combined in any suitable manner in one or more embodiments. It should be understood that, in the various embodiments of the present disclosure, the sequence numbers of the above processes do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present disclosure. The above sequence numbers of the embodiments of the present disclosure are for description only and do not represent the superiority or inferiority of the embodiments.
It should be noted that, herein, the terms "comprise", "include" or any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or apparatus including a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article or apparatus. Without further limitation, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, article or apparatus that includes the element.
In the several embodiments provided by the present disclosure, it should be understood that the disclosed devices and methods may be implemented in other ways. The device embodiments described above are illustrative; for example, the division of the units is a logical functional division, and there may be other division manners in actual implementation, e.g., multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the coupling, direct coupling or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection of devices or units may be electrical, mechanical or in other forms.
The units described above as separate components may or may not be physically separated, and the components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units; some or all of the units may be selected according to actual needs to achieve the purpose of the solutions of the embodiments.
In addition, the functional units in the embodiments of the present disclosure may all be integrated into one processing unit, each unit may serve separately as a single unit, or two or more units may be integrated into one unit; the integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional units.
Those of ordinary skill in the art will understand that all or part of the steps of the above method embodiments may be implemented by hardware related to program instructions; the aforementioned program may be stored in a computer-readable storage medium, and when executed, performs the steps of the above method embodiments.
Alternatively, if the above integrated units of the present disclosure are implemented in the form of software functional modules and sold or used as independent products, they may also be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the present disclosure, in essence, or the part contributing to the related art, may be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions that enable a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the methods described in the various embodiments of the present disclosure.
The computer-readable storage medium may be a tangible device that can hold and store instructions for use by an instruction execution device, and may be a volatile or a non-volatile storage medium. The computer-readable storage medium may be, but is not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. Examples (a non-exhaustive list) of computer-readable storage media include: portable computer disks, hard drives, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory, memory sticks, floppy disks, mechanically encoded devices such as punched cards or raised structures in grooves with instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein are not to be interpreted as transient signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., light pulses through fiber-optic cables), or electrical signals transmitted through wires.
The above are only implementations of the present disclosure, but the protection scope of the present disclosure is not limited thereto; any person skilled in the art can easily conceive of changes or substitutions within the technical scope disclosed by the present disclosure, which should all be covered by the protection scope of the present disclosure. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims (32)

  1. An abnormal event detection method, the method comprising:
    acquiring at least two image sequences, wherein each of the image sequences comprises at least one frame of image;
    dividing each of the image sequences at no fewer than two scales to obtain image block sets, each composed of the image blocks at the same position in all image frames at the same scale;
    determining correlation features between the image sequences based on the image block sets of each of the image sequences; and
    determining, according to the correlation features between the image sequences, a target image sequence in which an abnormal event exists from among the at least two image sequences.
  2. The method according to claim 1, wherein the determining correlation features between the image sequences based on the image block sets of each of the image sequences comprises:
    for each of the image sequences, obtaining a first feature corresponding to a scale based on the image block sets at the same scale, wherein the first feature comprises correlations between the image block sets at the same scale;
    fusing the first features corresponding to the scales in the same image sequence to obtain a second feature of each of the image sequences; and
    determining the correlation features between the image sequences based on the second features of the image sequences.
  3. The method according to claim 2, wherein the obtaining a first feature corresponding to a scale based on the image block sets at the same scale comprises:
    performing feature extraction on each of the image block sets at the same scale to obtain a feature corresponding to each of the image block sets;
    concatenating the features of the image block sets at the same scale to obtain a first concatenated feature corresponding to the scale; and
    based on the first concatenated feature corresponding to the scale, constructing relationships between the image block sets at the same scale characterized by the first concatenated feature by using a self-attention mechanism and convolution processing, to obtain the first feature corresponding to the scale.
  4. The method according to claim 3, wherein, based on the first concatenated feature corresponding to the scale, the constructing relationships between the image block sets at the same scale characterized by the first concatenated feature by using a self-attention mechanism and convolution processing, to obtain the first feature corresponding to the scale, comprises:
    determining a weight matrix based on the self-attention mechanism and the first concatenated feature, wherein the weight matrix comprises weight values characterizing a probability that each of the image block sets at the same scale is abnormal;
    obtaining a weighted feature based on the weight matrix and the first concatenated feature;
    performing convolution processing on the first concatenated feature to obtain a convolved feature; and
    obtaining the first feature based on the weighted feature, the convolved feature, and the first concatenated feature.
  5. The method according to claim 4, wherein the determining a weight matrix based on the self-attention mechanism and the first concatenated feature comprises:
    performing dimension reduction on the first concatenated feature to obtain a dimension-reduced first concatenated feature;
    convolving the dimension-reduced first concatenated feature with a preset first convolution kernel to obtain a first convolution result;
    convolving the dimension-reduced first concatenated feature with a preset second convolution kernel to obtain a second convolution result; and
    determining the weight matrix by applying the self-attention mechanism to the result of multiplying the first convolution result by the transpose of the second convolution result.
  6. The method according to claim 5, wherein the obtaining a weighted feature based on the weight matrix and the first concatenated feature comprises:
    convolving the dimension-reduced first concatenated feature with a preset third convolution kernel to obtain a third convolution result;
    multiplying the weight matrix by the third convolution result to obtain a weighted matrix; and
    determining, as the weighted feature, the sum of the result of convolving the weighted matrix with a preset fourth convolution kernel and the dimension-reduced first concatenated feature.
  7. The method according to claim 4, wherein the performing convolution processing on the first concatenated feature to obtain a convolved feature comprises:
    convolving the first concatenated feature with at least two dilated convolution kernels respectively to obtain a convolution result corresponding to each of the dilated convolution kernels, wherein at least two of the dilated convolution kernels have different dilation rates; and
    concatenating the convolution results corresponding to the dilated convolution kernels to obtain the convolved feature.
  8. The method according to claim 4, wherein the obtaining the first feature based on the weighted feature, the convolved feature, and the first concatenated feature comprises:
    concatenating the weighted feature with the convolved feature and adding the result to the first concatenated feature to obtain the first feature.
  9. The method according to claim 3, wherein the performing feature extraction on each of the image block sets at the same scale to obtain a feature corresponding to each of the image block sets comprises:
    performing feature extraction on each of the image block sets at the same scale to obtain, for each of the image block sets, a feature comprising temporal information between the image blocks in the image block set.
  10. The method according to claim 2, wherein the fusing the first features corresponding to the scales in the same image sequence to obtain a second feature of each of the image sequences comprises:
    reconstructing the first feature of the same scale according to positional relationships of the image block sets to obtain a reconstructed feature corresponding to the scale;
    convolving the reconstructed feature corresponding to the scale with a preset fifth convolution kernel and converting the result into a one-dimensional feature vector through a fully connected layer; and
    accumulating the one-dimensional feature vectors of the scales to obtain the second feature of each of the image sequences.
  11. The method according to claim 2, wherein the determining the correlation features between the image sequences based on the second features of the image sequences comprises:
    concatenating the second features of the image sequences to obtain a second concatenated feature; and
    based on the second concatenated feature, constructing relationships between different image sequences characterized by the second concatenated feature based on a self-attention mechanism and dilated convolution, to determine the correlation features between the image sequences.
  12. The method according to any one of claims 1 to 11, wherein the determining, according to the correlation features between the image sequences, a target image sequence in which an abnormal event exists from among the at least two image sequences comprises:
    detecting the correlation features between the image sequences based on a preset anomaly prediction model to obtain a prediction result of each of the image sequences, wherein the preset anomaly prediction model is a model obtained by training with a weakly supervised training method; and
    determining, according to the prediction results of the image sequences, the target image sequence in which the abnormal event exists.
  13. The method according to claim 12, wherein the method further comprises:
    for positive samples and negative samples in a training sample set, respectively selecting K sample image sequences with larger feature gradients to compute an average feature gradient, wherein K is a positive integer greater than 1;
    constructing a loss function according to the average feature gradient corresponding to the positive samples and the average feature gradient corresponding to the negative samples; and
    training based on the loss function to obtain the preset anomaly prediction model.
  14. The method according to any one of claims 1 to 11, wherein the acquiring at least two image sequences comprises:
    acquiring a video to be detected;
    determining difference values between adjacent frame images in the video to be detected; and
    among adjacent frame images whose difference value is greater than a preset difference threshold, determining the temporally earlier image frame as the last frame of one of the image sequences and the temporally later image frame as the first frame of an image sequence adjacent to the one image sequence.
  15. An abnormal event detection apparatus, the apparatus comprising:
    an acquisition module configured to acquire at least two image sequences, wherein each of the image sequences comprises at least one frame of image;
    a division module configured to divide each of the image sequences at no fewer than two scales to obtain image block sets, each composed of the image blocks at the same position in all image frames at the same scale;
    a first determination module configured to determine correlation features between the image sequences based on the image block sets of each of the image sequences; and
    a second determination module configured to determine, according to the correlation features between the image sequences, a target image sequence in which an abnormal event exists from among the at least two image sequences.
  16. The apparatus according to claim 15, wherein the first determination module is configured to: for each of the image sequences, obtain a first feature corresponding to a scale based on the image block sets at the same scale, wherein the first feature comprises correlations between the image block sets at the same scale; fuse the first features corresponding to the scales in the same image sequence to obtain a second feature of each of the image sequences; and determine the correlation features between the image sequences based on the second features of the image sequences.
  17. The apparatus according to claim 16, wherein the first determination module is configured to: perform feature extraction on each of the image block sets at the same scale to obtain a feature corresponding to each of the image block sets; concatenate the features of the image block sets at the same scale to obtain a first concatenated feature corresponding to the scale; and, based on the first concatenated feature corresponding to the scale, construct relationships between the image block sets at the same scale characterized by the first concatenated feature by using a self-attention mechanism and convolution processing, to obtain the first feature corresponding to the scale.
  18. The apparatus according to claim 17, wherein the first determination module is configured to: determine a weight matrix based on the self-attention mechanism and the first concatenated feature, wherein the weight matrix comprises weight values characterizing a probability that each of the image block sets at the same scale is abnormal; obtain a weighted feature based on the weight matrix and the first concatenated feature; perform convolution processing on the first concatenated feature to obtain a convolved feature; and obtain the first feature based on the weighted feature, the convolved feature, and the first concatenated feature.
  19. The apparatus according to claim 18, wherein the first determination module is configured to: perform dimension reduction on the first concatenated feature to obtain a dimension-reduced first concatenated feature; convolve the dimension-reduced first concatenated feature with a preset first convolution kernel to obtain a first convolution result; convolve the dimension-reduced first concatenated feature with a preset second convolution kernel to obtain a second convolution result; and determine the weight matrix by applying the self-attention mechanism to the result of multiplying the first convolution result by the transpose of the second convolution result.
  20. The apparatus according to claim 19, wherein the first determination module is configured to: convolve the dimension-reduced first concatenated feature with a preset third convolution kernel to obtain a third convolution result; multiply the weight matrix by the third convolution result to obtain a weighted matrix; and determine, as the weighted feature, the sum of the result of convolving the weighted matrix with a preset fourth convolution kernel and the dimension-reduced first concatenated feature.
  21. The apparatus according to claim 19, wherein the first determination module is configured to: convolve the first concatenated feature with at least two dilated convolution kernels respectively to obtain a convolution result corresponding to each of the dilated convolution kernels, wherein at least two of the dilated convolution kernels have different dilation rates; and concatenate the convolution results corresponding to the dilated convolution kernels to obtain the convolved feature.
  22. The apparatus according to claim 19, wherein the first determination module is configured to concatenate the weighted feature with the convolved feature and add the result to the first concatenated feature to obtain the first feature.
  23. The apparatus according to claim 18, wherein the first determination module is configured to perform feature extraction on each of the image block sets at the same scale to obtain, for each of the image block sets, a feature comprising temporal information between the image blocks in the image block set.
  24. The apparatus according to claim 17, wherein the first determination module is configured to: reconstruct the first feature of the same scale according to positional relationships of the image block sets to obtain a reconstructed feature corresponding to the scale; convolve the reconstructed feature corresponding to the scale with a preset fifth convolution kernel and convert the result into a one-dimensional feature vector through a fully connected layer; and accumulate the one-dimensional feature vectors of the scales to obtain the second feature of each of the image sequences.
  25. The apparatus according to claim 17, wherein the first determination module is configured to: concatenate the second features of the image sequences to obtain a second concatenated feature; and, based on the second concatenated feature, construct relationships between different image sequences characterized by the second concatenated feature based on a self-attention mechanism and convolution processing, to determine the correlation features between the image sequences.
  26. The apparatus according to any one of claims 15 to 17, wherein the second determination module is configured to: detect the correlation features between the image sequences based on a preset anomaly prediction model to obtain a prediction result of each of the image sequences, wherein the preset anomaly prediction model is a model obtained by training with a weakly supervised training method; and determine, according to the prediction results of the image sequences, the target image sequence in which the abnormal event exists.
  27. The apparatus according to claim 26, wherein the apparatus further comprises: a computation module configured to, for positive samples and negative samples in a training sample set, respectively select K sample image sequences with larger feature gradients to compute an average feature gradient, wherein K is a positive integer greater than 1; a construction module configured to construct a loss function according to the average feature gradient corresponding to the positive samples and the average feature gradient corresponding to the negative samples; and a training module configured to train based on the loss function to obtain the preset anomaly prediction model.
  28. The apparatus according to any one of claims 15 to 25, wherein the acquisition module is configured to: acquire a video to be detected; determine difference values between adjacent frame images in the video to be detected; and, among adjacent frame images whose difference value is greater than a preset difference threshold, determine the temporally earlier image frame as the last frame of one of the image sequences and the temporally later image frame as the first frame of an image sequence adjacent to the one image sequence.
  29. A computer device, comprising: a processor; and a memory for storing processor-executable instructions; wherein the processor is configured to execute the abnormal event detection method according to any one of claims 1 to 14.
  30. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the abnormal event detection method according to any one of claims 1 to 14.
  31. A computer program, comprising computer-readable code, wherein, when the computer-readable code runs on a device, a processor in the device executes instructions configured to implement the method according to any one of claims 1 to 14.
  32. A computer program product configured to store computer-readable instructions which, when executed, cause a computer to execute the method according to any one of claims 1 to 14.
PCT/CN2022/127087 2022-01-27 2022-10-24 Abnormal event detection method and apparatus, computer device, storage medium, computer program, and computer program product WO2023142550A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210103096.9A CN114511810A (zh) 2022-01-27 2022-01-27 Abnormal event detection method and device, computer equipment, and storage medium
CN202210103096.9 2022-01-27

Publications (1)

Publication Number Publication Date
WO2023142550A1 true WO2023142550A1 (zh) 2023-08-03

Family

ID=81549990

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/127087 WO2023142550A1 (zh) 2022-01-27 2022-10-24 异常事件检测方法及装置、计算机设备、存储介质、计算机程序、计算机程序产品

Country Status (2)

Country Link
CN (1) CN114511810A (zh)
WO (1) WO2023142550A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114511810A (zh) * 2022-01-27 2022-05-17 深圳市商汤科技有限公司 异常事件检测方法及装置、计算机设备、存储介质

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107967440A (zh) * 2017-09-19 2018-04-27 北京工业大学 Surveillance video anomaly detection method based on multi-region variable-scale 3D-HOF
CN110795599A (zh) * 2019-10-18 2020-02-14 山东师范大学 Multi-scale-graph-based video emergency monitoring method and system
US20210158048A1 (en) * 2019-11-26 2021-05-27 Objectvideo Labs, Llc Image-based abnormal event detection
CN113780238A (zh) * 2021-09-27 2021-12-10 京东科技信息技术有限公司 Anomaly detection method and apparatus for multi-indicator time-series signals, and electronic device
CN114511810A (zh) * 2022-01-27 2022-05-17 深圳市商汤科技有限公司 Abnormal event detection method and device, computer equipment, and storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
CAI, YIHENG ET AL.: "Video Anomaly Detection with Multi-Scale Feature and Temporal Information Fusion", NEUROCOMPUTING, 23 October 2020 (2020-10-23), XP086401054, ISSN: 0925-2312, DOI: 10.1016/j.neucom.2020.10.044 *
LI, XINLU; JI, GENLIN; ZHAO, BIN: "Convolutional Auto-Encoder Patch Learning Based Video Anomaly Event Detection and Localization", JOURNAL OF DATA ACQUISITION AND PROCESSING, vol. 36, no. 3, 31 May 2021 (2021-05-31), CN , pages 489 - 497, XP009548138, ISSN: 1004-9037, DOI: 10.16337/j.1004-9037.2021.03.007 *
YANG, XINXIN; LI, HUI-BO; HU, GANG: "An Abnormal Behavior Detection Algorithm Based on Imbalanced Deep Forest", JOURNAL OF CHINA ACADEMY OF ELECTRONICS AND INFORMATION TECHNOLOGY, vol. 14, no. 9, 30 September 2019 (2019-09-30), CN , pages 935 - 942, XP009548123, ISSN: 1673-5692, DOI: 10.3969/j.issn.1673-5692.2019.09.007 *

Also Published As

Publication number Publication date
CN114511810A (zh) 2022-05-17

Similar Documents

Publication Publication Date Title
CN111192292B (zh) Target tracking method based on attention mechanism and Siamese network, and related device
CN112597941B (zh) Face recognition method and apparatus, and electronic device
Tao et al. Manifold ranking-based matrix factorization for saliency detection
CN109492627B (zh) Scene text erasing method based on a fully convolutional network deep model
CN110428399B (zh) Method, apparatus, device and storage medium for detecting images
WO2021248859A1 (zh) Video classification method, apparatus and device, and computer-readable storage medium
CN110765860A (zh) Fall determination method and apparatus, computer device, and storage medium
EP4099217A1 (en) Image processing model training method and apparatus, device, and storage medium
CN111539290B (zh) Video action recognition method and apparatus, electronic device, and storage medium
CN111310705A (zh) Image recognition method and apparatus, computer device, and storage medium
CN109413510B (zh) Video summary generation method and apparatus, electronic device, and computer storage medium
CN112818995B (zh) Image classification method and apparatus, electronic device, and storage medium
CN112487207A (zh) Multi-label image classification method and apparatus, computer device, and storage medium
Zhou et al. Perceptually aware image retargeting for mobile devices
Zhang et al. Retargeting semantically-rich photos
CN111507285A (zh) Face attribute recognition method and apparatus, computer device, and storage medium
WO2023142550A1 (zh) Abnormal event detection method and apparatus, computer device, storage medium, computer program, and computer program product
CN110232348A (zh) Pedestrian attribute recognition method and apparatus, and computer device
Qi et al. 3D visual saliency detection model with generated disparity map
CN113537254A (zh) Image feature extraction method and apparatus, electronic device, and readable storage medium
CN116547711A (zh) Consistency measure for image segmentation processes
Liu et al. Fastshrinkage: Perceptually-aware retargeting toward mobile platforms
Termritthikun et al. An improved residual network model for image recognition using a combination of snapshot ensembles and the cutout technique
Zhou et al. MSFlow: Multiscale Flow-Based Framework for Unsupervised Anomaly Detection
CN114049491A (zh) Fingerprint segmentation model training and fingerprint segmentation method, apparatus, device, and medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22923348

Country of ref document: EP

Kind code of ref document: A1