CN114511810A - Abnormal event detection method and device, computer equipment and storage medium


Info

Publication number: CN114511810A
Application number: CN202210103096.9A
Authority: CN (China)
Legal status: Pending
Prior art keywords: image, feature, convolution, features, splicing
Other languages: Chinese (zh)
Inventors: 李国球, 蔡官熊, 曾星宇, 赵瑞
Current Assignee: Shenzhen Sensetime Technology Co Ltd
Original Assignee: Shenzhen Sensetime Technology Co Ltd

Application filed by Shenzhen Sensetime Technology Co Ltd
Priority to CN202210103096.9A
Publication of CN114511810A
Priority to PCT/CN2022/127087 (published as WO2023142550A1)


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The disclosure relates to an abnormal event detection method and device, a computer device and a storage medium. The method comprises the following steps: acquiring at least two image sequences, wherein each image sequence comprises at least one frame of image; dividing each image sequence by at least two scales to obtain image block sets, each consisting of the image blocks at the same position in all image frames at the same scale; determining correlation features between the image sequences based on the image block sets of each image sequence; and determining, according to the correlation features between the image sequences, a target image sequence in which an abnormal event occurs among the at least two image sequences. By the method, the accuracy of abnormal event detection can be improved.

Description

Abnormal event detection method and device, computer equipment and storage medium
Technical Field
The present disclosure relates to the field of computer vision technologies, and in particular, to a method and an apparatus for detecting an abnormal event, a computer device, and a storage medium.
Background
Video anomaly detection aims to capture abnormal events in a video and determine the time interval in which each abnormal event occurs, where an abnormal event refers to unexpected, rarely occurring behavior. How to improve the accuracy of abnormal event detection has long been a major concern.
Disclosure of Invention
The disclosure provides an abnormal event detection method and device, computer equipment and a storage medium.
According to a first aspect of the embodiments of the present disclosure, there is provided an abnormal event detection method, including: acquiring at least two image sequences, wherein each image sequence comprises at least one frame of image; dividing each image sequence by at least two scales to obtain image block sets, each consisting of the image blocks at the same position in all image frames at the same scale; determining correlation features between the image sequences based on the image block sets of each image sequence; and determining, according to the correlation features between the image sequences, a target image sequence in which an abnormal event occurs among the at least two image sequences.
According to a second aspect of the embodiments of the present disclosure, there is provided an abnormal event detection apparatus including: an acquisition module for acquiring at least two image sequences; wherein each image sequence comprises at least one frame of image; the dividing module is used for dividing each image sequence by at least two scales to obtain an image block set consisting of image blocks at the same position in all image frames at the same scale; a first determining module, configured to determine a correlation characteristic between the image sequences based on the image block set of each image sequence; and the second determining module is used for determining a target image sequence with an abnormal event in the at least two image sequences according to the correlation characteristics between the image sequences.
According to a third aspect of embodiments of the present disclosure, there is provided a computer device comprising: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to perform the abnormal event detection method as described in the first aspect above.
According to a fourth aspect of an embodiment of the present disclosure, there is provided a storage medium including: the instructions in said storage medium, when executed by a processor of a device, enable the device to perform the abnormal event detection method as described in the above first aspect.
The technical scheme provided by the embodiment of the disclosure can have the following beneficial effects:
in the embodiments of the disclosure, considering that some abnormal events occur in a very small area of an image frame while others may span the whole picture, treating the image frame as a whole or performing region division at only a single scale cannot cope with the variety of abnormal events; the disclosure therefore performs multi-scale division on each frame of image in each image sequence, which improves the scale robustness of abnormal event detection. In addition, the disclosure determines the correlation features between the image sequences based on the image block sets of each image sequence, so that the abnormal event detection apparatus can exploit the correlation between image sequences on a multi-scale basis, improving the detection precision of abnormal events.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
Fig. 1 is a first flowchart of an abnormal event detection method according to an embodiment of the present disclosure.
Fig. 2 is an exemplary diagram illustrating one scale division, according to an embodiment of the disclosure.
Fig. 3 is a flowchart illustrating an abnormal event detection method according to an embodiment of the present disclosure.
Fig. 4 is a schematic diagram illustrating obtaining a first feature based on a first splicing feature in an embodiment of the present disclosure.
Fig. 5 is a schematic diagram illustrating a feature fusion in an embodiment of the disclosure.
Fig. 6 is a flowchart of an abnormal event detection method according to an embodiment of the present disclosure.
Fig. 7 is a fourth flowchart of an abnormal event detection method according to an embodiment of the present disclosure.
Fig. 8A is a schematic diagram illustrating an abnormal event detection method according to an embodiment of the disclosure.
Fig. 8B is a schematic process diagram of a part of the modules in fig. 8A according to an embodiment of the present disclosure.
Fig. 9 is a diagram illustrating an abnormal event detection apparatus according to an embodiment of the present disclosure.
Fig. 10 is a hardware entity diagram of a computer device according to an embodiment of the present disclosure.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below do not represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
The execution subject of the abnormal event detection method provided by the embodiments of the present disclosure may be an abnormal event detection apparatus; for example, the method may be executed by a terminal device, a server or another electronic device, where the terminal device may be user equipment (UE), a mobile device, a user terminal, a cellular phone, a cordless phone, a personal digital assistant (PDA), a handheld device, a computing device, a vehicle-mounted device, a wearable device, or the like. In some possible implementations, the abnormal event detection method may be implemented by a processor calling computer-readable instructions stored in a memory.
In an embodiment of the present disclosure, the abnormal event detection apparatus may include an image capturing component, so that the image capturing component captures continuous frame images of a certain scene and the images are divided into at least two image sequences. For example, the image capturing component is a camera that can capture a video at a fixed position; the abnormal event detection apparatus including the camera can divide the video into at least two image sequences in the time dimension, where one image sequence may be referred to as a video segment, and the image frames of different video segments may not overlap. Alternatively, the abnormal event detection apparatus may not include an image capturing component, in which case it can receive at least two already divided image sequences transmitted to it; or, after a plurality of videos of the same scene are captured by independently arranged cameras at different angles and transmitted to the abnormal event detection apparatus, each video received by the apparatus may be referred to as an image sequence. In the disclosed embodiment, an image sequence may be a sequence within a time window, i.e., the image frames in the image sequence are adjacent in time.
It should be noted that, in the embodiment of the present disclosure, a specific obtaining manner of the image sequence and a content of at least one frame of image included in the image sequence may be determined according to an actual requirement and an application scenario, and the embodiment of the present disclosure is not limited.
Fig. 1 is a first flowchart of an abnormal event detection method according to an embodiment of the present disclosure, and as shown in fig. 1, the abnormal event detection method includes the following steps:
s11, acquiring at least two image sequences; wherein each image sequence comprises at least one frame of image;
s12, dividing each image sequence into at least two scales to obtain an image block set consisting of image blocks at the same position in all image frames at the same scale;
s13, determining correlation characteristics among the image sequences based on the image block sets of the image sequences;
and S14, determining a target image sequence with abnormal events in the at least two image sequences according to the correlation characteristics between the image sequences.
In the embodiment of the disclosure, after the abnormal event detection device acquires the at least two image sequences, each image sequence is divided at at least two scales, that is, each frame of image included in the image sequence is divided at at least two scales. After the multi-scale division of an image sequence, the image blocks at the same position in all image frames at the same scale form an image block set.
Illustratively, the abnormal event detection device divides the video V into T non-overlapping image sequences $\{v_t\}_{t=1}^{T}$. For each image sequence, each frame image is divided using R groups of sliding windows of different sizes $\{h_r \times w_r\}_{r=1}^{R}$. Fig. 2 is an exemplary diagram of scale division according to an embodiment of the present disclosure; as shown in Fig. 2, a certain image sequence is divided at 3 scales (R = 3), and the numbers of corresponding image blocks in each image frame are 1, 6 and 15, respectively. In an embodiment of the present disclosure, co-located image blocks in the different image frames of an image sequence are taken as a whole to form an image block set, i.e., one of the small cubes shown in Fig. 2. In the figure, the first scale includes 1 cube, i.e., 1 image block set at that scale; the second scale includes 6 cubes, i.e., 6 image block sets at that scale; and the third scale includes 15 cubes, i.e., 15 image block sets at that scale.

In the embodiment of the disclosure, the image block sets at the same scale can be represented as $\{P^r_i\}_{i=1}^{N_r}$, where $N_r$ is the number of image block sets corresponding to the scale. As shown in Fig. 2, $N_r$ is 1 for the division at the first scale, 6 for the division at the second scale, and 15 for the division at the third scale.
It should be noted that, in the embodiment of the present disclosure, when each frame of image in the image sequence is divided at a given scale, the image blocks corresponding to the same scale have the same size. In addition, when each frame of image is divided into non-overlapping image blocks using a sliding window, the number of image blocks per frame at the corresponding scale may be the ratio of the frame size to the sliding window size rounded down; that is, when the frame size is not evenly divisible by the sliding window size, no additional image blocks are obtained by, for example, padding with '0's or '1's, and the content of each image block in the present disclosure belongs to the content of the frame image before division.
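The division just described can be sketched in a few lines of PyTorch; the frame size (240 × 320) and the window sizes below are illustrative assumptions chosen so that the three scales yield the 1, 6 and 15 blocks of Fig. 2, and the function name is hypothetical:

```python
import torch

def make_patch_sets(clip: torch.Tensor, window: int) -> torch.Tensor:
    """Split every frame of one image sequence into non-overlapping
    window x window blocks and group co-located blocks across frames
    into "cubes" (image block sets).

    clip: (F, C, H, W) tensor, F frames of one image sequence.
    Returns: (N_r, F, C, window, window), N_r = (H // window) * (W // window).
    """
    f, c, h, w = clip.shape
    nh, nw = h // window, w // window   # floor division: border leftovers are dropped
    blocks = (clip[:, :, : nh * window, : nw * window]
              .unfold(2, window, window)    # split the height axis
              .unfold(3, window, window))   # split the width axis
    # (F, C, nh, nw, win, win) -> one cube per spatial position, over all frames
    return blocks.permute(2, 3, 0, 1, 4, 5).reshape(nh * nw, f, c, window, window)

clip = torch.randn(3, 3, 240, 320)                     # 3 frames, assumed size
patch_sets = [make_patch_sets(clip, s) for s in (240, 96, 64)]
print([p.shape[0] for p in patch_sets])                # [1, 6, 15] as in Fig. 2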
In step S13, after the abnormal event detection device obtains the image block set of each image sequence, the feature that can characterize each image sequence can be obtained, and then the correlation feature between the image sequences is obtained based on the feature that characterizes the image sequence.
In an embodiment, when the features that can characterize an image sequence are obtained based on its image block sets, the features of each image block in the image block sets of different scales can, for example, be directly spliced and used as the features of the image sequence. As in the previous example, suppose each image sequence includes 3 frames of images, each frame is divided into image block sets at 3 scales, and each image block corresponds to one feature; then the number of features of the image sequence is (1 + 6 + 15) image blocks per frame × 3 frames = 66 features in total. In this embodiment, if the at least two image sequences are obtained by dividing the same video in the time dimension, the correlation features between the image sequences obtained based on the features of the image sequences may be referred to as temporal correlation features.
In another embodiment, when obtaining the features that can characterize the image sequence based on each image block set of the image sequence, for example, correlation features between different image block sets may be determined for different image block sets of the same scale, and then the features of the image sequence may be obtained based on the correlation features between different image block sets of the same scale. Or for each frame of image, firstly obtaining the correlation characteristics among the image blocks, and then obtaining the characteristics of the image sequence based on the correlation characteristics among the image blocks.
It can be understood that since the image blocks have the location attribute, the correlation features between different sets of image blocks in the same scale or between a plurality of image blocks in a frame image have a spatial attribute, and the correlation features can be characterized as spatial correlation. In the embodiment of the present disclosure, if at least two image sequences are obtained by dividing the same video in the time dimension, the correlation feature between the image sequences obtained based on the features of the image sequences may be referred to as a spatio-temporal correlation feature.
Of course, if the at least two image sequences are image sequences of different angles of the same scene in the embodiment of the present disclosure, the correlation feature between the at least two image sequences may be understood as a spatial correlation feature. In addition, if the correlation characteristics between different image block sets of the same scale or the correlation characteristics between a plurality of image blocks of a frame of image are obtained first, and then the correlation characteristics of each image sequence are obtained based on the correlation characteristics, the correlation characteristics of the image sequence can be understood as including the characteristics of local spatial correlation and global spatial correlation. The local spatial correlation is associated with the position attribute of the image block, and the global spatial correlation is associated with the acquisition angle attribute of the image sequence.
The correlation features between the image sequences are used to characterize the correlation between the image sequences, and may include, for example, features obtained by weighting the features of each image sequence with different weights, and the correlation between different image sequences is represented by the distribution of weights. In addition, the correlation feature of the image sequences may further include a feature for any image sequence, and a partial feature of other image sequences is fused, that is, the association relationship between the image sequences is embodied by feature fusion. It should be noted that the present disclosure does not specifically limit the manner of obtaining the correlation features.
In the embodiment of the present disclosure, if there are T image sequences, the correlation features between the image sequences are denoted $\phi_{ST}$; $\phi_{ST}$ includes the features corresponding to each of the T image sequences, except that the feature corresponding to each image sequence has been subjected to correlation processing based on the features of the other image sequences.
In step S14, after obtaining the correlation features between the image sequences, the abnormal event detection apparatus may determine the target image sequence with the abnormal event in at least two image sequences according to the correlation features, for example, by using a conventional feature recognition method or a trained model.
It can be understood that, in the embodiment of the present disclosure, considering that some abnormal events occur in a very small area of an image frame while others may span the whole picture, so that treating the image frame as a whole or performing region division at a single scale cannot cope with the variety of abnormal events, the present disclosure performs multi-scale division on each frame of image in each image sequence, which improves the scale robustness of abnormal event detection. In addition, the disclosure determines the correlation features between the image sequences based on the image block sets of each image sequence, for example obtaining the temporal and/or spatial correlation through weight distribution, so that the abnormal event detection apparatus can exploit the correlation between image sequences on a multi-scale basis, improving the detection precision of abnormal events.
Fig. 3 is a second flowchart of an abnormal event detection method according to an embodiment of the present disclosure; as shown in Fig. 3, step S13 in Fig. 1 may include the following steps:
s13a, aiming at each image sequence, obtaining a first feature corresponding to the scale based on each image block set under the same scale; wherein the first feature comprises correlation among image block sets of the same scale;
s13b, fusing the first features corresponding to all scales in the same image sequence to obtain a second feature of each image sequence;
s13c, determining the correlation feature between the image sequences based on the second feature of the image sequences.
In step S13a, after the image block sets corresponding to a scale are determined, a first feature corresponding to the scale is obtained, the first feature including the correlation between the image block sets of the same scale, for example the correlation features between the small cubes of the third scale shown in Fig. 2.
The correlation characteristics between each small cube block. It will be appreciated that since the image blocks in the set of image blocks carry a position attribute, each set of image blocks also carries a position attribute, the first feature obtained is a feature that includes spatial correlation between the sets of image blocks.
Illustratively, if the abnormal event detection device performs division at R scales, the first features are denoted $\phi^r$, and the abnormal event detection device obtains R groups of scale-corresponding first features $\phi^r$ ($r = 1, \dots, R$) in total.
In step S13b, the first features corresponding to the respective scales in the same image sequence are fused to obtain the second feature of each image sequence. If there are T image sequences and the second feature is denoted $\phi'_t$, the abnormal event detection device obtains T groups of $\phi'_t$.
In step S13c, a correlation feature between the image sequences is determined based on the second feature of each image sequence, and since the first feature is a spatial correlation feature including between the image block sets, if at least two image sequences are image sequences of the same video in different time periods, the correlation feature between the image sequences obtained in this step may be a spatio-temporal correlation feature. In addition, as in the above analysis, if at least two image sequences are image sequences of different angles of the same scene, the correlation feature between the image sequences may be a feature including a local spatial correlation and a global spatial correlation.
It is understood that, in the embodiment of the present disclosure, the set of image blocks composed of image blocks at the same position in all frame images of the image sequence is used as the processing unit to obtain the first feature without focusing on one image block of each frame image, so that the amount of calculation can be relatively reduced when the correlation feature between the image sequences is further obtained based on the first feature; and the obtained correlation characteristics among the image block sets comprise multi-dimensional correlation characteristics, so that the accuracy of abnormal event detection can be improved.
In an embodiment, the obtaining a first feature corresponding to a scale based on each image block set at the same scale includes:
performing feature extraction on each image block set under the same scale to obtain features corresponding to the image block sets;
splicing the features of the image block set with the same scale to obtain a first splicing feature corresponding to the scale;
and constructing, based on the first splicing feature corresponding to the scale, an association relationship between the image block sets of the same scale represented by the first splicing feature by utilizing a self-attention mechanism and convolution processing, to obtain the first feature corresponding to the scale.
In the embodiment, the image block set is taken as a whole to obtain the features of the image block set, and then the features of the image block set with the same scale are spliced to obtain the first splicing feature corresponding to the scale.
When the features of the image block sets of the same scale are spliced, each image block set can be spliced horizontally as a whole. For example, if the feature corresponding to each image block set is D-dimensional after feature extraction, the first splicing feature is denoted $F^r$, and the dimension of $F^r$ is (number of image block sets at the scale) × D, i.e., $N_r \times D$.
In an embodiment, the performing feature extraction on each image block set under the same scale to obtain features corresponding to the image block sets includes:
and performing feature extraction on each image block set under the same scale to obtain the corresponding features of the image block set, wherein the features comprise time sequence information among the image blocks in the image block set.
As mentioned above, each image frame in the image sequence is adjacent in time, i.e. there is timing information between image frames within the image sequence, and thus also between image blocks in the image set. In this embodiment, when performing feature extraction on the image block set, a feature including timing information between image blocks in the image block set may be obtained.
For example, the present disclosure may utilize a preset I3D feature encoder to perform feature extraction on each image block set at the same scale, so as to obtain features containing the timing information between the image blocks in the image block set. It can be understood that, because the network structure of the I3D feature encoder is deep and it uses 3-dimensional convolution kernels, and the image block set contains timing information, the 3-dimensional convolution kernels can capture the timing information of the image block set, making the feature extraction more complete.
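The patent prescribes a preset I3D encoder; a full I3D is outside the scope of a sketch, so the following stand-in (all layer widths are assumed) only illustrates the interface: a small 3-D-convolutional encoder that maps each cube to a D-dimensional feature, its 3-D kernels spanning the frame axis so the cube's timing information is aggregated:

```python
import torch
import torch.nn as nn

D = 256  # assumed feature dimension

# stand-in for the pretrained I3D feature encoder (architecture assumed):
# 3-D kernels convolve over (frame, height, width), so timing information
# between the image blocks of a cube enters the extracted feature
encoder = nn.Sequential(
    nn.Conv3d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv3d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(64, D))

def encode_sets(cubes: torch.Tensor) -> torch.Tensor:
    """cubes: (N_r, F, C, h, w) image block sets of one scale.
    Returns the first splicing feature F^r of shape (N_r, D)."""
    return encoder(cubes.permute(0, 2, 1, 3, 4))  # Conv3d wants (N, C, F, h, w)
```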
In the embodiment of the disclosure, after the first splicing feature corresponding to the scale is obtained, the association relationship between the image block sets of the same scale represented by the first splicing feature can be established, so as to obtain the first feature corresponding to the scale.
It should be noted that, in the embodiments of the present disclosure, as described above, the first feature corresponding to the scale may be denoted $\phi^r$; the dimension of the obtained scale-corresponding first feature is the same as the dimension of the first splicing feature, i.e., the dimension of $\phi^r$ may also be $N_r \times D$, except that the first feature includes the correlation between the image block sets of the same scale.
It can be understood that the association relationship between the image block sets of the same scale represented by the first stitching feature is constructed through a self-attention mechanism and convolution processing in the present disclosure, and based on the machine vision theory, the obtained first feature can have a better enhancement effect, for example, a part of interest (i.e., a part where an anomaly may exist) in each image block set of the same scale is selectively highlighted, so that the detection effect of the abnormal event can be further improved.
In an embodiment, the constructing, based on the first splicing feature corresponding to the scale, an association relationship between the image block sets of the same scale represented by the first splicing feature by using a self-attention mechanism and convolution processing to obtain the first feature corresponding to the scale includes:
determining a weight matrix based on the self-attention mechanism and the first stitching feature; wherein the weight matrix comprises: representing the weight value of the probability of the abnormality of each image block set with the same scale;
obtaining weighted features based on the weight matrix and the first splicing features;
performing convolution processing on the first splicing characteristics to obtain characteristics after convolution;
obtaining the first feature based on the weighted feature, the convolved feature, and the first splicing feature.
In this embodiment, a weight matrix is determined based on a self-attention mechanism, where a weight value in the weight matrix represents a probability that each image block set of the same scale has an abnormality, and if the weight value is larger, the probability that the image block set has the abnormality is larger.
In this embodiment, the first splicing feature is further processed by convolution, for example by ordinary convolution or by dilated convolution (also called hole or cavity convolution). When the first splicing feature is convolved, since it includes the features of the image block sets of the same scale, the features of the plurality of image block sets can be associated through the convolution operation of the convolution kernel.
In an embodiment, the convolving the first splicing feature to obtain a convolved feature includes:
convolving the first splicing feature with at least two dilated convolution kernels to obtain a convolution result corresponding to each dilated convolution kernel; wherein the dilation rates of the at least two dilated convolution kernels are different;
and splicing the convolution results corresponding to the dilated convolution kernels to obtain the convolved feature.
In this embodiment, the first splicing feature is processed by dilated convolution; for example, the at least two dilated convolution kernels include three, each being a one-dimensional convolution kernel, with dilation rates of 1, 2 and 4, respectively. If the dimension of the first splicing feature $F^r$ is $N_r \times D$, then after processing with the three dilated convolution kernels, the dimension of the convolution result corresponding to each kernel may be $N_r \times D/4$, and splicing the convolution results corresponding to the dilated convolution kernels gives a convolved feature of dimension $N_r \times 3D/4$.
In the disclosed embodiment, the convolved result can be denoted $F^{r,A} = [\mathrm{DC}_1; \mathrm{DC}_2; \mathrm{DC}_3]$, where $\mathrm{DC}_1$, $\mathrm{DC}_2$ and $\mathrm{DC}_3$ are the convolution results corresponding to the respective dilated convolution kernels.
Of course, the present disclosure is not limited to the above three one-dimensional dilated convolution kernels; since the weighted feature, the convolved feature and the first splicing feature ultimately need to be combined into the first feature, the number and sizes of the dilated convolution kernels and the corresponding dilation rates may be set according to actual needs.
It can be understood that dilated convolution enlarges the receptive field, and when several dilated convolution kernels with different dilation rates are superposed, the different receptive fields bring multi-scale information; the first splicing feature is therefore enhanced by the convolved feature obtained by convolving it with multiple dilated convolution kernels and splicing the convolution results.
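As a minimal sketch of this branch (kernel size 3 is an assumption; the patent fixes only the dilation rates 1, 2, 4 and the D/4 output width), each one-dimensional convolution runs along the image-block-set axis and the three results are spliced into the $N_r \times 3D/4$ convolved feature:

```python
import torch
import torch.nn as nn

class DilatedBranch(nn.Module):
    """Three 1-D dilated convolutions (dilation rates 1, 2, 4), each mapping
    D channels to D/4; outputs are spliced to width 3*D/4."""
    def __init__(self, d: int, k: int = 3):            # kernel size k assumed
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(d, d // 4, kernel_size=k, dilation=r, padding=r * (k - 1) // 2)
            for r in (1, 2, 4))

    def forward(self, f_r: torch.Tensor) -> torch.Tensor:
        # f_r: (N_r, D) first splicing feature; convolve along the set axis
        x = f_r.t().unsqueeze(0)                       # (1, D, N_r)
        out = torch.cat([conv(x) for conv in self.convs], dim=1)
        return out.squeeze(0).t()                      # (N_r, 3*D/4)
```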
In one embodiment, the determining a weight matrix based on the self-attention mechanism and the first stitching feature includes:
performing dimension reduction processing on the first splicing feature to obtain a first splicing feature after dimension reduction;
performing convolution on the first splicing characteristic subjected to dimensionality reduction by using a preset first convolution kernel to obtain a first convolution result;
performing convolution on the first splicing characteristics subjected to dimensionality reduction by using a preset second convolution kernel to obtain a second convolution result;
and determining the weight matrix by utilizing the self-attention mechanism according to a result obtained by multiplying the first convolution result and the transpose of the second convolution result.
In the embodiment of the disclosure, dimension reduction is first performed on the first splicing feature to reduce the subsequent computation. Illustratively, the dimension reduction may be performed by a one-dimensional convolution, and the reduced-dimension first splicing feature may be denoted $\hat{F}^r$, with dimension $N_r \times D/4$. Of course, the present disclosure is not limited to reducing the feature dimension of each image block set to 1/4 of the original feature dimension.
In the embodiment of the disclosure, the self-attention mechanism is based on predicting the covariance between any image block set and the other image block sets of the same scale: each image block set is regarded as a random variable, and each weight in the obtained weight matrix is the correlation between one image block set and all the image block sets.
In this embodiment, the preset first convolution kernel and the preset second convolution kernel may both be one-dimensional convolution kernels, and the reduced-dimension first splicing feature is convolved with each of them, so that the obtained first convolution result and second convolution result may both be one-dimensional vectors. The product of the first convolution result and the transpose of the second convolution result, passed through the normalized exponential function (softmax) of the self-attention mechanism, gives an attention map, i.e., the weight matrix, which is essentially a covariance matrix.
Illustratively, if the dimension of the first convolution result is $N_r \times D/4$ and the dimension of the transposed second convolution result is $D/4 \times N_r$, then the dimension of the weight matrix is $N_r \times N_r$.
In one embodiment, the obtaining the weighted features based on the weight matrix and the first stitched features includes:
performing convolution on the reduced-dimension first splicing feature by using a preset third convolution kernel to obtain a third convolution result;
multiplying the weight matrix and the third convolution result to obtain a weighting matrix;
and determining, as the weighted feature, the sum of the result of convolving the weighting matrix with a preset fourth convolution kernel and the reduced-dimension first splicing feature.
In this embodiment, the preset third convolution kernel and the preset fourth convolution kernel may also be one-dimensional convolution kernels. The third convolution result, obtained by convolving the reduced-dimension first splicing feature with the preset third convolution kernel, is multiplied by the weight matrix; each entry of the resulting weighting matrix is a weighted sum over the image block sets in the reduced-dimension first splicing feature, the weights being the covariances between the image block sets of the same scale included in the reduced-dimension first splicing feature.
Illustratively, the dimension of the third convolution result may be $N_r \times D/4$, the dimension of the weighting matrix may be $N_r \times D/4$, and the dimension of the weighted feature may be $N_r \times D/4$.
In the embodiment of the disclosure, the result obtained by convolving the weighting matrix with the preset fourth convolution kernel is summed with the reduced-dimension first splicing feature, i.e., a residual connection is applied, and the resulting weighted feature has a stronger representation capability for each image block set.
In the present disclosure, the above process of obtaining the weight matrix and the weighted feature can be represented by the following formulas (1) and (2):

$$M = \mathrm{softmax}\!\left( (W_\theta \hat{F}^r)(W_\phi \hat{F}^r)^\top \right) \quad (1)$$

$$F^{r,M} = W_z\!\left( M \cdot W_g \hat{F}^r \right) + \hat{F}^r \quad (2)$$

In the above formulas (1) and (2), $W_\theta$ is the preset first convolution kernel, $W_\phi$ is the preset second convolution kernel, $W_g$ is the preset third convolution kernel, $W_z$ is the preset fourth convolution kernel, and $\hat{F}^r$ is the reduced-dimension first splicing feature. The softmax part yields the weight matrix $M$; $M \cdot W_g\hat{F}^r$ is the weighting matrix, and $F^{r,M}$ is the weighted feature.
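A sketch of formulas (1) and (2), under the assumption that the dimension reduction and $W_\theta$, $W_\phi$, $W_g$, $W_z$ are all kernel-size-1 one-dimensional convolutions; the attention map M is $N_r \times N_r$ and the residual connection keeps the output at the reduced D/4 width:

```python
import torch
import torch.nn as nn

class SetSelfAttention(nn.Module):
    """Covariance-style self-attention over N_r image block sets, eqs. (1)-(2)."""
    def __init__(self, d: int):
        super().__init__()
        dr = d // 4
        self.reduce = nn.Conv1d(d, dr, 1)    # dimension reduction to D/4
        self.w_theta = nn.Conv1d(dr, dr, 1)  # preset first convolution kernel
        self.w_phi = nn.Conv1d(dr, dr, 1)    # preset second convolution kernel
        self.w_g = nn.Conv1d(dr, dr, 1)      # preset third convolution kernel
        self.w_z = nn.Conv1d(dr, dr, 1)      # preset fourth convolution kernel

    def forward(self, f_r: torch.Tensor) -> torch.Tensor:
        # f_r: (N_r, D) first splicing feature
        x = self.reduce(f_r.t().unsqueeze(0))                  # (1, D/4, N_r)
        theta, phi, g = self.w_theta(x), self.w_phi(x), self.w_g(x)
        # eq. (1): attention map M = softmax(theta^T @ phi), an N_r x N_r weight matrix
        m = torch.softmax(theta.transpose(1, 2) @ phi, dim=-1)
        weighting = (m @ g.transpose(1, 2)).transpose(1, 2)    # weighting matrix
        # eq. (2): residual connection back to the reduced splicing feature
        return (self.w_z(weighting) + x).squeeze(0).t()        # (N_r, D/4)
```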
In one embodiment, the obtaining the first feature based on the weighted feature, the convolved feature, and the first stitching feature includes:
and splicing the weighted features and the convolved features, and then adding the weighted features and the convolved features to obtain the first features.
In this embodiment, the first feature can be expressed by the following formula (3):

$$\phi^r = \left[ F^{r,M};\, F^{r,A} \right] + F^r \quad (3)$$

where $F^{r,M}$ is the weighted feature, $F^{r,A}$ is the convolved feature, $F^r$ is the first splicing feature, and $\phi^r$ is the first feature, whose dimension is $N_r \times D$.
Fig. 4 is a schematic diagram of obtaining the first feature based on the first splicing feature in an embodiment of the present disclosure. As shown in Fig. 4, the right branch is the process of determining the weight matrix, i.e., the attention map M in Fig. 4, based on the self-attention mechanism and the first splicing feature, and obtaining the weighted feature based on the weight matrix and the first splicing feature; the left branch is the process of processing the first splicing feature by dilated convolution to obtain the convolved feature. For specific description, reference may be made to the foregoing, which is not repeated here.
In an embodiment, the fusing the first features corresponding to the respective scales in the same image sequence to obtain the second feature of each image sequence includes:
reconstructing the first features of the same scale according to the position relation of each image block set to obtain reconstruction features corresponding to the scale;
convolving the reconstruction feature corresponding to each scale with a preset fifth convolution kernel, and then converting it into a one-dimensional feature vector through a fully connected layer;
and accumulating the one-dimensional feature vectors of all scales to obtain a second feature of each image sequence.
In this embodiment, because the first feature corresponding to the scale is obtained by splicing over the image block sets of the same scale and has the same dimension as the first splicing feature, the first feature may be understood as the result of horizontally splicing the correlation features of the image block sets of the same scale. Because the image blocks in the image block sets have position attributes, the present disclosure can reconstruct the first feature of the same scale according to the positional relationship of the image block sets to obtain the reconstruction feature corresponding to the scale. It can be understood that the reconstruction feature is a three-dimensional tensor, which may be denoted $\phi^{r\prime}$ in embodiments of the present disclosure; each element of the reconstruction feature characterizes one image block set, and the feature dimension is the D dimension.
After the reconstruction feature is obtained by reconstruction based on the positional relationship of the image block sets, it is converted into a one-dimensional feature vector through a preset fifth convolution kernel and a fully connected layer. The preset fifth convolution kernel may be a two-dimensional convolution kernel, used to perform dimension-reducing convolution on the reconstruction feature; the one-dimensional feature vector obtained by feature conversion in the fully connected layer after the two-dimensional convolution may be denoted $\phi^{r\prime\prime}$, and its feature dimension may be the D dimension. It can be understood that the one-dimensional feature vector is a feature characterizing the image block sets at the same scale.
Since the second feature of the image sequence is obtained by accumulating the one-dimensional feature vectors of each scale, it can be understood that the second feature of the image sequence is a feature fused with multiple scales.
Fig. 5 is a schematic diagram of a principle example of feature fusion in an embodiment of the present disclosure, which is described by taking a first feature corresponding to one scale as an example, as shown in fig. 5, what is shown by a dashed box L51a is the first feature corresponding to one scale, where the first feature includes correlation between image block sets of the same scale. The cube L52a in the drawing represents a reconstruction feature obtained by reconstructing the first feature according to the positional relationship of the image blocks in each image block set. The reconstructed features are converted into one-dimensional feature vectors after passing through the two-dimensional convolutional layer L53a and the fully connected layer L54 a. As shown in fig. 5, each first feature corresponds to a reconstruction feature, and the reconstruction features are converted into one-dimensional feature vectors through the two-dimensional convolution layer and the full link layer, and then accumulated to obtain L50, i.e., a second feature corresponding to the image sequence. Wherein, the preset fifth convolution kernel of the present disclosure can be included in the two-dimensional convolution layer. L53a, L53b, and L53c shown in fig. 5 may be similar two-dimensional convolution layers, and L54a, L54b, and L54c may be similar all-connected layers, which does not limit the embodiment of the present disclosure.
It can be understood that the present disclosure enables the abnormal event detection apparatus to have a local to global perception on the image frames in the image sequence by fusing the features of the image block sets of all scales, thereby improving the robustness to the abnormal events of different scales.
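A sketch of the fusion of Fig. 5, assuming the preset fifth convolution kernel is a 1 × 1 two-dimensional convolution and each scale has its own fully connected layer (the flattened sizes differ per grid); each scale's first feature is reshaped back to its spatial grid, convolved, mapped to a D-dimensional vector, and the per-scale vectors are accumulated into the second feature:

```python
import torch
import torch.nn as nn

class PatchAggregation(nn.Module):
    """Fuse the first features of all scales into one second feature."""
    def __init__(self, d: int, grids):
        super().__init__()
        self.grids = grids                 # e.g. [(1, 1), (2, 3), (3, 5)] as in Fig. 2
        self.conv = nn.Conv2d(d, d, 1)     # preset fifth convolution kernel (size assumed)
        self.fcs = nn.ModuleList(nn.Linear(nh * nw * d, d) for nh, nw in grids)

    def forward(self, phis):
        # phis: list of (N_r, D) first features, one per scale
        out = torch.zeros(self.fcs[0].out_features)
        for phi_r, (nh, nw), fc in zip(phis, self.grids, self.fcs):
            grid = phi_r.reshape(nh, nw, -1).permute(2, 0, 1).unsqueeze(0)  # (1, D, nh, nw)
            out = out + fc(self.conv(grid).flatten())    # one-dimensional feature vector
        return out                                       # (D,) second feature
```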
In one embodiment, the determining the correlation feature between the image sequences based on the second feature of each image sequence includes:
splicing the second features of the image sequences to obtain second splicing features;
and constructing an association relation between different image sequences characterized by the second splicing characteristics based on the second splicing characteristics, and determining the correlation characteristics between the image sequences.
In the embodiment of the disclosure, the correlation features between the image sequences may be determined in the same manner as the correlation features between the image block sets of the same scale, i.e., the manner in which the scale-corresponding first feature is obtained.
In this embodiment, the second feature of each image sequence may be stitched, for example, horizontally, to obtain a second stitched feature, and then, based on the principle of fig. 4, a weight matrix of the image sequence is determined based on the self-attention mechanism and the second stitched feature, where the weight matrix of the image sequence includes: and a weight value representing the probability of abnormality in each image sequence. And then, obtaining weighted features corresponding to all the image sequences based on the weight matrix of the image sequences and the second splicing features. When processing is performed based on the self-attention mechanism, dimension reduction processing may be performed on the second stitching feature, for example, dimension reduction processing may be performed by using one-dimensional convolution. And in addition, convolution processing is carried out on the second splicing features to obtain the convolved features corresponding to all the image sequences, and further the weighted features corresponding to all the image sequences, the convolved features corresponding to all the image sequences and the second splicing features are used for determining the correlation features among the image sequences.
Illustratively, if the second feature of each image sequence is $\phi'_t$, then the T image sequences give T groups of $\phi'_t$, and splicing the second features of the image sequences gives the second splicing feature, denoted $F'$.
The above process can be represented by the following formulas (4) to (6):

$$M' = \mathrm{softmax}\!\left( (W_\theta \hat{F}')(W_\phi \hat{F}')^\top \right) \quad (4)$$

$$\phi^{*,M} = W_z\!\left( M' \cdot W_g \hat{F}' \right) + \hat{F}' \quad (5)$$

$$\phi_{ST} = \left[ \phi^{*,M};\, \phi^{*,A} \right] + F' \quad (6)$$

where $\hat{F}'$ is the reduced-dimension second splicing feature; $W_\theta$, $W_\phi$, $W_g$ and $W_z$ are as described for formulas (1) and (2); the softmax part yields the weight matrix $M'$ of the image sequences; $M' \cdot W_g\hat{F}'$ is the weighting matrix corresponding to all image sequences, and $\phi^{*,M}$ is the weighted feature of all image sequences; $\phi^{*,A}$ is the convolved feature of all image sequences; and $\phi_{ST}$ represents the correlation features between the image sequences.
It should be noted that, in the embodiment of the present disclosure, the dimension of $\phi_{ST}$ may be (number of image sequences) × (feature dimension of each image sequence), i.e., $T \times D$.
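Because formulas (4) to (6) mirror formulas (1) to (3), the video temporal relation step can, under that assumption, reuse the same module sketches, now running over the T sequence features instead of the $N_r$ image block sets:

```python
# f2: (T, D) second features of the T image sequences, spliced row-wise
video_attn = SetSelfAttention(d=D)                    # eqs. (4)-(5) over sequences
video_dil = DilatedBranch(d=D)                        # convolved feature phi^{*,A}
phi_st = torch.cat([video_attn(f2), video_dil(f2)], dim=-1) + f2   # eq. (6), (T, D)
```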
Fig. 6 is a third flowchart of an abnormal event detection method according to an embodiment of the present disclosure; as shown in Fig. 6, step S14 in Fig. 1 may include the following steps:
s14a, detecting correlation characteristics among the image sequences based on a preset abnormal prediction model to obtain a prediction result of each image sequence; the preset abnormal prediction model is a model obtained by training by adopting a weak supervision training method;
and S14b, determining the target image sequence with the abnormal event according to the prediction result of each image sequence.
As described above, a target image sequence with an abnormal event can be determined from the at least two image sequences by using a conventional feature recognition method or a trained model according to the correlation features between the image sequences. In this embodiment, a preset anomaly prediction model obtained in advance through weakly supervised training is used.
When weak supervision training is carried out, a loss function needs to be constructed, the loss function is used for estimating the inconsistency degree between the predicted value and the true value of the model, and generally, the smaller the loss function value is, the better the robustness of the model is. In the training process, parameters of the model can be adjusted through constraint on the loss function, so that a better model can be obtained through training.
In the embodiment of the present disclosure, the features of the training samples are obtained from the training samples according to the descriptions in fig. 1 to fig. 5, then a loss function is constructed based on the obtained features of the training samples and the sample labels, and the parameters of the model are continuously modified to obtain a model with better detection effect. In the embodiment of the present disclosure, the initial model is, for example, a Convolutional Neural Network (CNN) model, a Deep Neural Network (DNN) model, and the like, which is not limited herein.
In one embodiment, the method further comprises:
respectively selecting, from the positive samples and the negative samples in a training sample set, the K sample image sequences with the largest feature gradients and calculating an average feature gradient; wherein K is a positive integer greater than 1;
constructing a loss function according to the average characteristic gradient corresponding to the positive sample and the average characteristic gradient corresponding to the negative sample;
and training based on the loss function to obtain the preset abnormal prediction model.
In the embodiment of the present disclosure, the training sample set includes positive samples and negative samples, where a positive sample refers to a sample in which no abnormal event exists in the image sequences included in the sample, and a negative sample refers to a sample in which an abnormal event exists in the image sequences included in the sample. A sample may be a video, which is further divided into different image sequences; the video corresponds to a label, but the image sequences have no labels. In embodiments of the present disclosure, each video may be likened to a 'bag' and each image sequence to an 'instance', i.e., the 'bag' is labeled but the 'instances' are unlabeled.
In the embodiment of the disclosure, K sample image sequences with large characteristic gradients are respectively selected for a positive sample and a negative sample to calculate an average characteristic gradient, and then a loss function is constructed based on the average characteristic gradient corresponding to the positive sample and the average characteristic gradient corresponding to the negative sample.
If the sample features obtained with the foregoing method for the T image sequences included in one video of the training samples are $\{\phi''_t\}_{t=1}^{T}$, the specific method for constructing the loss function is as follows:

A. The first K image sequences with the largest feature gradients are selected from all the image sequences, and the average feature gradient is calculated according to the following formula (7):

$$g(\phi_{ST}) = \frac{1}{K}\sum_{k=1}^{K} \left\| \phi''_{t_k} \right\|_2 \quad (7)$$

where $\|\phi''_t\|_2$ is the 2-norm of the feature and $t_1, \dots, t_K$ index the K image sequences with the largest norms; the feature gradient in this disclosure is obtained by computing the 2-norm of the feature.
B. Based on the video labels, for an abnormal video with features $\phi_{ST}^{+}$ and a normal video with features $\phi_{ST}^{-}$, the ranking loss is calculated as the following formula (8):

$$\mathcal{L}_{fm} = \max\!\left(0,\; m - g(\phi_{ST}^{+}) + g(\phi_{ST}^{-})\right) \quad (8)$$

where $g(\phi_{ST}^{+})$ is the average feature gradient of the first K image sequences in the abnormal video, $g(\phi_{ST}^{-})$ is the average feature gradient of the first K image sequences in the normal video, and m is a margin.
C. The features of the first K image sequences included in each video are input into the original model to predict anomaly scores, yielding $\{s_{t_k}\}_{k=1}^{K}$ (one prediction score per image sequence), and the cross-entropy loss is calculated based on the predicted anomaly scores and the label corresponding to the video, as in the following formula (9):

$$\mathcal{L}_{ce} = -\left( y \log s + (1 - y)\log(1 - s) \right) \quad (9)$$

where s represents the predicted anomaly score and y represents the label corresponding to the video; for example, the label value of an abnormal video is 1 and the label value of a normal video is 0.
D. A sparsity constraint and a temporal smoothing constraint are introduced, and the total loss function is determined as the following formula (10):

$$\mathcal{L} = \mathcal{L}_{ce} + \lambda_{fm}\,\mathcal{L}_{fm} + \lambda_1 \sum_{t}\left| s_t \right| + \lambda_2 \sum_{t}\left( s_t - s_{t-1} \right)^2 \quad (10)$$

where $\lambda_{fm}$, $\lambda_1$ and $\lambda_2$ are factors used to balance the respective losses, $\sum_t |s_t|$ represents the sparsity constraint, and $\sum_t (s_t - s_{t-1})^2$ represents the temporal smoothing constraint.
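A sketch of the objective (7)-(10) for one abnormal/normal video pair; K, the margin m and the balancing factors are assumed values, and the cross-entropy term is written out for the top-K scores of both videos (y = 1 abnormal, y = 0 normal):

```python
import torch
import torch.nn.functional as F

def avg_feature_gradient(phi: torch.Tensor, k: int) -> torch.Tensor:
    """Formula (7): mean 2-norm ("feature gradient") of the K sequence
    features with the largest norms. phi: (T, D) features of one video."""
    return phi.norm(p=2, dim=-1).topk(k).values.mean()

def total_loss(phi_abn, s_abn, phi_nrm, s_nrm, k=3, m=1.0,
               lam_fm=1.0, lam1=8e-3, lam2=8e-3):
    """Formulas (8)-(10); phi_*: (T, D) sequence features of the abnormal
    and normal video, s_*: (T,) predicted anomaly scores in (0, 1)."""
    # (8) ranking loss: abnormal top-K magnitudes should exceed normal ones
    l_fm = F.relu(m - avg_feature_gradient(phi_abn, k)
                    + avg_feature_gradient(phi_nrm, k))
    # (9) cross entropy on the top-K predicted scores of each video
    eps = 1e-8
    l_ce = (-torch.log(s_abn.topk(k).values + eps).mean()          # y = 1
            - torch.log(1 - s_nrm.topk(k).values + eps).mean())    # y = 0
    # (10) sparsity and temporal smoothness constraints on the score sequence
    l_sparse = s_abn.abs().sum()
    l_smooth = (s_abn[1:] - s_abn[:-1]).pow(2).sum()
    return l_ce + lam_fm * l_fm + lam1 * l_sparse + lam2 * l_smooth
```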
The present disclosure may construct the loss function based on the above steps and train the preset anomaly detection model with it. After the correlation features $\phi_{ST}$ between the image sequences are determined and input into the preset anomaly detection model, the prediction result of each image sequence can be obtained; for example, the prediction result is a prediction score, each prediction score is compared with a preset score threshold, and, for example, the image sequences whose prediction scores are greater than the preset score threshold are determined as target image sequences in which an abnormal event exists.
It can be understood that, in the present disclosure, the abnormal event detection model obtained by the weakly supervised training method is used to process the correlation features of the image sequences to determine the target image sequence with an abnormal event; compared with conventional methods, the generalization capability of the preset abnormal event detection model is better. In addition, compared with a model obtained by an unsupervised method, the supervised training mode has the guidance of training labels, so the accuracy of abnormal event detection is better.
Fig. 7 is a fourth flowchart of an abnormal event detection method according to an embodiment of the present disclosure; as shown in Fig. 7, step S11 in Fig. 1 may include the following steps:
s11a, acquiring a video to be detected;
s11b, determining a difference value between adjacent frame images in the video to be detected;
s11c, determining the image frame in the front time as the tail frame of the image sequence and the image frame in the back time as the head frame of the image sequence adjacent to the image sequence in the adjacent frame images with the difference value larger than the preset difference threshold value.
In this embodiment, the at least two image sequences come from the same video, i.e., the video to be detected. When dividing the video to be detected into image sequences, the present disclosure detects the difference values between adjacent frame images in a clustering-like manner, so that image frames with similar content are grouped into one image sequence.
It should be noted that, when determining the difference value between two adjacent frame images in the video to be detected, the difference value may be determined by, for example, differencing the two adjacent frame images, but the present disclosure does not limit the manner. In addition, the manner in which the abnormal event detection apparatus acquires the at least two image sequences in the present disclosure is not limited to that of this embodiment; for example, the video may be divided in time into image sequences of equal duration, which is not described in detail here.
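A sketch of the segmentation in S11b/S11c, assuming the difference value is the mean absolute pixel difference between adjacent frames (the patent leaves the measure open): every frame whose difference to its predecessor exceeds the threshold becomes the head frame of a new image sequence, and its predecessor the tail frame of the previous one:

```python
import torch

def split_sequences(frames: torch.Tensor, thresh: float):
    """frames: (N, C, H, W) video to be detected. Returns a list of image
    sequences, cut wherever adjacent frames differ by more than `thresh`."""
    diffs = (frames[1:] - frames[:-1]).abs().mean(dim=(1, 2, 3))  # (N-1,)
    # a frame whose difference exceeds the threshold becomes a head frame,
    # the frame before it the tail frame of the previous sequence
    cuts = (diffs > thresh).nonzero().flatten() + 1
    return list(torch.tensor_split(frames, cuts.tolist(), dim=0))
```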
Fig. 8A is a schematic diagram of an abnormal event detection method according to an embodiment of the present disclosure, and Fig. 8B is a schematic diagram of the processing procedure of some of the modules in Fig. 8A according to an embodiment of the present disclosure. The video segments, i.e., the image sequences, identified by L81 in Fig. 8A total 3 image sequences. After each image sequence is input into the multi-scale patch generator L82, each resulting patch is an image block set as proposed in the present disclosure. After each patch is input into the pre-trained feature encoder L83 for feature extraction, patch spatial-relationship modeling can be performed by the module identified by L84. As shown in Fig. 8B, for one image sequence, after the image sequence is input into the multi-scale patch generator L82 (R scales in total), the patches corresponding to each scale are obtained, and the patches of a given scale may comprise a plurality of image block sets. For the patches of a given scale, the first splicing feature corresponding to that scale is obtained after the pre-trained feature encoder identified by L83; then, based on the first splicing feature of the corresponding scale, the correlation between the image block sets of the same scale, i.e., the first feature corresponding to the scale, can be obtained through the patch spatial-relationship modeling identified by L84, as shown in Fig. 8B. Through the patch aggregation module identified by L85, the first features of the same image sequence at different scales, obtained after patch spatial-relationship modeling, are stitched to obtain the corresponding second feature of the image sequence, i.e., one of the T feature segments shown at L86 in Fig. 8A. Then the second features of all image sequences, i.e., the T feature segments shown at L86, are passed through the video temporal-relation module identified by L87 to obtain spatio-temporally modeled features, i.e., the correlation features between image sequences referred to in the present disclosure. Finally, the correlation features are input into the pre-trained classifier L88 to obtain a prediction score for each image sequence, and whether an image sequence contains an abnormal event can be determined based on its prediction score. The pre-trained classifier can be obtained by a weakly supervised training method: a loss function of the model is constructed from the video-level labels of the training samples and the prediction scores of the training samples, and the model parameters are fixed when the loss meets the convergence condition, yielding the trained classifier.
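For orientation, the data flow of Fig. 8A can be summarized in the structural sketch below; every module is a stand-in callable, and the function signature is an assumption made purely for illustration.

```python
# Structural sketch of the Fig. 8A data flow. Every module here is a stand-in
# callable (an assumption), wired in the order L82 -> L83 -> L84 -> L85 -> L87 -> L88.
import torch

def detect(sequences, patch_gen, encoder, spatial_rel, aggregator,
           temporal_rel, classifier):
    second_feats = []
    for seq in sequences:                            # T image sequences
        firsts = []
        for patches in patch_gen(seq):               # R scales of image block sets
            splice = encoder(patches)                # first splicing feature (L83)
            firsts.append(spatial_rel(splice))       # first feature per scale (L84)
        second_feats.append(aggregator(firsts))      # second feature (L85)
    fused = temporal_rel(torch.stack(second_feats))  # correlation features (L87)
    return classifier(fused)                         # prediction score per sequence (L88)
```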
Fig. 9 is a diagram illustrating an abnormal event detection apparatus according to an embodiment of the present disclosure. Referring to fig. 9, the abnormal event detecting apparatus 900 includes:
an obtaining module 901, configured to obtain at least two image sequences; wherein each image sequence comprises at least one frame of image;
a dividing module 902, configured to divide each image sequence by at least two scales to obtain an image block set formed by image blocks at the same position in all image frames at the same scale;
a first determining module 903, configured to determine a correlation characteristic between the image sequences based on the image block set of each image sequence;
a second determining module 904, configured to determine, according to a correlation characteristic between the image sequences, a target image sequence with an abnormal event in the at least two image sequences.
In some embodiments, the first determining module 903 is configured to, for each of the image sequences, obtain a first feature corresponding to a scale based on each image block set at the same scale; wherein the first feature comprises correlation among image block sets of the same scale; fusing the first features corresponding to all scales in the same image sequence to obtain a second feature of each image sequence; determining the correlation feature between the image sequences based on the second feature of the image sequences.
In some embodiments, the first determining module 903 is configured to perform feature extraction on each image block set at the same scale to obtain the features corresponding to the image block sets; splice the features of the image block sets of the same scale to obtain a first splicing feature corresponding to the scale; and construct, based on the first splicing feature corresponding to the scale, an association relationship between the image block sets of the same scale represented by the first splicing feature by utilizing an attention mechanism and convolution processing, to obtain the first feature corresponding to the scale.
In some embodiments, the first determining module 903 is configured to determine a weight matrix based on the self-attention mechanism and the first splicing feature, wherein the weight matrix comprises weight values representing the probability that each image block set of the same scale is abnormal; obtain a weighted feature based on the weight matrix and the first splicing feature; perform convolution processing on the first splicing feature to obtain a convolved feature; and obtain the first feature based on the weighted feature, the convolved feature, and the first splicing feature.
In some embodiments, the first determining module 903 is configured to perform dimension reduction on the first splicing feature to obtain a dimension-reduced first splicing feature; perform convolution on the dimension-reduced first splicing feature with a preset first convolution kernel to obtain a first convolution result; perform convolution on the dimension-reduced first splicing feature with a preset second convolution kernel to obtain a second convolution result; and determine the weight matrix by utilizing the self-attention mechanism according to the result of multiplying the first convolution result by the transpose of the second convolution result.
In some embodiments, the first determining module 903 is configured to convolve the dimension-reduced first splicing feature with a preset third convolution kernel to obtain a third convolution result; multiply the weight matrix by the third convolution result to obtain a weighting matrix; and determine the sum of the result of convolving the weighting matrix with a preset fourth convolution kernel and the dimension-reduced first splicing feature as the weighted feature.
In some embodiments, the first determining module 903 is configured to convolve the first splicing feature with at least two hole convolution kernels to obtain a convolution result corresponding to each hole convolution kernel, wherein the dilation rates of at least two of the hole convolution kernels are different; and splice the convolution results corresponding to the hole convolution kernels to obtain the convolved feature.
In some embodiments, the first determining module 903 is configured to splice the weighted feature and the convolved feature, and add the spliced result to the first splicing feature to obtain the first feature.
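Gathering the steps described above (dimension reduction, the self-attention weight matrix, the weighted feature, the hole convolutions, and the final splicing and addition), a minimal PyTorch sketch might look as follows. All channel counts, the choice of two hole convolution kernels with dilation rates 1 and 2, and the kernel sizes are assumptions of this sketch; the input channel count c is assumed divisible by 4.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchRelationBlock(nn.Module):
    """Builds the first feature for one scale from the first splicing
    feature x of shape (B, C, N), N being the number of image block sets."""
    def __init__(self, c: int):
        super().__init__()
        r = c // 2                               # reduced channel count (assumption)
        self.reduce = nn.Conv1d(c, r, 1)         # dimension reduction
        self.conv_q = nn.Conv1d(r, r, 1)         # preset first convolution kernel
        self.conv_k = nn.Conv1d(r, r, 1)         # preset second convolution kernel
        self.conv_v = nn.Conv1d(r, r, 1)         # preset third convolution kernel
        self.conv_o = nn.Conv1d(r, r, 1)         # preset fourth convolution kernel
        self.dilated = nn.ModuleList([           # hole convolution kernels with
            nn.Conv1d(c, c // 4, 3, padding=d, dilation=d) for d in (1, 2)
        ])                                       # different dilation rates

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.reduce(x)                                      # (B, r, N)
        q, k, v = self.conv_q(z), self.conv_k(z), self.conv_v(z)
        # weight matrix: softmax over the product of the first convolution
        # result and the transpose of the second convolution result
        w = F.softmax(q.transpose(1, 2) @ k, dim=-1)            # (B, N, N)
        weighted = self.conv_o(v @ w.transpose(1, 2)) + z       # weighted feature
        convolved = torch.cat([d(x) for d in self.dilated], 1)  # convolved feature
        return torch.cat([weighted, convolved], dim=1) + x      # first feature (B, C, N)
```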
In some embodiments, the first determining module 903 is configured to perform feature extraction on each image block set in the same scale, and obtain a feature corresponding to the image block set and including timing information between the image blocks in the image block set.
In some embodiments, the first determining module 903 is configured to reconstruct the first features of the same scale according to the position relationship of each image block set to obtain a reconstruction feature corresponding to the scale; after the reconstruction feature corresponding to each scale is convolved with a preset fifth convolution kernel, convert it into a one-dimensional feature vector through a full connection layer; and accumulate the one-dimensional feature vectors of all scales to obtain the second feature of each image sequence.
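A sketch of this multi-scale fusion, assuming square patch grids per scale and one full connection layer per scale (neither detail is fixed by the disclosure):

```python
import torch
import torch.nn as nn

class ScaleFusion(nn.Module):
    """grids: patch grid side length per scale, e.g. [2, 4] for 2x2 and 4x4."""
    def __init__(self, c: int, grids: list[int], out_dim: int):
        super().__init__()
        self.grids = grids
        self.conv5 = nn.Conv2d(c, c, 3, padding=1)       # preset fifth convolution kernel
        self.fcs = nn.ModuleList([nn.Linear(c * g * g, out_dim) for g in grids])

    def forward(self, first_feats):                      # per scale: (B, C, g*g)
        out = 0
        for f, fc, g in zip(first_feats, self.fcs, self.grids):
            b, c, _ = f.shape
            grid = f.reshape(b, c, g, g)                 # reconstruct by patch position
            out = out + fc(self.conv5(grid).flatten(1))  # convolve, flatten, project
        return out                                       # second feature (B, out_dim)
```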
In some embodiments, the first determining module 903 is configured to splice the second features of the image sequences to obtain a second splicing feature; and construct, based on the second splicing feature and by using a self-attention mechanism and convolution processing, an association relationship between the different image sequences characterized by the second splicing feature, so as to determine the correlation features between the image sequences.
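Since the same attention-plus-convolution pattern is applied here across image sequences rather than across image block sets, one hedged reading is to reuse the PatchRelationBlock sketch above on the stacked second features; the reuse and the dimensions below are assumptions.

```python
import torch  # PatchRelationBlock is the sketch defined a few paragraphs above

second_splicing = torch.randn(1, 512, 8)           # (B, D, T): 8 second features, D = 512
temporal_rel = PatchRelationBlock(c=512)           # assumed: identical block, sequence axis
correlation_feats = temporal_rel(second_splicing)  # correlation features, (1, 512, 8)
```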
In some embodiments, the second determining module 904 is configured to detect a correlation feature between the image sequences based on a preset abnormal prediction model, and obtain a prediction result of each image sequence; the preset abnormal prediction model is a model obtained by training by adopting a weak supervision training method; and determining the target image sequence with the abnormal event according to the prediction result of each image sequence.
In some embodiments, the apparatus further comprises:
a calculating module 905, configured to select, from the positive samples and the negative samples in the training sample set respectively, the K sample image sequences with the largest feature gradients and calculate an average feature gradient; wherein K is a positive integer greater than 1;
a constructing module 906, configured to construct a loss function according to the average characteristic gradient corresponding to the positive sample and the average characteristic gradient corresponding to the negative sample;
a training module 907, configured to train with the loss function to obtain the preset anomaly prediction model.
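A sketch of this loss under stated assumptions: the L2 norm of each sequence feature is used as a stand-in for the "feature gradient" (the disclosure does not pin the quantity down here), and a margin ranking loss is one plausible construction.

```python
import torch
import torch.nn.functional as F

def topk_margin_loss(pos_feats: torch.Tensor, neg_feats: torch.Tensor,
                     k: int = 3, margin: float = 1.0) -> torch.Tensor:
    """pos_feats/neg_feats: (T, D) features of an abnormal / normal video, T >= k."""
    pos_avg = pos_feats.norm(dim=1).topk(k).values.mean()  # average over top-K positives
    neg_avg = neg_feats.norm(dim=1).topk(k).values.mean()  # average over top-K negatives
    return F.relu(margin - pos_avg + neg_avg)              # push positives above negatives
```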
In some embodiments, the obtaining module 901 is configured to acquire a video to be detected; determine a difference value between adjacent frame images in the video to be detected; and, for adjacent frame images whose difference value is greater than a preset difference threshold, determine the temporally earlier image frame as the tail frame of one image sequence and the temporally later image frame as the head frame of the next, adjacent image sequence.
The above descriptions of the apparatus embodiments are similar to those of the method embodiments and have similar beneficial effects. For technical details not disclosed in the apparatus embodiments of the present disclosure, refer to the description of the method embodiments of the present disclosure.
It should be noted that, in the embodiments of the present disclosure, if the above-mentioned abnormal event detection method is implemented in the form of a software functional module and is sold or used as a stand-alone product, it may also be stored in a computer-readable storage medium. Based on such understanding, the technical solutions of the embodiments of the present disclosure may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the methods described in the embodiments of the present disclosure. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read Only Memory (ROM), a magnetic disk, or an optical disk. Thus, embodiments of the present disclosure are not limited to any specific combination of hardware and software.
Correspondingly, the embodiment of the present disclosure provides a computer device, which includes a memory and a processor, wherein the memory stores a computer program that can be run on the processor, and the processor implements the steps of the method when executing the program.
Correspondingly, the disclosed embodiments provide a computer-readable storage medium, on which a computer program is stored, which, when executed by a processor, implements the steps in the above-described method. The computer readable storage medium may be transitory or non-transitory.
Accordingly, embodiments of the present disclosure provide a computer program product comprising a non-transitory computer-readable storage medium storing a computer program which, when read and executed by a computer, implements some or all of the steps of the above method. The computer program product may be embodied in hardware, software, or a combination thereof. In an alternative embodiment, the computer program product is embodied as a computer storage medium; in another alternative embodiment, the computer program product is embodied as a software product, such as a Software Development Kit (SDK).
Here, it should be noted that: the above description of the storage medium, computer program product and device embodiments, like the description of the method embodiments above, has similar advantageous effects as the method embodiments. For technical details not disclosed in the embodiments of the storage medium, the computer program product and the device of the present disclosure, reference is made to the description of the embodiments of the method of the present disclosure.
It should be noted that fig. 10 is a schematic diagram of a hardware entity of a computer device in an embodiment of the present disclosure, and as shown in fig. 10, the hardware entity of the computer device 1000 includes: a processor 1001, a communication interface 1002, and a memory 1003, wherein:
the processor 1001 generally controls the overall operation of the computer device 1000.
The communication interface 1002 may enable the computer device to communicate with other terminals or servers via a network.
The Memory 1003 is configured to store instructions and applications executable by the processor 1001, and may also buffer data (e.g., image data, audio data, voice communication data, and video communication data) to be processed or already processed by the processor 1001 and the modules in the computer device 1000; it may be implemented by a FLASH memory (FLASH) or a Random Access Memory (RAM). Data transmission among the processor 1001, the communication interface 1002, and the memory 1003 can be performed via the bus 1004.
It should be appreciated that reference throughout this specification to "one embodiment" or "an embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the present disclosure. Thus, the appearances of the phrases "in one embodiment" or "in an embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. It should be understood that, in various embodiments of the present disclosure, the sequence numbers of the above-mentioned processes do not mean the execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation on the implementation process of the embodiments of the present disclosure. The above-mentioned serial numbers of the embodiments of the present disclosure are merely for description and do not represent the merits of the embodiments.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
In the several embodiments provided in the present disclosure, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described device embodiments are merely illustrative, for example, the division of the unit is only a logical functional division, and there may be other division ways in actual implementation, such as: multiple units or components may be combined, or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the coupling, direct coupling or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between the devices or units may be electrical, mechanical or other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units; can be located in one place or distributed on a plurality of network units; some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, all the functional units in the embodiments of the present disclosure may be integrated into one processing unit, or each unit may be separately regarded as one unit, or two or more units may be integrated into one unit; the integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
Those of ordinary skill in the art will understand that: all or part of the steps for realizing the method embodiments can be completed by hardware related to program instructions, the program can be stored in a computer readable storage medium, and the program executes the steps comprising the method embodiments when executed; and the aforementioned storage medium includes: a removable storage device, a ROM, a magnetic or optical disk, or other various media that can store program code.
Alternatively, the integrated unit of the present disclosure may be stored in a computer-readable storage medium if it is implemented in the form of a software functional module and sold or used as a separate product. Based on such understanding, the technical solutions of the present disclosure may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the methods described in the embodiments of the present disclosure. And the aforementioned storage medium includes: a removable storage device, a ROM, a magnetic or optical disk, or other various media that can store program code.
The above description is only an embodiment of the present disclosure, but the scope of the present disclosure is not limited thereto, and any person skilled in the art can easily conceive of changes or substitutions within the technical scope of the present disclosure, and all the changes or substitutions should be covered by the scope of the present disclosure. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims (17)

1. A method of abnormal event detection, the method comprising:
acquiring at least two image sequences; wherein each image sequence comprises at least one frame of image;
dividing each image sequence into at least two scales to obtain an image block set consisting of image blocks at the same position in all image frames at the same scale;
determining correlation characteristics between the image sequences based on the image block sets of the image sequences;
and determining a target image sequence with an abnormal event in the at least two image sequences according to the correlation characteristics between the image sequences.
2. The method of claim 1, wherein determining a correlation characteristic between each of the image sequences based on the set of image blocks of each of the image sequences comprises:
aiming at each image sequence, based on each image block set under the same scale, obtaining a first feature corresponding to the scale; wherein the first feature comprises correlation among image block sets of the same scale;
fusing the first features corresponding to all scales in the same image sequence to obtain a second feature of each image sequence;
determining the correlation feature between the image sequences based on the second feature of the image sequences.
3. The method according to claim 2, wherein obtaining a first feature corresponding to a scale based on each image block set at the same scale includes:
performing feature extraction on each image block set under the same scale to obtain features corresponding to the image block sets;
splicing the features of the image block set with the same scale to obtain a first splicing feature corresponding to the scale;
and constructing, based on the first splicing feature corresponding to the scale, an association relation between the image block sets of the same scale represented by the first splicing feature by utilizing an attention mechanism and convolution processing, to obtain the first feature corresponding to the scale.
4. The method according to claim 3, wherein the constructing, based on the first splicing feature corresponding to the scale, an association relationship between image block sets of the same scale represented by the first splicing feature by using an attention mechanism and convolution processing, to obtain the first feature corresponding to the scale, comprises:
determining a weight matrix based on the self-attention mechanism and the first splicing feature; wherein the weight matrix comprises weight values representing the probability that each image block set of the same scale is abnormal;
obtaining a weighted feature based on the weight matrix and the first splicing feature;
performing convolution processing on the first splicing feature to obtain a convolved feature;
obtaining the first feature based on the weighted feature, the convolved feature, and the first splicing feature.
5. The method of claim 4, wherein determining a weight matrix based on the self-attention mechanism and the first splicing feature comprises:
performing dimension reduction processing on the first splicing feature to obtain a dimension-reduced first splicing feature;
performing convolution on the dimension-reduced first splicing feature by using a preset first convolution kernel to obtain a first convolution result;
performing convolution on the dimension-reduced first splicing feature by using a preset second convolution kernel to obtain a second convolution result;
and determining the weight matrix by utilizing the self-attention mechanism according to a result obtained by multiplying the first convolution result and the transpose of the second convolution result.
6. The method of claim 5, wherein obtaining the weighted feature based on the weight matrix and the first splicing feature comprises:
performing convolution on the dimension-reduced first splicing feature by using a preset third convolution kernel to obtain a third convolution result;
multiplying the weight matrix by the third convolution result to obtain a weighting matrix;
and determining the sum of the result of convolving the weighting matrix with a preset fourth convolution kernel and the dimension-reduced first splicing feature as the weighted feature.
7. The method of claim 4, wherein performing convolution processing on the first splicing feature to obtain the convolved feature comprises:
convolving the first splicing feature by utilizing at least two hole convolution kernels to obtain a convolution result corresponding to each of the hole convolution kernels; wherein the dilation rates of at least two of said hole convolution kernels are different;
and splicing the convolution results corresponding to the hole convolution kernels to obtain the convolved feature.
8. The method of claim 4, wherein obtaining the first feature based on the weighted feature, the convolved feature, and the first splicing feature comprises:
and splicing the weighted feature and the convolved feature, and adding the spliced result to the first splicing feature to obtain the first feature.
9. The method according to claim 3, wherein the performing feature extraction on each image block set under the same scale to obtain features corresponding to the image block sets comprises:
and performing feature extraction on each image block set under the same scale to obtain the corresponding features of the image block set, wherein the features comprise time sequence information among the image blocks in the image block set.
10. The method according to claim 2, wherein the fusing the first features corresponding to the respective scales in the same image sequence to obtain the second feature of each image sequence comprises:
reconstructing the first features of the same scale according to the position relation of each image block set to obtain reconstruction features corresponding to the scale;
after the reconstruction features corresponding to the scales are convolved with a preset fifth convolution kernel, converting the reconstruction features into one-dimensional feature vectors through a full connection layer;
and accumulating the one-dimensional feature vectors of all scales to obtain a second feature of each image sequence.
11. The method of claim 2, wherein determining the correlation feature between the image sequences based on the second feature of the image sequences comprises:
splicing the second features of the image sequences to obtain second splicing features;
and constructing, based on the second splicing feature and by using a self-attention mechanism and a hole convolution, an association relation between the different image sequences characterized by the second splicing feature, and determining the correlation feature between the image sequences.
12. The method according to any one of claims 1 to 11, wherein the determining, from the correlation characteristics between the image sequences, a target image sequence in which an abnormal event exists in the at least two image sequences comprises:
detecting correlation characteristics among the image sequences based on a preset abnormal prediction model to obtain a prediction result of each image sequence; the preset abnormal prediction model is a model obtained by training by adopting a weak supervision training method;
and determining the target image sequence with the abnormal event according to the prediction result of each image sequence.
13. The method of claim 12, further comprising:
respectively selecting, from positive samples and negative samples in a training sample set, the K sample image sequences with the largest feature gradients to calculate an average feature gradient; wherein K is a positive integer greater than 1;
constructing a loss function according to the average characteristic gradient corresponding to the positive sample and the average characteristic gradient corresponding to the negative sample;
and training based on the loss function to obtain the preset abnormal prediction model.
14. The method of any one of claims 1 to 11, wherein said acquiring at least two image sequences comprises:
acquiring a video to be detected;
determining a difference value between adjacent frame images in the video to be detected;
and, for adjacent frame images whose difference value is greater than a preset difference threshold, determining the temporally earlier image frame as the tail frame of one image sequence and the temporally later image frame as the head frame of the next, adjacent image sequence.
15. An abnormal event detection apparatus, comprising:
an acquisition module for acquiring at least two image sequences; wherein each image sequence comprises at least one frame of image;
the dividing module is used for dividing each image sequence by at least two scales to obtain an image block set consisting of image blocks at the same position in all image frames at the same scale;
a first determining module, configured to determine a correlation characteristic between the image sequences based on the image block set of each image sequence;
and the second determining module is used for determining a target image sequence with an abnormal event in the at least two image sequences according to the correlation characteristics between the image sequences.
16. A computer device, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to perform the abnormal event detection method of any one of claims 1 to 14.
17. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the abnormal event detection method according to any one of claims 1 to 14.
CN202210103096.9A 2022-01-27 2022-01-27 Abnormal event detection method and device, computer equipment and storage medium Pending CN114511810A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202210103096.9A CN114511810A (en) 2022-01-27 2022-01-27 Abnormal event detection method and device, computer equipment and storage medium
PCT/CN2022/127087 WO2023142550A1 (en) 2022-01-27 2022-10-24 Abnormal event detection method and apparatus, computer device, storage medium, computer program, and computer program product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210103096.9A CN114511810A (en) 2022-01-27 2022-01-27 Abnormal event detection method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114511810A true CN114511810A (en) 2022-05-17

Family

ID=81549990

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210103096.9A Pending CN114511810A (en) 2022-01-27 2022-01-27 Abnormal event detection method and device, computer equipment and storage medium

Country Status (2)

Country Link
CN (1) CN114511810A (en)
WO (1) WO2023142550A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023142550A1 (en) * 2022-01-27 2023-08-03 上海商汤智能科技有限公司 Abnormal event detection method and apparatus, computer device, storage medium, computer program, and computer program product

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118135586B (en) * 2024-05-06 2024-07-02 西安航天动力试验技术研究所 Valve opening and closing state judging method, system, equipment and storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107967440B (en) * 2017-09-19 2021-03-30 北京工业大学 Monitoring video abnormity detection method based on multi-region variable-scale 3D-HOF
CN110795599B (en) * 2019-10-18 2022-04-15 山东师范大学 Video emergency monitoring method and system based on multi-scale graph
US11900679B2 (en) * 2019-11-26 2024-02-13 Objectvideo Labs, Llc Image-based abnormal event detection
CN113780238B (en) * 2021-09-27 2024-04-05 京东科技信息技术有限公司 Abnormality detection method and device for multi-index time sequence signal and electronic equipment
CN114511810A (en) * 2022-01-27 2022-05-17 深圳市商汤科技有限公司 Abnormal event detection method and device, computer equipment and storage medium

Also Published As

Publication number Publication date
WO2023142550A1 (en) 2023-08-03

Similar Documents

Publication Publication Date Title
WO2019237846A1 (en) Image processing method and apparatus, face recognition method and apparatus, and computer device
CN111859023B (en) Video classification method, apparatus, device and computer readable storage medium
CN105228033B (en) A kind of method for processing video frequency and electronic equipment
CN110503076B (en) Video classification method, device, equipment and medium based on artificial intelligence
CN109684969B (en) Gaze position estimation method, computer device, and storage medium
WO2023142550A1 (en) Abnormal event detection method and apparatus, computer device, storage medium, computer program, and computer program product
Ma et al. Ppt: token-pruned pose transformer for monocular and multi-view human pose estimation
CN111539290B (en) Video motion recognition method and device, electronic equipment and storage medium
CN112597941A (en) Face recognition method and device and electronic equipment
CN112488923A (en) Image super-resolution reconstruction method and device, storage medium and electronic equipment
CN114339409B (en) Video processing method, device, computer equipment and storage medium
CN112883227B (en) Video abstract generation method and device based on multi-scale time sequence characteristics
CN112818995B (en) Image classification method, device, electronic equipment and storage medium
US12026857B2 (en) Automatically removing moving objects from video streams
Zhang et al. Retargeting semantically-rich photos
CN112818955A (en) Image segmentation method and device, computer equipment and storage medium
CN113592940A (en) Method and device for determining position of target object based on image
CN111177460B (en) Method and device for extracting key frame
Su et al. Efficient and accurate face alignment by global regression and cascaded local refinement
CN113205072A (en) Object association method and device and electronic equipment
CN117237547A (en) Image reconstruction method, reconstruction model processing method and device
CN117037244A (en) Face security detection method, device, computer equipment and storage medium
CN111126177B (en) Method and device for counting number of people
CN113221971A (en) Multi-scale crowd counting method and system based on front and back feature fusion
CN111429213A (en) Method, device and equipment for simulating fitting of clothes

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination