CN117132919A - Multi-scale high-dimensional feature analysis unsupervised learning video anomaly detection method - Google Patents

Multi-scale high-dimensional feature analysis unsupervised learning video anomaly detection method

Info

Publication number
CN117132919A
Authority
CN
China
Prior art keywords
feature
data
convolution
video
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311010650.XA
Other languages
Chinese (zh)
Inventor
汪炜杰
樊谨
陈淼
陈琪凯
杨勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN202311010650.XA
Publication of CN117132919A
Pending legal-status Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/088 Non-supervised learning, e.g. competitive learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a multi-scale high-dimensional feature analysis unsupervised learning video anomaly detection method. The method adopts a parallel network that extracts features with convolution and self-attention as its core module and an end-to-end auto-encoder architecture overall: the representation features and spatio-temporal features of a video sequence are extracted by a convolution layer and a Transformer attention calculation layer, all features are fused, the multi-scale features are decoded and predicted in combination with a memory network, and a predicted video frame is finally output. By using an end-to-end network to extract the representation features and spatio-temporal features of video sequence images, the method improves video frame prediction precision, enhances the diversity of the feature information of normal training samples, and improves the anomaly score evaluation of video anomaly detection.

Description

Multi-scale high-dimensional feature analysis unsupervised learning video anomaly detection method
Technical Field
The invention relates to the technical field of anomaly detection, in particular to a multi-scale high-dimensional feature analysis unsupervised learning video anomaly detection method.
Background
Video anomaly detection is the task of detecting abnormal events in video; it is a long-standing research topic in academia and has important application value in industry. Because abnormal events are defined differently in different scenes, any event that does not conform to normal behavior logic can be regarded as abnormal. Since processing video is difficult and complex, video anomaly detection is extremely challenging. It is impossible to collect a dataset containing all precisely defined abnormal events, so the problem cannot be solved by standard classification methods. Existing work typically trains a model on normal data to learn the distribution of normal behavior, and during testing the model determines anomalies by comparing the distribution gap between a given test sample and the training samples.
In recent years, convolutional neural network-based methods have been widely used in Auto-Encoder (AE) models for video anomaly detection, mainly because the convolution operation captures local features as effective image representations in a hierarchical manner. Despite their advantages in local feature extraction, however, convolutional neural networks have difficulty capturing global features of video, such as the spatio-temporal relationships between successive frames and the feature dependencies among normal-sample video data. As the Vision Transformer architecture based on the self-attention mechanism has shown good performance in computer vision, more and more methods apply self-attention to vision tasks such as image classification. By means of the self-attention mechanism and multi-layer perceptron structure, the Vision Transformer models complex spatial sequence relationships and long-range feature dependencies and thus constructs global representation features, and it has fewer inductive-bias limitations than convolutional neural networks. Although convolution kernels are designed to capture short-range spatio-temporal information, they cannot model dependencies beyond the receptive field; deep stacking of convolutions can expand the receptive field, but such methods capture long-range dependencies only by aggregating shorter-range information and therefore have clear limitations. In contrast, by directly comparing the feature information of all spatio-temporal positions, an attention mechanism can capture global dependencies beyond the empirical range of traditional convolution filtering. However, the Vision Transformer ignores local feature details, so that when handling image-type frame sequences the Transformer cannot attend sensitively to weak changes in the video.
Disclosure of Invention
The invention aims to provide a multi-scale high-dimensional feature analysis unsupervised learning video anomaly detection method, which improves the video frame prediction precision, enhances the diversity of training normal sample feature information and improves the anomaly score evaluation of video anomaly detection.
In order to achieve the above purpose, the invention provides a multi-scale high-dimensional feature analysis unsupervised learning video anomaly detection method, which comprises the following steps:
step S1: data preprocessing, namely dividing a test set and a training set video into image frames;
step S2: according to the training data set obtained in the step S1, carrying out batch processing on training set data, and transmitting the training set data into a self-encoder model, wherein each group of data comprises 5 continuous video frames;
step S3: the data are respectively transmitted into a convolution channel encoder and a Transformer channel encoder (Encoder), and the features of the two are fused to serve as the representation features and spatio-temporal features;
step S4: inputting the features into a Memory Network (Memory Network), and performing normal sample feature Memory storage and updating on the features obtained in the step S3 through Softmax function calculation to obtain diversified normal event data;
step S5: transmitting the updated and stored video features into a Decoder (Decoder) to generate predicted frame data;
step S6: calculating the Mean Square Error (MSE) between the predicted frame generated in step S5 and the real frame, carrying out back propagation, and updating network parameters to obtain a trained self-encoder model;
step S7: testing the self-encoder model with the verification data set;
step S8: calculating the Mean Square Error (MSE) between the predicted frames generated from the verification data set and the corresponding real frames, and averaging the MSE over all groups of data to obtain the MSE error on the verification data set;
step S9: repeating the steps S2 to S8 until the Mean Square Error (MSE) obtained in the step S8 no longer decreases significantly;
step S10: loading the network parameter model from the training task into the test task, transmitting the test data set into the model for anomaly score calculation, and finally obtaining the evaluation score by calculating the area under the ROC curve (AUC).
Preferably, step S2 specifically includes:
step S21: setting, as required, the size (width and height) of each group of images, the length of the data sequence, and the sequence step length;
step S22: grouping by a sliding window mechanism, wherein the window length is the sequence length and the window moves by one frame each time;
step S23: after the training data grouping is completed, the same sliding window data processing is carried out on the test set data.
Preferably, the end-to-end self-encoder model in step S3 is composed of a convolution channel encoder, a Transformer channel encoder, a feature fusion module, a memory network, and a decoder (Decoder).
Preferably, the specific method of step S3 is as follows:
the encoder is divided into a convolution channel encoder module and a Transformer channel encoder module;
the core module of the convolution channel encoder is a convolutional neural network, and N layers of convolution operations are used to perform feature extraction on the image; the convolution channel encoder module formula is as follows:
F_c^i = ReLU(BN(Conv(F_p^(i-1)))), with F_p^0 = F
F_p^i = MaxPool(F_c^i)
wherein F represents an input training sample; F_c^i represents the representation features extracted by convolution, i denoting the i-th layer of convolution feature extraction, BN is a feature normalization function, and ReLU is an activation function; F_p^i is the feature after resolution reduction, and the MaxPool function is a maximum pooling function;
the core of the Transformer channel encoder is self-attention mechanism calculation: first, the video sequence frames are divided frame by frame into N×N image blocks and normalized by LayerNorm (LN); sequence features are extracted through Sequential Attention (SA); dimension transformation is performed through LayerNorm (LN) and a multi-layer perceptron (MLP); frame-global feature self-attention extraction (Attention, AT) is then performed; and finally the results are spliced and output, the overall formulas being as follows:
F_{1...p} = {F_1, ..., F_p} = Patch(F)
F_SA = MLP(LN(SA(LN(F_{1...p}))))
F_t = AT(F_SA) ⊕ F_SA
wherein F_{1...p} represents the image blocks after F is segmented, F_SA represents the feature result after the sequential attention calculation, F_t represents the spatio-temporal features connected after the self-attention computation and the sequential attention computation, and ⊕ represents a connection (concatenation) symbol;
the feature dimensions of the convolution operation and the Transformer differ: the image block dimension in the Transformer is (1+L)×E, wherein 1 and L are respectively the identification position information and the number of the image blocks, and E represents the embedding dimension;
in the convolution channel the feature dimensions are H×W×C, corresponding to height, width and number of channels. Therefore, in the Transformer channel a 1×1 convolution is first used to align the spatial dimensions of the feature information with the corresponding stage of the convolution layer, BatchNorm (BN) regularization is then applied, and the Interpolate operation finally restores the features to the corresponding feature resolution; in the convolution channel, the dimension of the feature information captured by each convolution block is reduced by an average pooling operation, and feature dimension conversion and LayerNorm (LN) regularization then make it identical to the feature dimensions in the Transformer channel. The overall conversion formulas are as follows:
F′_c = Interpolate(Conv(Reshape(F_t)))
F_ct = Concat(F′_c, F_c)
wherein F_t represents the spatio-temporal features obtained from the Transformer channel, F′_c is the converted convolution-format feature, F_c is the representation feature extracted by the convolution channel, and F_ct represents the final feature data output by the encoder; the Reshape operation represents feature format conversion, the Conv operation samples the input to a given size scale to unify the feature format, and Concat is a splicing function.
Preferably, the specific method of the memory network module in step S4 is as follows:
in the memory network, M prototype features are used to record normal data, each item being defined as M_t (t = 1, ..., M). While the video frame features output by the encoder (Encoder) are passed through the memory network to the decoder (Decoder), a Softmax operation is used to obtain matching weights w_t that capture the relationship between M_t and the encoder feature F_ct; the overall formula is as follows:
w_t = exp(F_ct · M_t) / Σ_{t′=1}^{M} exp(F_ct · M_{t′})
Preferably, the Mean Square Error (MSE) in step S6 is formulated as follows:
MSE = (1/n) Σ_{i=1}^{n} (F̂_i - F_i)²
wherein F̂ is the predicted frame, F is the real frame, and n represents the video sequence length.
Therefore, the multi-scale high-dimensional feature analysis unsupervised learning video anomaly detection method of the present invention has the following beneficial effects:
(1) The representation features and spatio-temporal features of the video sequence images are extracted, the precision of the predicted frames is improved, and the normal-sample feature information of the model is enhanced;
(2) Using the spatio-temporal features together with the representation features improves the fitting ability for sequence fluctuations of the video images, increases the anomaly detection score of the model, and greatly improves the model's effect on video anomaly detection.
The technical scheme of the invention is further described in detail through the drawings and the embodiments.
Drawings
FIG. 1 is a schematic diagram of the overall structure of an embodiment of the present invention;
FIG. 2 is a schematic diagram of a convolutional channel encoder according to an embodiment of the present invention;
FIG. 3 is a block diagram of a Transformer channel encoder according to an embodiment of the present invention;
FIG. 4 is a block diagram of a feature fusion module according to an embodiment of the present invention;
FIG. 5 is a diagram of a memory network module according to an embodiment of the present invention;
FIG. 6 is a comparison of the AUC scores of an embodiment of the present invention with a plurality of existing methods on the four public data sets UCSD Ped1, UCSD Ped2, CUHK Avenue and ShanghaiTech.
Detailed Description
Examples
A multi-scale high-dimensional feature analysis unsupervised learning video anomaly detection method based on end-to-end architecture comprises the following steps:
step S1: and (3) preprocessing data, namely dividing the video of the test set and the training set into image frames.
Step S2: the training data set obtained in the step 1 is used for carrying out batch processing on training set data and transmitting the training set data into a self-encoder model, each group of data comprises 5 continuous video frames, wherein the first 4 frames are used as characteristic extraction information data for realizing frame prediction, and the last 1 frames are used as real frames and predicted frames for carrying out difference comparison;
the method comprises the following specific steps: and selecting a proper public video anomaly detection data set, and carrying out video frame segmentation on the data set to adapt to the requirements of a model on a data format. Firstly, setting the size of each group of images according to the requirement, and respectively corresponding the width and height data of each group of images and the sequence step length of the data sequence. The sliding window mechanism is adopted for grouping, the window length is the sequence length, and each time the window moves by one bit, namely, only one bit of difference between two adjacent groups of data is adopted. After the training data packet is completed, the same sliding window data processing is performed on the test set data.
As shown in fig. 1, the overall structure of the present invention is shown. The data processing and dividing part is at the entrance of the structure of the invention and is responsible for carrying out preliminary processing on the original data to form a data structure required by the prediction model.
Step S3: the data are respectively transmitted into a convolution channel and a converter channel coder (Encoder), and the two characteristics are fused after convolution calculation and self-attention calculation to be the final performance characteristic and the space-time characteristic;
the end-to-end self-encoder model in step S3 is composed of a convolutional channel encoder, a transducer channel encoder, a feature fusion module, a memory network, and a Decoder. The convolution channel encoder needs to input each group of image data, and uses convolution operation to extract the expression characteristics of the image; the transducer channel encoder needs to divide each group of input data into a plurality of image blocks, and uses a self-attention mechanism to perform attention calculation on each image block; features extracted from the two channels are subjected to feature splicing through a feature fusion module to form complete features with performance and space time; the memory network is responsible for receiving the extracted features and storing the features into the memory block, and updating the latest memory features; the decoder analyzes and reconstructs the characteristics, the real frames in each group of data are used as correct results to be compared with the generated predicted frames finally output by the model, and the errors between the two frames are calculated.
As shown in fig. 2, which shows the overall structure of the convolution feature extraction of the present invention, the convolution encoder receives the image sequence in each group of data obtained in step S2 and uses N layers of convolution operations to perform feature extraction on the images. The convolution channel encoder module formula is as follows:
F_c^i = ReLU(BN(Conv(F_p^(i-1)))), with F_p^0 = F
F_p^i = MaxPool(F_c^i)
wherein F represents an input training sample; F_c^i represents the representation features extracted by convolution, i denoting the i-th layer of convolution feature extraction, BN is a feature normalization function, and ReLU is an activation function; F_p^i is the feature after resolution reduction, and the MaxPool function is a maximum pooling function. For the pooling, a two-dimensional sliding window is set and slides one step at a time; the maximum of all image pixel values within the window is taken as the new pixel unit.
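By way of illustration, a minimal PyTorch sketch of one Conv-BN-ReLU-MaxPool stage of such a convolution channel encoder is given below; the layer count, channel widths and input size are illustrative assumptions rather than the configuration claimed by the invention.

import torch
import torch.nn as nn

# One stage of the convolution channel encoder described above:
# Conv -> BatchNorm -> ReLU extracts representation features (F_c^i), and
# MaxPool halves the spatial resolution (F_p^i).
class ConvEncoderStage(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),
        )

    def forward(self, x):
        return self.block(x)

# Stacking N = 3 stages; input is 4 stacked grayscale frames (4 channels).
encoder = nn.Sequential(ConvEncoderStage(4, 64),
                        ConvEncoderStage(64, 128),
                        ConvEncoderStage(128, 256))
features = encoder(torch.randn(2, 4, 256, 256))   # -> (2, 256, 32, 32)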
As shown in fig. 3, which illustrates the overall structure of the Transformer feature extraction of the present invention, the core module of the Transformer channel encoder is self-attention mechanism calculation, used to model the spatio-temporal relationships between video sequence frames. First, the video sequence frames are divided frame by frame into N×N image blocks and normalized by LayerNorm (LN); sequence features are extracted through Sequential Attention (SA); dimension transformation is performed through LayerNorm (LN) and a multi-layer perceptron (MLP); frame-global feature self-attention extraction (Attention, AT) is then performed; and finally the results are spliced and output. The overall formulas are as follows:
F_{1...p} = {F_1, ..., F_p} = Patch(F)
F_SA = MLP(LN(SA(LN(F_{1...p}))))
F_t = AT(F_SA) ⊕ F_SA
wherein F_{1...p} represents the image blocks after F is segmented, F_SA represents the feature result after the sequential attention calculation, F_t represents the spatio-temporal features connected after the self-attention computation and the sequential attention computation, and ⊕ represents a connection (concatenation) symbol;
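By way of illustration, a simplified PyTorch sketch of the patch splitting, sequential attention (SA), LayerNorm/MLP transformation and self-attention (AT) steps is given below; the patch size, embedding width and the use of nn.MultiheadAttention for both attention stages are assumptions made for illustration only.

import torch
import torch.nn as nn

class TransformerChannelSketch(nn.Module):
    # Patch embedding, sequence attention (SA) over the patch tokens,
    # LayerNorm + MLP, then a second self-attention (AT) pass; the two
    # results are concatenated as the spatio-temporal feature F_t.
    def __init__(self, patch=16, in_ch=4, embed=128, heads=4):
        super().__init__()
        self.patch_embed = nn.Conv2d(in_ch, embed, kernel_size=patch, stride=patch)
        self.ln1 = nn.LayerNorm(embed)
        self.ln2 = nn.LayerNorm(embed)
        self.seq_attn = nn.MultiheadAttention(embed, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(embed, embed * 2), nn.GELU(),
                                 nn.Linear(embed * 2, embed))
        self.self_attn = nn.MultiheadAttention(embed, heads, batch_first=True)

    def forward(self, x):                                        # x: (B, C, H, W)
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)  # (B, L, E)
        normed = self.ln1(tokens)
        sa, _ = self.seq_attn(normed, normed, normed)            # sequential attention
        f_sa = self.mlp(self.ln2(sa))                            # LN + MLP
        at, _ = self.self_attn(f_sa, f_sa, f_sa)                 # self-attention
        return torch.cat([f_sa, at], dim=-1)                     # F_t: (B, L, 2E)

f_t = TransformerChannelSketch()(torch.randn(2, 4, 256, 256))    # -> (2, 256, 256)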
as shown in fig. 4, a feature fusion module is shown for fusing convolutions with features of different scales extracted from a transducer. The convolution operation is different from the characteristic dimension of a transducer, the image block dimension in the transducer is (1+L) x E, wherein 1 and L are respectively the identification position information and the number of the image blocks, and E represents the embedding dimension;
in the convolution, the feature dimensions are h×w×c, and correspond to the height, width and channel number, so in the transform channel, first, 1×1 convolution is used to align the space dimensions of the feature information with the corresponding stage of the convolution layer, then, the regularization process is performed on the feature information by using the Batch Norm (BN), finally, the feature information captured by each convolution block in the convolution channel is restored to the corresponding feature resolution by using the Interpolate operation, the dimension of the feature information captured by each convolution block is reduced by using the average pooling operation, and then, the feature dimension conversion and LayerNorm (LN) regularization modes are used to make the feature information identical to the feature dimensions in the transform channel, and the overall conversion formula is as follows:
F′_c = Interpolate(Conv(Reshape(F_t)))
F_ct = Concat(F′_c, F_c)
wherein F_t represents the spatio-temporal features obtained from the Transformer channel, F′_c is the converted convolution-format feature, F_c is the representation feature extracted by the convolution channel, and F_ct represents the final feature data output by the encoder; the Reshape operation represents feature format conversion, the Conv operation samples the input to a given size scale to unify the feature format, and Concat is a splicing function.
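By way of illustration, a PyTorch sketch of the Transformer-to-convolution alignment direction (Reshape, 1×1 Conv with BatchNorm, Interpolate, then Concat) is given below; the channel counts, the square patch grid and the bilinear interpolation mode are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureFusionSketch(nn.Module):
    # Aligns Transformer tokens with the convolution feature map and concatenates
    # them, mirroring F'_c = Interpolate(Conv(Reshape(F_t))) and F_ct = Concat(., .).
    def __init__(self, token_dim=256, conv_ch=256):
        super().__init__()
        self.align = nn.Sequential(nn.Conv2d(token_dim, conv_ch, kernel_size=1),
                                   nn.BatchNorm2d(conv_ch))

    def forward(self, f_t, f_c):
        b, l, e = f_t.shape                                      # tokens: (B, L, E)
        side = int(l ** 0.5)                                     # assume a square patch grid
        fmap = f_t.transpose(1, 2).reshape(b, e, side, side)     # Reshape
        fmap = self.align(fmap)                                  # 1x1 Conv + BN
        fmap = F.interpolate(fmap, size=f_c.shape[-2:],          # Interpolate
                             mode="bilinear", align_corners=False)
        return torch.cat([fmap, f_c], dim=1)                     # Concat -> F_ct

fuse = FeatureFusionSketch()
f_ct = fuse(torch.randn(2, 256, 256), torch.randn(2, 256, 32, 32))  # (2, 512, 32, 32)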
Step S4: inputting the final features into a Memory Network (Memory Network), and performing normal sample feature Memory storage and updating on the features finally obtained in the step S3 through Softmax function calculation to obtain diversified normal event data;
as shown in fig. 5, the memory network module mainly has a Softmax computing and updating memory function. In a memory network, M prototype features for recording data, each item being defined as M t In the process of outputting video frames to an incoming Decoder in a memory network through an Encoder, a Softmax operation is used to obtainComprises M t And encoder feature F ct The relation between the two is as follows:
step S5: the video features updated and stored through the memory network are transmitted into a Decoder (Decoder), and the next frame of the continuous video frame is reconstructed through deconvolution and residual connection operation, so that predicted frame data is generated.
Step S6: and (5) obtaining a final Mean Square Error (MSE) between the generated predicted frame and the reality in the calculation step, then carrying out back propagation through an Adam optimizer, updating network parameters, and finally obtaining a trained self-encoder model.
Step S7: the self-encoder model is tested by validating the data set.
Step S8: and calculating a Mean Square Error (MSE) between the predicted frame generated based on the verification data set and the real predicted frame, solving the Mean Square Error (MSE) of all groups of data, and obtaining the MSE error of the training data set.
Step S9: and repeating the steps S2 to S8 until the Mean Square Error (MSE) obtained in the step S8 is not greatly reduced, which indicates that the model training is basically finished and the network parameters are updated.
Step S10: and loading the network parameter model in the training task into the test task, transmitting the test data set data into the model for abnormal score calculation, and finally obtaining the evaluation score by calculating the area AUC under the ROC curve.
Fig. 6 shows the experimental results of various methods on the four public data sets UCSD Ped1, UCSD Ped2, CUHK Avenue and ShanghaiTech under the same experimental conditions. The metric is the area under the ROC curve (AUC), where the ROC curve is drawn, for different judgment thresholds, with the false alarm probability P(y|N) as the abscissa and the hit probability P(y|SN) as the ordinate. The experimental result of the best performing model under each condition is shown in bold in the table. As can be seen from fig. 6, the multi-scale high-dimensional feature analysis video anomaly detection method based on the end-to-end architecture achieves a larger improvement than the other methods: it is second only to the STAE method on the UCSD Ped1 data set and achieves the best results on the UCSD Ped2, CUHK Avenue and ShanghaiTech data sets.
Therefore, the multi-scale high-dimensional feature analysis unsupervised learning video anomaly detection method improves the accuracy of video frame prediction, enhances the diversity of the feature information of normal training samples, and improves the anomaly score assessment of video anomaly detection.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention and not for limiting it, and although the present invention has been described in detail with reference to the preferred embodiments, it will be understood by those skilled in the art that: the technical scheme of the invention can be modified or replaced by the same, and the modified technical scheme cannot deviate from the spirit and scope of the technical scheme of the invention.

Claims (6)

1. A multi-scale high-dimensional feature analysis unsupervised learning video anomaly detection method, characterized by comprising the following steps:
step S1: data preprocessing, namely dividing a test set and a training set video into image frames;
step S2: according to the training data set obtained in the step S1, carrying out batch processing on training set data, and transmitting the training set data into a self-encoder model, wherein each group of data comprises 5 continuous video frames;
step S3: the data are respectively transmitted into a convolution channel encoder and a Transformer channel encoder, and the features of the two are fused to serve as the representation features and spatio-temporal features;
step S4: inputting the features into a memory network, and performing normal sample feature memory storage and updating on the features obtained in the step S3 through Softmax function calculation to obtain diversified normal event data;
step S5: transmitting the updated and stored video characteristics into a decoder to generate predicted frame data;
step S6: calculating the mean square error between the predicted frame generated in step S5 and the real frame, carrying out back propagation, and updating network parameters to obtain a trained self-encoder model;
step S7: testing the self-encoder model with the verification data set;
step S8: calculating the mean square error between the predicted frames generated from the verification data set and the corresponding real frames, and averaging the mean square error over all groups of data to obtain the mean square error on the verification data set;
step S9: repeating the steps S2 to S8 until the mean square error obtained in the step S8 no longer decreases significantly;
step S10: loading the network parameter model from the training task into the test task, transmitting the test data set into the model for anomaly score calculation, and finally obtaining the evaluation score by calculating the area under the ROC curve (AUC).
2. The multi-scale high-dimensional feature analysis unsupervised learning video anomaly detection method according to claim 1, wherein step S2 specifically comprises:
step S21: setting, as required, the size (width and height) of each group of images, the length of the data sequence, and the sequence step length;
step S22: grouping by a sliding window mechanism, wherein the window length is the sequence length and the window moves by one frame each time;
step S23: after the training data grouping is completed, the same sliding window data processing is carried out on the test set data.
3. The multi-scale high-dimensional feature analysis unsupervised learning video anomaly detection method according to claim 1, wherein the end-to-end self-encoder model described in step S3 is composed of a convolution channel encoder, a Transformer channel encoder, a feature fusion module, a memory network, and a decoder (Decoder).
4. The multi-scale high-dimensional feature analysis unsupervised learning video anomaly detection method according to claim 1, wherein the specific method of step S3 is as follows:
the encoder is divided into a convolution channel encoder module and a Transformer channel encoder module;
the core module of the convolution channel encoder is a convolutional neural network, and N layers of convolution operations are used to perform feature extraction on the image; the convolution channel encoder module formula is as follows:
F_c^i = ReLU(BN(Conv(F_p^(i-1)))), with F_p^0 = F
F_p^i = MaxPool(F_c^i)
wherein F represents an input training sample; F_c^i represents the representation features extracted by convolution, i denoting the i-th layer of convolution feature extraction, BN is a feature normalization function, and ReLU is an activation function; F_p^i is the feature after resolution reduction, and the MaxPool function is a maximum pooling function;
the core module of the Transformer channel encoder is self-attention mechanism calculation: first, the video sequence frames are divided frame by frame into N×N image blocks and normalized by LayerNorm; sequence features are extracted through sequential attention; dimension transformation is performed through LayerNorm and a multi-layer perceptron; frame-global feature self-attention extraction is then performed; and finally the results are spliced and output, wherein the overall formulas are as follows:
F_{1...p} = {F_1, ..., F_p} = Patch(F)
F_SA = MLP(LN(SA(LN(F_{1...p}))))
F_t = AT(F_SA) ⊕ F_SA
wherein F_{1...p} represents the image blocks after F is segmented, F_SA represents the feature result after the sequential attention calculation, F_t represents the spatio-temporal features connected after the self-attention computation and the sequential attention computation, and ⊕ represents a connection (concatenation) symbol;
the image block dimension in the Transformer is (1+L)×E, wherein 1 and L are respectively the identification position information and the number of the image blocks, and E represents the embedding dimension;
in the convolution channel the feature dimensions are H×W×C, corresponding to height, width and number of channels; in the Transformer channel a 1×1 convolution is first used to align the spatial dimensions of the feature information with the corresponding stage of the convolution layer, BatchNorm regularization is then applied, and the Interpolate operation finally restores the features to the corresponding feature resolution; the dimension of the feature information captured by each convolution block in the convolution channel is reduced by an average pooling operation, and feature dimension transformation and LayerNorm regularization then make it identical to the feature dimensions in the Transformer channel, the overall transformation formulas being as follows:
F′_c = Interpolate(Conv(Reshape(F_t)))
F_ct = Concat(F′_c, F_c)
wherein F_t represents the spatio-temporal features obtained from the Transformer channel, F′_c is the converted convolution-format feature, F_c is the representation feature extracted by the convolution channel, and F_ct represents the final feature data output by the encoder; the Reshape operation represents feature format conversion, the Conv operation samples the input to a given size scale to unify the feature format, and Concat is a splicing function.
5. The multi-scale high-dimensional feature analysis unsupervised learning video anomaly detection method according to claim 1, wherein the specific method of the memory network module in step S4 is as follows:
in the memory network, M prototype features are used to record normal data, each item being defined as M_t (t = 1, ..., M); while the video frame features output by the encoder (Encoder) are passed through the memory network to the decoder (Decoder), a Softmax operation is used to obtain matching weights w_t that capture the relationship between M_t and the encoder feature F_ct, the overall formula being as follows:
w_t = exp(F_ct · M_t) / Σ_{t′=1}^{M} exp(F_ct · M_{t′})
6. The multi-scale high-dimensional feature analysis unsupervised learning video anomaly detection method according to claim 1, wherein the mean square error formula in step S6 is as follows:
MSE = (1/n) Σ_{i=1}^{n} (F̂_i - F_i)²
wherein F̂ is the predicted frame, F is the real frame, and n represents the video sequence length.
CN202311010650.XA 2023-08-11 2023-08-11 Multi-scale high-dimensional feature analysis unsupervised learning video anomaly detection method Pending CN117132919A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311010650.XA CN117132919A (en) 2023-08-11 2023-08-11 Multi-scale high-dimensional feature analysis unsupervised learning video anomaly detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311010650.XA CN117132919A (en) 2023-08-11 2023-08-11 Multi-scale high-dimensional feature analysis unsupervised learning video anomaly detection method

Publications (1)

Publication Number Publication Date
CN117132919A true CN117132919A (en) 2023-11-28

Family

ID=88850102

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311010650.XA Pending CN117132919A (en) 2023-08-11 2023-08-11 Multi-scale high-dimensional feature analysis unsupervised learning video anomaly detection method

Country Status (1)

Country Link
CN (1) CN117132919A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117556311A (en) * 2024-01-11 2024-02-13 电子科技大学 Unsupervised time sequence anomaly detection method based on multidimensional feature fusion
CN117556311B (en) * 2024-01-11 2024-03-19 电子科技大学 Unsupervised time sequence anomaly detection method based on multidimensional feature fusion

Similar Documents

Publication Publication Date Title
CN113936339B (en) Fighting identification method and device based on double-channel cross attention mechanism
CN108805015B (en) Crowd abnormity detection method for weighted convolution self-coding long-short term memory network
CN112287816B (en) Dangerous work area accident automatic detection and alarm method based on deep learning
CN111738363B (en) Alzheimer disease classification method based on improved 3D CNN network
WO2022083335A1 (en) Self-attention mechanism-based behavior recognition method
CN105138973A (en) Face authentication method and device
CN111709313B (en) Pedestrian re-identification method based on local and channel combination characteristics
CN111832484A (en) Loop detection method based on convolution perception hash algorithm
CN110930378B (en) Emphysema image processing method and system based on low data demand
CN112818969A (en) Knowledge distillation-based face pose estimation method and system
CN111738054A (en) Behavior anomaly detection method based on space-time self-encoder network and space-time CNN
CN116824239A (en) Image recognition method and system based on transfer learning and ResNet50 neural network
CN117132919A (en) Multi-scale high-dimensional feature analysis unsupervised learning video anomaly detection method
CN115131313A (en) Hyperspectral image change detection method and device based on Transformer
CN116563243A (en) Foreign matter detection method and device for power transmission line, computer equipment and storage medium
CN115546223A (en) Method and system for detecting loss of fastening bolt of equipment under train
CN114821299A (en) Remote sensing image change detection method
CN111242028A (en) Remote sensing image ground object segmentation method based on U-Net
CN116994175A (en) Space-time combination detection method, device and equipment for depth fake video
CN114387524B (en) Image identification method and system for small sample learning based on multilevel second-order representation
CN109800719B (en) Low-resolution face recognition method based on sparse representation of partial component and compression dictionary
CN114140524A (en) Closed loop detection system and method for multi-scale feature fusion
CN113537240A (en) Deformation region intelligent extraction method and system based on radar sequence image
CN112464989A (en) Closed loop detection method based on target detection network
CN111696070A (en) Multispectral image fusion power internet of things fault point detection method based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination