CN115439926A - Small sample abnormal behavior identification method based on key region and scene depth - Google Patents

Small sample abnormal behavior identification method based on key region and scene depth

Info

Publication number
CN115439926A
CN115439926A
Authority
CN
China
Prior art keywords
video
scene depth
feature
abnormal behavior
key area
Prior art date
Legal status
Pending
Application number
CN202210936032.7A
Other languages
Chinese (zh)
Inventor
肖进胜
吴原顼
眭海刚
姚韵涛
王中元
王澍瑞
Current Assignee
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date
Filing date
Publication date
Application filed by Wuhan University WHU filed Critical Wuhan University WHU
Priority to CN202210936032.7A
Publication of CN115439926A
Legal status: Pending


Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06V — IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/20 — Recognition of human movements or behaviour, e.g. gesture recognition
    • G06V10/42 — Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
    • G06V10/44 — Local feature extraction by analysis of parts of the pattern, e.g. edges, contours, loops, corners, strokes or intersections
    • G06V10/54 — Extraction of image or video features relating to texture
    • G06V10/56 — Extraction of image or video features relating to colour
    • G06V10/764 — Recognition using pattern recognition or machine learning, using classification, e.g. of video objects
    • G06V10/806 — Fusion of extracted features at the sensor, preprocessing, feature extraction or classification level
    • G06V10/82 — Recognition using neural networks
    • G06V20/40 — Scenes; scene-specific elements in video content
    • G06V20/52 — Surveillance or monitoring of activities, e.g. for recognising suspicious objects

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Human Computer Interaction (AREA)
  • Social Psychology (AREA)
  • Psychiatry (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a small sample abnormal behavior identification method based on key regions and scene depth. The method first performs random sparse sampling on a video and extracts global features from the sampled video frames. The global feature map is then input into a key region selection module based on weighted offsets to obtain a key region containing the abnormal behavior subject, from which local features are extracted; the global and local features are fused to obtain video-level RGB features. Next, the corresponding scene depth maps are extracted from the video frames, and the feature extraction steps are repeated on the scene depth maps to obtain video-level scene depth features. Finally, the video-level RGB features and scene depth features are fused into the final video-level features, which are input into a small sample classifier to obtain the abnormal behavior identification result. Aimed at identifying abnormal behaviors in surveillance scenes, the method improves accuracy, computational efficiency and robustness, and is suitable for surveillance videos with multiple moving targets and complex backgrounds.

Description

Small sample abnormal behavior identification method based on key region and scene depth
Technical Field
The invention belongs to the technical field of video image processing, and particularly relates to a small sample abnormal behavior identification method based on key areas and scene depths.
Background
Abnormal behavior identification uses deep learning and related technologies to intelligently recognize abnormal behaviors in surveillance video. In recent years, many high-impact public safety incidents have occurred in public places such as railway stations, subway stations and campuses, and how to effectively maintain public safety has become a focus of social attention. Identifying abnormal behaviors in time through surveillance cameras and giving early warning is an important means of maintaining public safety. However, the traditional approach of identifying abnormal behaviors manually is prone to false and missed detections because of fatigue caused by long working hours; it is therefore necessary to use computers to realize automatic, intelligent identification of abnormal behaviors in surveillance video.
A major difficulty of abnormal behavior identification is that abnormal behaviors occur with low probability, so the number of abnormal samples is small compared with normal behaviors. To address this, the concept of small sample abnormal behavior identification is introduced: through small sample learning, the recognition model gains the ability to identify brand-new abnormal behavior categories from only a few samples, alleviating the scarcity of abnormal samples. Small sample learning is generally based on the principle of meta-learning, i.e. learning commonalities from large amounts of other data in order to identify new classes. Conventional small sample learning models are usually metric-based: they model the distance between samples, drawing samples of the same class closer and pushing samples of different classes apart, and judge the similarity between samples, and hence the class of an unknown sample, from this distance. A further difficulty is that abnormal behaviors are complex: their manifestations differ across surveillance scenes and behavior subjects.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a small sample abnormal behavior identification method based on a key area and scene depth, and aims to improve the accuracy, the calculation efficiency and the robustness of abnormal behavior identification aiming at a monitoring scene.
In order to achieve the purpose, the technical scheme provided by the invention is a small sample abnormal behavior identification method based on a key area and scene depth, which comprises the following steps:
step 1, performing random sparse sampling on a video: dividing the video into N segments by frame count, randomly sampling M frames in each segment, and taking the total N×M frames as a representative of the video;
step 2, performing feature extraction on the video frame generated in the step 1 by using a global feature extraction network to obtain a two-dimensional global feature map and a one-dimensional global feature vector;
step 3, performing weighted offset-based key region selection on the two-dimensional global feature map extracted in the step 2 to obtain a key region containing abnormal behavior bodies in a video frame, and extracting one-dimensional local feature vectors of the key region by using a local feature extraction network;
step 3.1, performing space-time feature extraction and motion feature extraction on the two-dimensional global feature map extracted in the step 2 to generate a feature map with space-time information and object motion information;
step 3.2, selecting the central point of the key area based on the weighted offset;
step 3.3, obtaining pixel values of other points in the key area by bilinear interpolation;
step 3.4, inputting the key area into a local feature extraction network to obtain the local feature of the key area;
step 4, fusing the global feature vector extracted in the step 2 and the local feature vector extracted in the step 3 to obtain a video-level RGB feature vector;
step 5, processing the NxM frames generated in the step 1 by using a monocular scene depth estimation model to obtain a corresponding NxM frame scene depth map;
step 6, repeating the operations from the step 2 to the step 4 on the scene depth map extracted in the step 5 to obtain a video-level scene depth feature vector;
step 7, fusing the video level RGB feature vector extracted in the step 4 and the video level scene depth feature vector extracted in the step 6 to obtain a final video level feature vector;
and 8, inputting the video-level feature vector obtained in the step 7 into a small sample classifier to obtain a final abnormal behavior identification result.
In step 1, the video is first extracted into continuous video frames with the ffmpeg software, the frames are counted via the os library, equally divided into N parts, and M frames are randomly extracted from each part. The N×M frames are then further processed through the PIL library: if the width or height of a video frame is less than a, the shorter side is resized to a and the longer side to b. During training, the video frames are cropped at a random position with crop size a×a and randomly flipped vertically with 50% probability; when predicting abnormal behaviors in a video frame, only a center crop of size a×a is taken. Finally, the N×M frames are vectorized and normalized to obtain the representative of the video.
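The random sparse sampling of step 1 can be illustrated with a minimal numpy sketch over frame indices (the segment and frame counts below, N = 4 and M = 2, are illustrative defaults; this is not the ffmpeg/PIL pipeline itself):

```python
import numpy as np

def sparse_sample_indices(num_frames, n_segments=4, frames_per_segment=2, rng=None):
    """Split [0, num_frames) into n_segments equal parts and draw
    frames_per_segment random frame indices (without replacement) from
    each part, as in the random sparse sampling of step 1."""
    rng = np.random.default_rng(rng)
    bounds = np.linspace(0, num_frames, n_segments + 1, dtype=int)
    picks = []
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        seg = rng.choice(np.arange(lo, hi), size=frames_per_segment, replace=False)
        picks.extend(sorted(int(i) for i in seg))
    return picks
```

For a 100-frame video this yields N×M = 8 indices, two drawn from each quarter of the video, so the sampled frames cover the whole clip while remaining sparse.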
In step 2, the N×M video frames obtained in step 1 are input into a Resnet-50 network for feature extraction; the global feature map is taken from the output of the last convolutional layer of the Resnet-50 network, and the global feature vector from the input of the average pooling layer. The Resnet-50 loads parameters obtained by pre-training on the Kinetics video dataset.
Moreover, said step 3.1 comprises the following steps:
step 3.1.1, channel mean normalization is carried out on the input video frame feature map, namely, the input multiple channels are averaged to obtain single-channel output;
step 3.1.2, performing space-time feature extraction on the normalized feature map obtained in the step 3.1.1;
firstly, data reconstruction is carried out to interchange the time dimension and the channel dimension of the video frame feature map; the result is input into a three-dimensional convolution, which extracts the spatio-temporal information of the video frames; data reconstruction is then performed again to restore the dimensions, i.e. the time and channel dimensions are interchanged again; finally, normalization through a Sigmoid function yields the spatio-temporal feature map;
step 3.1.3, extracting the motion characteristics of the normalized space-time characteristic diagram obtained in the step 3.1.1;
firstly, the feature maps of the continuous video frames are separated to obtain a feature map for each individual frame; each frame's feature map is then input into a two-dimensional convolution to extract spatial features, and the difference between the two-dimensional convolution outputs of each frame and the adjacent next frame is computed:

X_out = K * X_{t+1} − X_t    (1)

where K represents the parameters of the two-dimensional convolution learned in training, and X_{t+1}, X_t denote the feature maps of the (t+1)-th and t-th frames respectively;
finally, all the resulting differences are concatenated and normalized through a Sigmoid function to obtain the motion feature map;
step 3.1.4, adding the feature maps output in the step 3.1.2 and the step 3.1.3 and the feature map generated in the step 3.1.1 by using a residual error structure to obtain a feature map with space-time information and motion information;
and 3.1.5, performing two-dimensional softmax operation on the feature map obtained in the step 3.1.4, namely inputting elements of each row and each column on the two-dimensional feature map into a softmax function, so that all the elements on the two-dimensional feature map are added to be 1.
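A toy numpy sketch of the frame-difference operation of equation (1) follows; the learned two-dimensional convolution K is replaced here by a scalar weight k purely for illustration, not as the trained kernel:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def motion_feature_map(frames, k=1.0):
    """Sketch of Eq. (1): X_out = K * X_{t+1} - X_t for each adjacent
    frame pair, followed by Sigmoid normalization. `frames` is a stack
    of per-frame feature maps; the learned 2-D convolution is replaced
    by the scalar weight k (illustrative stand-in only)."""
    diffs = [k * frames[t + 1] - frames[t] for t in range(len(frames) - 1)]
    # concatenate the T-1 differences and squash them into (0, 1)
    return sigmoid(np.stack(diffs))
```

A static clip (all-zero differences) maps every element to sigmoid(0) = 0.5, i.e. no motion emphasis anywhere, which matches the role of the motion branch as a saliency signal.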
In step 3.2, L points {a_i | i = 1, 2, …, L} are uniformly selected from the original image. The central point o of the original image is pointed to each point a_i, giving the offset vectors v_i = a_i − o. With the elements u_i of the feature map extracted in step 3.1 as weights, all offset vectors are weighted and summed to obtain the sum vector s = Σ_i u_i·v_i, and the point that s points to is the central point of the key area. The method comprises the following steps:
step 3.2.1, the central point of the key area is selected within the square area centered on the original image whose side length is the difference between the side length of the original image and the side length of the key area;
step 3.2.2, L points {a_i} are taken uniformly from the boundary of the square area selected in step 3.2.1; the number of points L is the same as the number of elements of the feature map with spatio-temporal information and motion information extracted in step 3.1, and the points correspond to the elements of the feature map in order from left to right and top to bottom;
step 3.2.3, the offset vectors v_i = a_i − o from the central point o of the original image to each point a_i are formed;
step 3.2.4, with the corresponding feature map elements u_i as weights, the offset vectors v_i are weighted and summed to obtain the sum vector s = Σ_i u_i·v_i; the point that s points to is the central point of the key area.
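The weighted-offset selection of step 3.2 can be sketched with plain arrays; the candidate points, attention weights and image center below are hypothetical stand-ins for the feature-map-driven quantities of the method:

```python
import numpy as np

def key_region_center(points, weights, origin):
    """Weighted-offset center selection (step 3.2 sketch): form offset
    vectors v_i = a_i - o from the image center o to each candidate
    point a_i, weight them by attention values u_i, and sum them; the
    point the sum vector s points to is the key-area center."""
    offsets = np.asarray(points, float) - np.asarray(origin, float)   # v_i = a_i - o
    s = (np.asarray(weights, float)[:, None] * offsets).sum(axis=0)   # s = sum_i u_i * v_i
    return np.asarray(origin, float) + s
```

With uniform weights on symmetric boundary points the offsets cancel and the center stays put; concentrating the attention weight on one point pulls the key-area center toward it.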
Furthermore, in step 3.3 the other points p_ij on the key area are first obtained by translating the central point c of the key area:

p_ij = c + Δ_ij    (2)

where Δ_ij denotes the coordinates of the offset vector of point p_ij relative to the central point c;

then the pixel value m_ij of each point p_ij on the key area is obtained by bilinear interpolation, the specific formula being:

m_ij = (1−u)(1−v)·(m_ij)_00 + (1−u)v·(m_ij)_01 + u(1−v)·(m_ij)_10 + uv·(m_ij)_11    (3)

where (p_ij)_00, (p_ij)_01, (p_ij)_10, (p_ij)_11 are the coordinates of the four neighbourhood points of p_ij, (m_ij)_00, (m_ij)_01, (m_ij)_10, (m_ij)_11 are the pixel values of the corresponding four neighbourhood points, and (u, v) are the fractional offsets of p_ij from the neighbourhood point (p_ij)_00.
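The bilinear interpolation of step 3.3 reduces to a weighted sum over the four integer-grid neighbours of a fractional location; a minimal sketch:

```python
import numpy as np

def bilinear(img, y, x):
    """Bilinearly interpolate img at the fractional location (y, x)
    from its four integer neighbourhood points, as in step 3.3."""
    y0, x0 = int(np.floor(y)), int(np.floor(x))   # top-left neighbour
    y1, x1 = y0 + 1, x0 + 1
    wy, wx = y - y0, x - x0                       # fractional offsets (u, v)
    return ((1 - wy) * (1 - wx) * img[y0, x0]
            + (1 - wy) * wx * img[y0, x1]
            + wy * (1 - wx) * img[y1, x0]
            + wy * wx * img[y1, x1])
```

At the exact midpoint of a 2×2 patch the result is the mean of the four pixel values, and at an integer location it returns that pixel unchanged.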
In step 3.4, the local feature extraction network uses a Resnet-50 model pre-trained on the Kinetics video dataset, with the original fixed-size 7×7 average pooling layer of Resnet-50 replaced by an adaptive average pooling layer, so that the local feature extraction network can extract features from key areas of relatively small size.
In step 4, the global feature vector and the local feature vector are concatenated end to end; the dimension of the fused video-level RGB feature vector is the sum of the dimensions of the global and local feature vectors.
In step 5, the video frames are input one by one into a self-supervised monocular scene depth estimation model, which uses a lightweight U2net encoder-decoder combination for scene depth estimation. The scene depth estimation model is combined with a pose estimation model for self-supervised training, the pose estimation model using a lightweight U2net encoder. The self-supervised training computes errors by reconstructing images and trains the models: the pose estimation model estimates the relative pose transformation matrix T_{t→t′} between the target image and the source image, and the scene depth estimation model predicts the scene depth map D_t of the target image; the encoder weights of the scene depth estimation model are shared with the lightweight U2net encoder of the pose estimation model. Assuming the camera parameters are unchanged in every image, the reconstructed target image K_{t′→t} is computed from the source image via the target scene depth map D_t, the relative pose transformation matrix T_{t→t′} and the camera intrinsic matrix M:

K_{t′→t} = K_{t′}⟨proj(D_t, T_{t→t′}, M)⟩    (4)

where K_{t′→t} is the target image reconstructed from the source image, K_{t′} is the source image, proj() denotes projecting the scene depth map D_t onto the source image K_{t′}, M is the camera intrinsic matrix, D_t is the scene depth map of the target image, T_{t→t′} is the relative pose transformation matrix between the target and source images, and ⟨⟩ denotes the sampling operation.
The reconstruction error L_p is obtained as the L1 distance between the reconstructed target image K_{t′→t} and the actual target image K_t:

L_p = pe(K_t, K_{t′→t}) = ‖K_t − K_{t′→t}‖_1    (5)

where pe() is the photometric reconstruction error, i.e. the L1 distance in pixel space.
The reconstruction error L_p is minimized by gradient descent to optimize the scene depth model parameters. During optimization, data enhancement is performed on the source and target images, specifically by randomly cropping a region of 1/2 or 1/4 of the original size from the center of the original image as new training data. The N×M video frames obtained in step 1 are then input one by one into the optimized model to generate the N×M-frame scene depth maps.
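As a minimal sketch of the supervision signal in step 5 (the L1 photometric error only, not the full depth-and-pose warping pipeline):

```python
import numpy as np

def photometric_l1(target, reconstructed):
    """Mean L1 photometric reconstruction error L_p between the actual
    target frame and the frame reconstructed from the source view,
    as used to train the self-supervised depth model (step 5 sketch)."""
    return float(np.abs(np.asarray(target, float)
                        - np.asarray(reconstructed, float)).mean())
```

A perfect reconstruction gives zero error; gradient descent on this quantity (with respect to the depth and pose network parameters) drives the reconstructed view toward the real target frame.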
Moreover, the RGB features and scene depth features in step 7 are fused adaptively: the RGB input feature vector is normalized to mean 0 and standard deviation 1, and its mean and standard deviation are then modulated using the scene depth information:

f(T_rgb, T_d) = fc_s(T_d) · (T_rgb − μ(T_rgb)) / σ(T_rgb) + fc_b(T_d)    (6)

where f(T_rgb, T_d) is the final video-level feature vector, T_rgb and T_d are the RGB feature vector and the scene depth feature vector respectively, the parameters fc_s and fc_b are learned by fully connected layer networks, and μ(T_rgb), σ(T_rgb) are the mean and standard deviation of the RGB feature vector over its L dimensions.
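The adaptive fusion of step 7 can be sketched as a normalize-then-modulate operation; the fully connected layers fc_s and fc_b are replaced below by plain affine maps with hypothetical weights (w_s, b_s, w_b, b_b), purely for illustration:

```python
import numpy as np

def adaptive_fuse(t_rgb, t_d, w_s, b_s, w_b, b_b, eps=1e-6):
    """Step 7 sketch: normalize the RGB feature vector to zero mean and
    unit standard deviation, then rescale and shift it with a scale
    fc_s(T_d) and bias fc_b(T_d) predicted from the depth features.
    w_s, b_s, w_b, b_b are stand-ins for the learned fc layers."""
    normed = (t_rgb - t_rgb.mean()) / (t_rgb.std() + eps)
    scale = w_s @ t_d + b_s     # plays the role of fc_s(T_d)
    bias = w_b @ t_d + b_b      # plays the role of fc_b(T_d)
    return scale * normed + bias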
In step 8 a small sample classifier, i.e. a prototype network, is trained. A surveillance-video abnormal behavior dataset is used as training data, divided into a support set and a prediction set: N classes are randomly drawn from the dataset, each containing K samples, and the N×K samples are input as the support set; a batch of samples drawn from the remaining data of the N classes serves as the prediction set of the model. When predicting unknown abnormal behaviors, the input data take the same form as during training, and the model learns to infer the labels of the prediction set samples from the support set. The features of the support and prediction set samples are first L2-normalized; for a vector X:

X′ = X / ‖X‖_2 = X / √(Σ_i x_i²)    (7)
Secondly, a trainable encoding network is used, consisting of two fully connected layers with a ReLU activation between them; the first fully connected layer maps 4096-dimensional inputs to 4096-dimensional outputs, and the second maps 4096 dimensions to 1024. Then the mean of the K samples of each class in the support set is computed and taken as the prototype B_i of that class, and the cosine similarity between a prediction set sample A and each prototype B_i is calculated:

sim(A, B_i) = (A · B_i) / (‖A‖ ‖B_i‖)    (8)

All cosine similarities are normalized into probabilities by a softmax function; the type of abnormal behavior is judged from these probabilities, and the abnormal behavior class with the highest probability is taken as the final recognition result.
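The prototype-network decision of step 8 can be sketched end to end (L2 normalization, cosine similarity against per-class prototypes, softmax); the trainable encoding network is omitted here for brevity:

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Eq. (7): X' = X / ||X||_2."""
    x = np.asarray(x, float)
    return x / (np.linalg.norm(x) + eps)

def classify(query, prototypes):
    """Step 8 sketch: prototypes are per-class means of support-set
    features; the query is scored by cosine similarity (Eq. (8)) against
    each prototype, the similarities are softmax-normalized, and the
    class with the highest probability is returned."""
    q = l2_normalize(query)
    sims = np.array([q @ l2_normalize(p) for p in prototypes])
    probs = np.exp(sims - sims.max())
    probs /= probs.sum()
    return int(probs.argmax()), probs
```

Because both the query and the prototypes are L2-normalized first, the dot product directly equals the cosine similarity of equation (8).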
Compared with the prior art, the invention has the following advantages:
the global characteristics of the abnormal behaviors and the local characteristics of key areas containing the main bodies of the abnormal behaviors are fused, and the RGB characteristics and the scene depth characteristics containing the moving objects and the background information are fused, so that the accuracy and the calculation efficiency of the abnormal behavior identification are improved, and the robustness of the monitoring video with multiple moving objects and complex backgrounds is realized.
Drawings
FIG. 1 is a flow chart of an embodiment of the present invention.
Fig. 2 is a structural diagram of a spatiotemporal feature extraction module and a motion feature extraction module according to an embodiment of the present invention.
Fig. 3 is a schematic diagram of a weighted-offset-based critical area selection process according to an embodiment of the present invention.
Fig. 4 is a flowchart of scene depth map extraction according to an embodiment of the present invention.
FIG. 5 is a flow chart of a small sample classifier according to an embodiment of the present invention.
Detailed Description
The invention provides a small sample abnormal behavior identification method based on a key area and scene depth; the technical scheme of the invention is further explained below with reference to the drawings and embodiments.
As shown in fig. 1, the process of the embodiment of the present invention includes the following steps:
step 1, performing random sparse sampling on a video, dividing the video into N sections according to the number of frames, randomly sampling M frames in each section, and taking the total N multiplied by M frames as a representative of the video.
The video is extracted into continuous video frames with the ffmpeg software, the frames are counted via the os library and equally divided into N parts (N is taken as 4 in this embodiment), and M frames are randomly extracted from each part (M is taken as 2 in this embodiment). The N×M frames are further processed through the PIL library: if the width or height of a video frame is less than 224, the shorter side is resized to 224 and the longer side to 256. During training, the video frames are randomly cropped to 224×224 and randomly flipped vertically with 50% probability; when predicting abnormal behaviors in a video frame, only a center crop of size 224×224 is taken. The N×M frames are vectorized and then normalized to obtain the representative of the video.
And 2, performing feature extraction on the video frame generated in the step 1 by using a global feature extraction network to obtain a two-dimensional global feature map and a one-dimensional global feature vector.
Inputting the NxM video frame data obtained in the step 1 or the NxM scene depth data obtained in the step 5 into a Resnet-50 network for feature extraction, obtaining a global feature map from the output of the last convolutional layer of the Resnet-50 network, and obtaining a global feature vector from the input of an average pooling layer. Resnet-50 loads parameters pre-trained from the Kinetics video data set.
And 3, performing weighted offset-based key region selection on the two-dimensional global feature map extracted in the step 2 to obtain a key region containing abnormal behavior bodies in the video frame, and extracting a one-dimensional local feature vector of the key region by using a local feature extraction network.
And 3.1, performing space-time feature extraction and motion feature extraction on the two-dimensional global feature map extracted in the step 2 to generate a feature map with space-time information and object motion information.
And 3.1.1, performing channel mean normalization on the input video frame characteristic diagram, namely averaging a plurality of input channels to obtain single-channel output.
And 3.1.2, performing space-time feature extraction on the normalized feature map obtained in the step 3.1.1.
Firstly, data reconstruction is carried out to realize the dimension conversion. The feature maps of the continuous video frames are input in the form N×T×C×H×W, where N is the number of samples in a batch, T is the time dimension of the video frames, C is the channel dimension, H is the video frame height and W is the video frame width. The time dimension and the channel dimension of the video frame feature map are interchanged so that the subsequent convolution can handle the time dimension together with the two spatial dimensions. The result is then input into a three-dimensional convolution, which extracts the spatio-temporal information of the video frames; data reconstruction is performed again to restore the dimensions, i.e. the time and channel dimensions are interchanged once more; finally, normalization through a Sigmoid function yields the spatio-temporal feature map.
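The two data reconstructions amount to transposing the time and channel axes around the three-dimensional convolution; a numpy sketch with hypothetical sizes N = 1, T = 8, C = 16, H = W = 14:

```python
import numpy as np

x = np.zeros((1, 8, 16, 14, 14))                       # N x T x C x H x W
x_swapped = np.transpose(x, (0, 2, 1, 3, 4))           # N x C x T x H x W
# ... the three-dimensional convolution would operate here on (T, H, W) ...
x_restored = np.transpose(x_swapped, (0, 2, 1, 3, 4))  # swap back: N x T x C x H x W
```

After the first transpose the 3-D convolution sees time as a spatial-like axis alongside H and W; the second transpose restores the original layout for the following stages.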
And 3.1.3, extracting the motion characteristics of the normalized space-time characteristic diagram obtained in the step 3.1.1.
Firstly, the feature maps of the continuous video frames are separated to obtain a feature map for each individual frame; each frame's feature map is then input into a two-dimensional convolution to extract spatial features, and the difference between the two-dimensional convolution outputs of each frame and the adjacent next frame is computed:

X_out = K * X_{t+1} − X_t    (1)

where K represents the parameters of the two-dimensional convolution learned in training, and X_{t+1}, X_t denote the input feature maps of the (t+1)-th and t-th frames respectively;
and finally, connecting all the obtained difference values and carrying out normalization through a Sigmoid function to obtain a motion characteristic diagram.
And 3.1.4, adding the characteristic diagrams output in the steps 3.1.2 and 3.1.3 and the characteristic diagram generated in the step 3.1.1 by using a residual error structure to obtain a characteristic diagram with space-time information and motion information.
And 3.1.5, performing two-dimensional softmax operation on the feature map obtained in the step 3.1.4, namely inputting elements of each row and each column on the two-dimensional feature map into a softmax function, so that all the elements on the two-dimensional feature map are added to be 1.
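The two-dimensional softmax of step 3.1.5 simply feeds every element of the feature map through one softmax so that the whole map sums to 1; a minimal sketch:

```python
import numpy as np

def softmax2d(fmap):
    """Step 3.1.5 sketch: softmax over all H*W positions of a 2-D
    feature map, so the elements form a spatial attention distribution
    that sums to 1 (max subtracted for numerical stability)."""
    fmap = np.asarray(fmap, float)
    e = np.exp(fmap - fmap.max())
    return e / e.sum()
```

The resulting map is exactly the set of weights u_i used by the weighted-offset selection in step 3.2: larger activations receive larger shares of the unit mass.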
And 3.2, selecting the central point of the key area based on the weighted deviation.
Uniformly take L points {a_i | i = 1, 2, …, L} from the original image (the video frame obtained in step 1 or the scene depth map obtained in step 5). The central point o of the original image is pointed to each point a_i, giving the offset vectors v_i = a_i − o. With the elements u_i of the feature map extracted in step 3.1 as weights, all offset vectors are weighted and summed to obtain the sum vector s = Σ_i u_i·v_i; the point that s points to is the central point of the key area.
Step 3.2.1: as shown in Fig. 3(a), since the intercepted key area has a fixed size, its center point can only be selected within a limited range of the original image, namely a square region at the center of the original image whose side length equals the difference between the side length of the original image and the side length of the key area.
Step 3.2.2: as shown in Fig. 3(b), uniformly take L points a_1, a_2, …, a_L from the boundary of the square region selected in step 3.2.1. The number of points L equals the number of elements of the feature map with spatio-temporal information and motion information extracted in step 3.1, and the points correspond one-to-one to the elements of the feature map in left-to-right, top-to-bottom spatial order.

Step 3.2.3: as shown in Fig. 3(c), take the offset vector v_i pointing from the center point of the original image to each point a_i.

Step 3.2.4: as shown in Fig. 3(d), use the corresponding feature map element u_i as the weight of offset vector v_i, and sum the weighted offset vectors to obtain the sum vector V = Σ_i u_i·v_i. The point that V points to is the center point of the key area.
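The center-point selection just described can be sketched as follows. The way the L boundary points are placed and all sizes are illustrative assumptions, not the patent's exact layout:

```python
import numpy as np

def square_boundary_points(half, L):
    """L points spaced uniformly along the perimeter of a square of half-side
    `half` centered at the origin (an illustrative placement of the a_i)."""
    pts = []
    for s in np.linspace(0.0, 8.0 * half, L, endpoint=False):
        k, r = divmod(s, 2.0 * half)
        k = int(k)
        if k == 0:   pts.append((-half + r, -half))   # bottom edge
        elif k == 1: pts.append(( half, -half + r))   # right edge
        elif k == 2: pts.append(( half - r,  half))   # top edge
        else:        pts.append((-half,  half - r))   # left edge
    return np.array(pts)

def key_region_center(image_side, region_side, weights):
    """Weights u_i (the softmax-normalized feature map elements) weight the
    offset vectors v_i from the image center to boundary points a_i; the sum
    vector V points at the key-region center."""
    half = (image_side - region_side) / 2.0        # candidate square half-side
    offsets = square_boundary_points(half, len(weights))   # v_i
    v = (weights[:, None] * offsets).sum(axis=0)           # V = sum_i u_i * v_i
    return np.array([image_side / 2.0, image_side / 2.0]) + v

# with uniform weights the offsets cancel: the center stays at the image center
c = key_region_center(224.0, 112.0, np.full(4, 0.25))
```

When the feature map concentrates its weight on one side, the sum vector pulls the key-region center toward that side, which is the intended behavior of the weighted offset.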
Step 3.3: obtain the pixel values of the other points in the key area by bilinear interpolation.
Since the size of the key area is fixed, the whole key area is determined once its center point m_0 is determined: the other points m_ij on the key area are obtained by translating the center point m_0, i.e.:

m_ij = m_0 + Δm_ij    (2)

where Δm_ij denotes the coordinates of the offset vector of each point on the key area relative to the center point.
Since the coordinates of the center point of the key area obtained by the weighted offset in step 3.2 may contain decimals, the coordinates of the points on the key area are not integers and have no directly corresponding pixel values in the original image. A bilinear interpolation method is therefore adopted: the pixel value g(m_ij) of each point m_ij on the key area is obtained from the four nearest points of the original image adjacent to it, with the specific formula:

g(m_ij) = (1−u)(1−v)·(m_ij)_00 + u(1−v)·(m_ij)_01 + (1−u)v·(m_ij)_10 + uv·(m_ij)_11    (3)

where u and v are the fractional parts of the coordinates of m_ij relative to its four neighborhood points, and (m_ij)_00, (m_ij)_01, (m_ij)_10, (m_ij)_11 are the pixel values of the corresponding four neighborhood points. Because the coordinates of the points on the area are continuous rather than discrete, accurate key areas can be found for abnormal-behavior subjects at different positions.
And 3.4, inputting the key area into a local feature extraction network to obtain the local feature of the key area.
The local feature extraction network likewise uses a Resnet-50 model pre-trained on the Kinetics video data set. It differs from the global feature extraction network in that the original fixed-size 7 average pooling layer of Resnet-50 is replaced with an adaptive average pooling layer, so that the local feature extraction network can extract features from the relatively small key area.
And 4, fusing the global feature vector extracted in the step 2 and the local feature vector extracted in the step 3 to obtain a video-level RGB feature vector.
The local and global features are fused by concatenation: the global feature vector and the local feature vector are connected end to end, so the dimension of the fused video-level RGB feature vector is the sum of the global feature vector dimension and the local feature vector dimension.
And 5, processing the NxM frames generated in the step 1 by using the monocular scene depth estimation model to obtain a corresponding NxM frame scene depth map.
The video frames are input one by one into a self-supervised monocular scene depth estimation model, which performs scene depth estimation with a combination of a lightweight U2net model encoder and decoder. The scene depth estimation model is trained in a self-supervised manner together with another model, using a lightweight U2net encoder, that performs pose estimation. Data enhancement is applied to the input source and target images during training, which makes the extracted scene depth maps robust to noise while still effectively representing scene and target information.
In the self-supervised training, the model is trained on the error computed from image reconstruction. The specific process is as follows. The pose estimation model estimates the relative pose transformation matrix T_(t→t') between the target image and the source image; this model uses a lightweight U2net encoder. Compared with the general U2net model, each residual U block of the lightweight U2net has fewer input, intermediate and output channels, so the model occupies less space and computes faster. The scene depth estimation model then predicts the scene depth map D_t of the target image; it uses a lightweight U2net encoder-decoder combination whose encoder weights are shared with the lightweight U2net encoder of the pose estimation model. From the scene depth map D_t of the target image, the relative pose transformation matrix T_(t→t'), and the camera intrinsic matrix M (assuming the camera intrinsics are constant across images), the reconstructed target image K_(t'→t) is computed from the source image with the formula:
K_(t'→t) = K_(t')[proj(D_t, T_(t→t'), M)]    (4)
where K_(t'→t) is the target image reconstructed from the source image, K_(t') is the source image, proj() denotes projecting the scene depth map D_t onto the source image K_(t'), M is the camera intrinsic matrix, D_t is the scene depth map of the target image, T_(t→t') is the relative pose transformation matrix between the target image and the source image, and [] denotes the sampling operation.
The reconstruction error L_p is obtained by computing the L1 distance between the reconstructed target image K_(t'→t) and the actual target image K_t:

L_p = pe(K_t, K_(t'→t))    (5)

where pe() is the photometric reconstruction error, i.e., the L1 distance in pixel space.
The reconstruction error L_p is minimized with a gradient descent method to optimize the scene depth model parameters. During optimization, data enhancement is applied to the source and target images: a region of 1/2 or 1/4 the size of the original image is randomly cropped from the center of the original image (the source image or the target image) and used as new training data. The N×M video frames obtained in step 1 are then input one by one into the optimized model to generate N×M scene depth maps, which effectively represent the moving objects and the background in the monitored scene.
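The self-supervision signal above reduces to an L1 photometric distance between the reconstructed and actual target images; a minimal sketch, assuming the error is averaged over pixels:

```python
import numpy as np

def photometric_l1(k_reconstructed, k_target):
    """Sketch of the reconstruction error L_p: the L1 distance in pixel space
    between the reconstructed target image K_(t'->t) and the target image K_t,
    averaged over pixels (an assumed reduction)."""
    return np.abs(k_reconstructed - k_target).mean()

k_t = np.zeros((4, 4))           # toy target image
k_rec = np.full((4, 4), 0.5)     # toy reconstructed image
loss = photometric_l1(k_rec, k_t)
```

In practice the reconstruction K_(t'→t) comes from warping the source image through D_t, T_(t→t') and M as in formula (4); here it is a fixed toy array so the loss itself is easy to inspect.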
And 6, repeating the operations from the step 2 to the step 4 on the scene depth map extracted in the step 5 to obtain a video-level scene depth feature vector.
And 7, fusing the video level RGB feature vector extracted in the step 4 and the video level scene depth feature vector extracted in the step 6 to obtain a final video level feature vector.
The RGB features and the scene depth features are fused adaptively: the RGB input feature vector is standardized to mean 0 and standard deviation 1, and its mean and standard deviation are then changed using the scene depth information, computed as follows:
f(T_rgb, T_d) = fc_s(T_d) · (T_rgb − μ(T_rgb)) / σ(T_rgb) + fc_b(T_d)    (6)

where f(T_rgb, T_d) is the final video-level feature vector, T_rgb and T_d represent the RGB feature vector and the scene depth feature vector respectively, the parameters fc_s and fc_b are learned by fully connected networks, and μ(T_rgb), σ(T_rgb) represent the mean and standard deviation of the RGB feature vector over its L dimensions.
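The adaptive fusion described above can be sketched as follows. The matrices standing in for the fully connected layers fc_s and fc_b are random hypothetical weights, not trained parameters:

```python
import numpy as np

# Sketch of the adaptive RGB / scene-depth fusion: standardize the RGB feature
# vector to mean 0 and std 1, then re-scale and re-shift it with values derived
# from the scene depth feature through two linear maps (stand-ins for the
# trained fully connected layers fc_s and fc_b).
rng = np.random.default_rng(0)
L = 8
t_rgb = rng.normal(size=L)                               # RGB feature vector
t_d = rng.normal(size=L)                                 # scene depth feature vector

w_s, b_s = rng.normal(size=(L, L)), rng.normal(size=L)   # hypothetical fc_s weights
w_b, b_b = rng.normal(size=(L, L)), rng.normal(size=L)   # hypothetical fc_b weights

scale = w_s @ t_d + b_s                                  # fc_s(T_d)
shift = w_b @ t_d + b_b                                  # fc_b(T_d)
standardized = (t_rgb - t_rgb.mean()) / t_rgb.std()      # mean 0, std 1
fused = scale * standardized + shift                     # final video-level feature
```

The design lets the depth stream modulate the statistics of the RGB stream rather than simply being concatenated with it, so the fused vector keeps the RGB dimensionality.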
And 8, inputting the video-level feature vector obtained in the step 7 into a small sample classifier to obtain a final abnormal behavior identification result.
First, the small sample classifier, i.e., a prototype network, is trained with a surveillance video abnormal behavior data set as training data. The training data are divided into a support set and a prediction set: N classes are randomly drawn from the data set, and K samples of each class (N×K samples in total) are input as the support set; a batch of samples drawn from the remaining data of the N classes serves as the prediction set of the model. When predicting unknown abnormal behaviors, the input data take the same form as during training, and the model learns to judge the labels of the prediction set samples through the support set. The features of the support set and prediction set samples are first L2-normalized; the L2 normalization formula for a vector X is:
X_norm = X / ||X||_2    (7)
The features then pass through a trainable encoding network consisting of two fully connected layers with a Relu activation function between them; the first fully connected layer maps a 4096-dimensional input vector to a 4096-dimensional output, and the second maps a 4096-dimensional input to a 1024-dimensional output. The encoding network reduces the dimension of the feature vectors and further improves their representational power: the two fully connected layers increase the number of trainable parameters, and the activation function between them strengthens the nonlinear expressive capability. Then the mean of the K samples of each class in the support set is computed and taken as the prototype B_i of that class, and the cosine similarity between each prediction set sample A and each prototype B_i is calculated with the formula:
sim(A, B_i) = (A · B_i) / (||A|| ||B_i||)    (8)
Finally, the normalized probabilities of all cosine similarities are obtained through a softmax function; the type of abnormal behavior is judged from these probabilities, and the abnormal behavior with the largest probability is taken as the final recognition result.
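The prototype-network classification of step 8 can be sketched as below, assuming the encoding network has already produced the feature vectors; shapes and data are illustrative:

```python
import numpy as np

def l2_normalize(x):
    """L2 normalization, as in formula (7)."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def classify(support, query):
    """support: (N_classes, K, D) features; query: (D,) feature.
    Average each class's K support samples into a prototype B_i, score the
    query A by cosine similarity, and normalize the scores with softmax."""
    support = l2_normalize(support)
    prototypes = support.mean(axis=1)                      # B_i, one per class
    q = l2_normalize(query)
    sims = np.array([q @ b / (np.linalg.norm(b) + 1e-12) for b in prototypes])
    probs = np.exp(sims) / np.exp(sims).sum()              # softmax
    return int(probs.argmax()), probs

support = np.stack([np.tile([1.0, 0.0], (3, 1)),           # class 0: K=3 samples
                    np.tile([0.0, 1.0], (3, 1))])          # class 1: K=3 samples
label, probs = classify(support, np.array([0.9, 0.1]))     # query near class 0
```

Because classification reduces to a nearest-prototype comparison, new abnormal-behavior classes can be recognized from only K labeled examples without retraining the feature extractor.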
The specific embodiments described herein are merely illustrative of the spirit of the invention. Various modifications or additions may be made to the described embodiments or alternatives may be employed by those skilled in the art without departing from the spirit or ambit of the invention as defined in the appended claims.

Claims (10)

1. A small sample abnormal behavior identification method based on key areas and scene depths is characterized by comprising the following steps:
step 1, sparsely and randomly sampling a video: dividing the video into N sections according to the number of frames, randomly sampling M frames in each section, and taking the total N×M frames as a representative of the video;
step 2, performing feature extraction on the video frame generated in the step 1 by using a global feature extraction network to obtain a two-dimensional global feature map and a one-dimensional global feature vector;
step 3, performing weighted offset-based key region selection on the two-dimensional global feature map extracted in the step 2 to obtain a key region containing abnormal behavior bodies in the video frame, and extracting a one-dimensional local feature vector of the key region by using a local feature extraction network;
step 3.1, performing space-time feature extraction and motion feature extraction on the two-dimensional global feature map extracted in the step 2 to generate a feature map with space-time information and object motion information;
3.2, selecting the central point of the key area based on the weighted deviation;
3.3, obtaining pixel values of other points in the key area by utilizing bilinear interpolation;
step 3.4, inputting the key area into a local feature extraction network to obtain the local feature of the key area;
step 4, fusing the global feature vector extracted in the step 2 and the local feature vector extracted in the step 3 to obtain a video-level RGB feature vector;
connecting the global feature vector and the local feature vector end to end, wherein the dimension of the video-level RGB feature vector after fusion is the sum of the dimension of the global feature vector and the dimension of the local feature vector;
step 5, processing the NxM frames generated in the step 1 by using a monocular scene depth estimation model to obtain a corresponding NxM frame scene depth map;
step 6, repeating the operations from the step 2 to the step 4 on the scene depth map extracted in the step 5 to obtain a video-level scene depth feature vector;
step 7, fusing the video level RGB feature vector extracted in the step 4 and the video level scene depth feature vector extracted in the step 6 to obtain a final video level feature vector;
and 8, inputting the video-level feature vector obtained in the step 7 into a small sample classifier to obtain a final abnormal behavior identification result.
2. The method for identifying the abnormal behavior of the small sample based on the key area and the scene depth as claimed in claim 1, wherein: in the step 1, firstly, a video is extracted into continuous video frames through ffmpeg software, the number of the video frames is counted through an os library, the video frames are equally divided into N parts, and M frames are randomly extracted from each part; the nxm frames are then further processed through the PIL library: if the width or height of the video frame is less than a, the size of the shorter side is adjusted to a, and the size of the longer side is adjusted to b; when the video frame is used for training, random position cutting is carried out on the video frame, the cutting size is a multiplied by a, and random vertical turning is carried out with the probability of 50%; when abnormal behaviors in a video frame are predicted, only the center position of the video frame is cut, and the cutting size is a multiplied by a; and finally, vectorizing the N multiplied by M frames and then normalizing the N multiplied by M frames to be used as a representative of the video.
3. The method for identifying the abnormal behavior of the small sample based on the key area and the scene depth as claimed in claim 1, wherein: and 2, inputting the NxM video frame data obtained in the step 1 into a Resnet-50 network for feature extraction, obtaining a global feature map from the output of the last convolutional layer of the Resnet-50 network, obtaining a global feature vector from the input of an average pooling layer, and loading parameters obtained by pre-training on a Kinetics video data set by the Resnet-50 network.
4. The method for identifying the abnormal behavior of the small sample based on the key area and the scene depth as claimed in claim 3, characterized in that: step 3.1 comprises the following steps:
step 3.1.1, channel mean normalization is carried out on the input video frame feature map, namely, the input multiple channels are averaged to obtain single-channel output;
step 3.1.2, performing space-time feature extraction on the normalized feature map obtained in the step 3.1.1;
firstly, data reconstruction is carried out, the time dimension and the channel dimension of a video frame feature graph are exchanged through the data reconstruction, then three-dimensional convolution is input, the three-dimensional convolution network can extract the spatio-temporal information of a video frame, then the data reconstruction is carried out again to recover the dimension, namely, the time dimension and the channel dimension are exchanged again, and finally the normalization is carried out through a Sigmoid function to obtain a spatio-temporal feature graph;
step 3.1.3, extracting the motion characteristics of the normalized spatio-temporal characteristic diagram obtained in the step 3.1.1;
firstly, time dispersion is carried out, namely feature maps represented by continuous video frames are separated to obtain feature maps represented by each frame, then the feature maps represented by the frames are respectively input into a two-dimensional convolution to extract spatial features, and the difference of two-dimensional convolution output of each frame and an adjacent next frame is solved, namely:
X_out = K * X_(t+1) − X_t    (1)

where K represents the parameters learned by training the two-dimensional convolution, and X_(t+1), X_t represent the feature maps input at frame t+1 and frame t, respectively;
finally, connecting all the obtained difference values and normalizing the difference values through a Sigmoid function to obtain a motion characteristic graph;
step 3.1.4, adding the feature maps output in the step 3.1.2 and the step 3.1.3 and the feature map generated in the step 3.1.1 by using a residual error structure to obtain a feature map with space-time information and motion information;
and 3.1.5, performing two-dimensional softmax operation on the feature map obtained in the step 3.1.4, namely inputting elements of each row and each column on the two-dimensional feature map into a softmax function, so that all the elements on the two-dimensional feature map are added to be 1.
5. The method for identifying the abnormal behavior of the small sample based on the key area and the scene depth as claimed in claim 1, wherein: step 3.2 is to uniformly take L points a_1, a_2, …, a_L from the original image, take the offset vector v_i pointing from the center point of the original image to each point a_i, and, using the elements u_i of the feature map extracted in step 3.1 as weights, weight and sum all the offset vectors to obtain the sum vector V = Σ_i u_i·v_i; the point that V points to is the center point of the key area, comprising the following steps:
step 3.2.1, selecting a central point of the key area in a square area with the side length being the difference between the side length of the original image and the side length of the key area and being positioned at the center of the original image;
step 3.2.2, uniformly taking L points from the boundary of the square area selected in the step 3.2.1
Figure FDA0003783402150000035
The number L of the points is the same as the number of elements on the feature map with the spatio-temporal information and the motion information extracted in the step 3.1, and the points respectively correspond to the elements on the feature map from left to right and from top to bottom in the space;
step 3.2.3, the central point of the original picture is taken to point to each point a i Is an offset vector
Figure FDA0003783402150000036
Step 3.2.4, corresponding element u on the characteristic diagram i As weights for offset vectors, displacement vectors
Figure FDA0003783402150000037
Summing after weighting to obtain sum vector
Figure FDA0003783402150000038
Sum vector
Figure FDA0003783402150000039
The pointed point is the center point of the key area.
6. The method for identifying the abnormal behavior of the small sample based on the key area and the scene depth as claimed in claim 1, wherein: step 3.3 first translates the center point m_0 of the key area to obtain the other points m_ij on the key area, i.e.:

m_ij = m_0 + Δm_ij    (2)

where Δm_ij denotes the coordinates of the offset vector of each point on the key area relative to the center point;

then the pixel value g(m_ij) of each point m_ij on the key area is obtained by bilinear interpolation, with the specific formula:

g(m_ij) = (1−u)(1−v)·(m_ij)_00 + u(1−v)·(m_ij)_01 + (1−u)v·(m_ij)_10 + uv·(m_ij)_11    (3)

where u and v are the fractional parts of the coordinates of m_ij relative to its four neighborhood points, and (m_ij)_00, (m_ij)_01, (m_ij)_10, (m_ij)_11 are the pixel values of the corresponding four neighborhood points.
7. The method for identifying the abnormal behavior of the small sample based on the key area and the scene depth as claimed in claim 1, wherein: in step 3.4, the local feature extraction network uses a Resnet-50 model pre-trained on a Kinetics video data set, and changes an average pooling layer of the Resnet-50 with an original fixed step length of 7 into an adaptive average pooling layer so as to realize feature extraction of the local feature extraction network on a key area with a relatively small size.
8. The method for identifying the abnormal behavior of the small sample based on the key area and the scene depth as claimed in claim 1, wherein: in step 5, the video frames are input one by one into a self-supervised monocular scene depth estimation model, which performs scene depth estimation with a combination of a lightweight U2net model encoder and decoder; the scene depth estimation model is combined for self-supervised training with a pose estimation model that uses a lightweight U2net model encoder, the self-supervised training training the model on errors computed by reconstructing images; the specific process comprises estimating the relative pose transformation matrix T_(t→t') between the target image and the source image by using the pose estimation model, and predicting the scene depth map D_t of the target image by using the scene depth estimation model, whose encoder weights are shared with the lightweight U2net encoder of the pose estimation model; assuming that the camera intrinsics of each image are unchanged, the reconstructed target image K_(t'→t) is computed from the source image using the scene depth map D_t of the target image, the relative pose transformation matrix T_(t→t'), and the camera intrinsic matrix M, with the calculation formula:
K_(t'→t) = K_(t')[proj(D_t, T_(t→t'), M)]    (4)
where K_(t'→t) is the reconstructed target image computed from the source image, K_(t') is the source image, proj() denotes projecting the scene depth map D_t onto the source image K_(t'), M is the camera intrinsic matrix, D_t is the scene depth map of the target image, T_(t→t') is the relative pose transformation matrix between the target image and the source image, and [] denotes a sampling operation;
the reconstruction error L_p is obtained by calculating the L1 distance between the reconstructed target image K_(t'→t) and the actual target image K_t:

L_p = pe(K_t, K_(t'→t))    (5)

where pe() is the photometric reconstruction error, i.e., the L1 distance in pixel space;
the reconstruction error L_p is minimized with a gradient descent method to optimize the scene depth model parameters; during optimization, data enhancement is performed on the source and target images, specifically by randomly cropping from the center of the original image a region of 1/2 or 1/4 the size of the original image as new training data; the N×M video frames obtained in step 1 are then input one by one into the optimized model to generate the N×M-frame scene depth map.
9. The method for identifying the abnormal behavior of the small sample based on the key area and the scene depth as claimed in claim 1, wherein: in step 7, a self-adaptive fusion mode is adopted for the fusion mode of the RGB features and the scene depth features, the specific process is to standardize RGB input feature vectors to enable the mean value to be 0 and the standard deviation to be 1, then the mean value and the standard deviation are changed by using scene depth information, and the calculation mode is as follows:
f(T_rgb, T_d) = fc_s(T_d) · (T_rgb − μ(T_rgb)) / σ(T_rgb) + fc_b(T_d)    (6)

where f(T_rgb, T_d) is the final video-level feature vector, T_rgb and T_d represent the RGB feature vector and the scene depth feature vector respectively, the parameters fc_s and fc_b are learned by a fully connected network, and μ(T_rgb), σ(T_rgb) represent the mean and standard deviation of the RGB feature vector over its L dimensions.
10. The method for identifying the abnormal behavior of the small sample based on the key area and the scene depth as claimed in claim 1, wherein: in step 8, training a small sample classifier, namely a prototype network, using a monitoring video abnormal behavior data set as training data, dividing the training data into a support set and a prediction set, randomly extracting N classes in the data set, wherein each class comprises K samples, and N multiplied by K data in total are input as the support set; extracting a batch of samples from the residual data in the N classes to be used as a prediction set of the model; when unknown abnormal behaviors are predicted, the form of input data is the same as that during training, and the model learns to judge the label of a prediction set sample through a support set; the characteristics of the support set and prediction set samples are firstly normalized by L2, and the L2 normalization formula of a vector X is as follows:
X_norm = X / ||X||_2    (7)
the features then pass through a trainable encoding network consisting of two fully connected layers with a Relu activation function between them; the first fully connected layer has an input vector dimension of 4096 and an output vector dimension of 4096, and the second fully connected layer has an input vector dimension of 4096 and an output vector dimension of 1024; then the mean of the K samples of each class in the support set is computed and taken as the prototype B_i of each class, and the cosine similarity between the prediction set sample A and each prototype B_i is calculated with the formula:
sim(A, B_i) = (A · B_i) / (||A|| ||B_i||)    (8)
and obtaining the normalized probability of all cosine similarity through a softmax function, judging the type of the abnormal behavior according to the probability, and taking the abnormal behavior with the maximum corresponding probability as a final recognition result.
CN202210936032.7A 2022-08-05 2022-08-05 Small sample abnormal behavior identification method based on key region and scene depth Pending CN115439926A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210936032.7A CN115439926A (en) 2022-08-05 2022-08-05 Small sample abnormal behavior identification method based on key region and scene depth

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210936032.7A CN115439926A (en) 2022-08-05 2022-08-05 Small sample abnormal behavior identification method based on key region and scene depth

Publications (1)

Publication Number Publication Date
CN115439926A true CN115439926A (en) 2022-12-06

Family

ID=84243155

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210936032.7A Pending CN115439926A (en) 2022-08-05 2022-08-05 Small sample abnormal behavior identification method based on key region and scene depth

Country Status (1)

Country Link
CN (1) CN115439926A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117392615A (en) * 2023-12-12 2024-01-12 南昌理工学院 Anomaly identification method and system based on monitoring video
CN117392615B (en) * 2023-12-12 2024-03-15 南昌理工学院 Anomaly identification method and system based on monitoring video

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination