CN115439926A - Small sample abnormal behavior identification method based on key region and scene depth - Google Patents

Small sample abnormal behavior identification method based on key region and scene depth

Info

Publication number
CN115439926A
CN115439926A
Authority
CN
China
Prior art keywords
video
scene depth
feature
abnormal behavior
key area
Prior art date
Legal status
Pending
Application number
CN202210936032.7A
Other languages
Chinese (zh)
Inventor
肖进胜
吴原顼
眭海刚
姚韵涛
王中元
王澍瑞
Current Assignee
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date
Filing date
Publication date
Application filed by Wuhan University WHU filed Critical Wuhan University WHU
Priority to CN202210936032.7A
Publication of CN115439926A
Legal status: Pending


Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06V — IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/20 — Recognition of human movements or behaviour, e.g. gesture recognition
    • G06V10/42 — Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
    • G06V10/44 — Local feature extraction by analysis of parts of the pattern, e.g. edges, contours, loops, corners, strokes or intersections
    • G06V10/54 — Extraction of image or video features relating to texture
    • G06V10/56 — Extraction of image or video features relating to colour
    • G06V10/764 — Recognition using pattern recognition or machine learning, using classification, e.g. of video objects
    • G06V10/806 — Fusion of extracted features at the sensor, preprocessing, feature extraction or classification level
    • G06V10/82 — Recognition using neural networks
    • G06V20/40 — Scenes; scene-specific elements in video content
    • G06V20/52 — Surveillance or monitoring of activities, e.g. for recognising suspicious objects

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Human Computer Interaction (AREA)
  • Social Psychology (AREA)
  • Psychiatry (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a small sample abnormal behavior identification method based on key regions and scene depth. The method first performs random sparse sampling on a video and extracts global features from the sampled video frames. The global feature map is then input into a key region selection module based on weighted offsets to obtain a key region containing the abnormal behavior subject, from which local features are extracted; the global and local features are fused to obtain video-level RGB features. Next, the corresponding scene depth maps are extracted from the video frames, and the feature extraction steps are repeated on the scene depth maps to obtain video-level scene depth features. Finally, the video-level RGB features and scene depth features are fused into the final video-level features, which are input into a small sample classifier to obtain the abnormal behavior identification result. Aimed at identifying abnormal behaviors in surveillance scenes, the method improves accuracy, computational efficiency and robustness, and is suitable for surveillance videos with multiple moving targets and complex backgrounds.

Description

Small sample abnormal behavior identification method based on key region and scene depth
Technical Field
The invention belongs to the technical field of video image processing, and particularly relates to a small sample abnormal behavior identification method based on key areas and scene depths.
Background
Abnormal behavior identification uses deep learning and related technologies to intelligently recognize abnormal behaviors in surveillance video. In recent years, many high-impact public safety incidents have occurred in public places such as railway stations, subway stations and campuses, and how to effectively maintain public safety has become a focus of social attention. Identifying abnormal behaviors in time through surveillance cameras and giving early warning is an important means of maintaining public safety. However, the traditional approach of identifying abnormal behaviors manually is prone to false and missed detections because of fatigue caused by long working hours; it is therefore necessary to use computers to realize automatic, intelligent identification of abnormal behaviors in surveillance video.
A major difficulty of abnormal behavior identification is that abnormal behaviors occur with low probability, so the number of abnormal samples is small compared with normal behaviors. To address this, the concept of small sample abnormal behavior identification is introduced: through small sample learning, the recognition model gains the ability to identify brand-new abnormal behavior categories from only a few samples, alleviating the scarcity of abnormal samples. Small sample learning is generally based on the principle of meta-learning, i.e. learning commonalities from large amounts of other data in order to identify new classes. Conventional small sample learning models are usually metric-based: they model the distance between samples, drawing samples of the same class closer and pushing samples of different classes apart, and judge the similarity between samples, and hence the class of an unknown sample, from this distance. A further difficulty is that abnormal behaviors are complex: their manifestations differ across surveillance scenes and behavior subjects.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a small sample abnormal behavior identification method based on a key area and scene depth, and aims to improve the accuracy, the calculation efficiency and the robustness of abnormal behavior identification aiming at a monitoring scene.
In order to achieve the purpose, the technical scheme provided by the invention is a small sample abnormal behavior identification method based on a key area and scene depth, which comprises the following steps:
step 1, performing random sparse sampling on a video: dividing the video into N segments by frame count, randomly sampling M frames in each segment, and taking the total N×M frames as a representative of the video;
step 2, performing feature extraction on the video frame generated in the step 1 by using a global feature extraction network to obtain a two-dimensional global feature map and a one-dimensional global feature vector;
step 3, performing weighted offset-based key region selection on the two-dimensional global feature map extracted in the step 2 to obtain a key region containing abnormal behavior bodies in a video frame, and extracting one-dimensional local feature vectors of the key region by using a local feature extraction network;
step 3.1, performing space-time feature extraction and motion feature extraction on the two-dimensional global feature map extracted in the step 2 to generate a feature map with space-time information and object motion information;
step 3.2, selecting the central point of the key area based on the weighted offset;
step 3.3, obtaining pixel values of other points in the key area by bilinear interpolation;
step 3.4, inputting the key area into a local feature extraction network to obtain the local feature of the key area;
step 4, fusing the global feature vector extracted in the step 2 and the local feature vector extracted in the step 3 to obtain a video-level RGB feature vector;
step 5, processing the NxM frames generated in the step 1 by using a monocular scene depth estimation model to obtain a corresponding NxM frame scene depth map;
step 6, repeating the operations from the step 2 to the step 4 on the scene depth map extracted in the step 5 to obtain a video-level scene depth feature vector;
step 7, fusing the video level RGB feature vector extracted in the step 4 and the video level scene depth feature vector extracted in the step 6 to obtain a final video level feature vector;
and 8, inputting the video-level feature vector obtained in the step 7 into a small sample classifier to obtain a final abnormal behavior identification result.
In step 1, the video is first extracted into continuous video frames with the ffmpeg software, the frames are counted via the os library, equally divided into N parts, and M frames are randomly extracted from each part. The N×M frames are then further processed through the PIL library: if the width or height of a video frame is less than a, the shorter side is resized to a and the longer side to b. During training, the video frames are cropped at a random position with crop size a×a and randomly flipped vertically with 50% probability; when predicting abnormal behaviors in a video frame, only a center crop of size a×a is taken. Finally, the N×M frames are vectorized and normalized to obtain the representative of the video.
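The random sparse sampling of step 1 can be illustrated with a minimal numpy sketch over frame indices (the segment and frame counts below, N = 4 and M = 2, are illustrative defaults; this is not the ffmpeg/PIL pipeline itself):

```python
import numpy as np

def sparse_sample_indices(num_frames, n_segments=4, frames_per_segment=2, rng=None):
    """Split [0, num_frames) into n_segments equal parts and draw
    frames_per_segment random frame indices (without replacement) from
    each part, as in the random sparse sampling of step 1."""
    rng = np.random.default_rng(rng)
    bounds = np.linspace(0, num_frames, n_segments + 1, dtype=int)
    picks = []
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        seg = rng.choice(np.arange(lo, hi), size=frames_per_segment, replace=False)
        picks.extend(sorted(int(i) for i in seg))
    return picks
```

For a 100-frame video this yields N×M = 8 indices, two drawn from each quarter of the video, so the sampled frames cover the whole clip while remaining sparse.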
In step 2, the N×M video frames obtained in step 1 are input into a Resnet-50 network for feature extraction; the global feature map is taken from the output of the last convolutional layer of the Resnet-50 network, and the global feature vector from the input of the average pooling layer. The Resnet-50 loads parameters obtained by pre-training on the Kinetics video dataset.
Moreover, said step 3.1 comprises the following steps:
step 3.1.1, channel mean normalization is carried out on the input video frame feature map, namely, the input multiple channels are averaged to obtain single-channel output;
step 3.1.2, performing space-time feature extraction on the normalized feature map obtained in the step 3.1.1;
firstly, data reconstruction is carried out to interchange the time dimension and the channel dimension of the video frame feature map; the result is input into a three-dimensional convolution, which extracts the spatio-temporal information of the video frames; data reconstruction is then performed again to restore the dimensions, i.e. the time and channel dimensions are interchanged again; finally, normalization through a Sigmoid function yields the spatio-temporal feature map;
step 3.1.3, extracting the motion characteristics of the normalized space-time characteristic diagram obtained in the step 3.1.1;
firstly, the feature maps of the continuous video frames are separated to obtain a feature map for each individual frame; each frame's feature map is then input into a two-dimensional convolution to extract spatial features, and the difference between the two-dimensional convolution outputs of each frame and the adjacent next frame is computed:

X_out = K * X_{t+1} − X_t    (1)

where K represents the parameters of the two-dimensional convolution learned in training, and X_{t+1}, X_t denote the feature maps of the (t+1)-th and t-th frames respectively;
finally, all the resulting differences are concatenated and normalized through a Sigmoid function to obtain the motion feature map;
step 3.1.4, adding the feature maps output in the step 3.1.2 and the step 3.1.3 and the feature map generated in the step 3.1.1 by using a residual error structure to obtain a feature map with space-time information and motion information;
and 3.1.5, performing two-dimensional softmax operation on the feature map obtained in the step 3.1.4, namely inputting elements of each row and each column on the two-dimensional feature map into a softmax function, so that all the elements on the two-dimensional feature map are added to be 1.
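A toy numpy sketch of the frame-difference operation of equation (1) follows; the learned two-dimensional convolution K is replaced here by a scalar weight k purely for illustration, not as the trained kernel:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def motion_feature_map(frames, k=1.0):
    """Sketch of Eq. (1): X_out = K * X_{t+1} - X_t for each adjacent
    frame pair, followed by Sigmoid normalization. `frames` is a stack
    of per-frame feature maps; the learned 2-D convolution is replaced
    by the scalar weight k (illustrative stand-in only)."""
    diffs = [k * frames[t + 1] - frames[t] for t in range(len(frames) - 1)]
    # concatenate the T-1 differences and squash them into (0, 1)
    return sigmoid(np.stack(diffs))
```

A static clip (all-zero differences) maps every element to sigmoid(0) = 0.5, i.e. no motion emphasis anywhere, which matches the role of the motion branch as a saliency signal.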
In step 3.2, L points {a_i | i = 1, 2, …, L} are uniformly selected from the original image. The central point o of the original image is pointed to each point a_i, giving the offset vectors v_i = a_i − o. With the elements u_i of the feature map extracted in step 3.1 as weights, all offset vectors are weighted and summed to obtain the sum vector s = Σ_i u_i·v_i, and the point that s points to is the central point of the key area. The method comprises the following steps:
step 3.2.1, the central point of the key area is selected within the square area centered on the original image whose side length is the difference between the side length of the original image and the side length of the key area;
step 3.2.2, L points {a_i} are taken uniformly from the boundary of the square area selected in step 3.2.1; the number of points L is the same as the number of elements of the feature map with spatio-temporal information and motion information extracted in step 3.1, and the points correspond to the elements of the feature map in order from left to right and top to bottom;
step 3.2.3, the offset vectors v_i = a_i − o from the central point o of the original image to each point a_i are formed;
step 3.2.4, with the corresponding feature map elements u_i as weights, the offset vectors v_i are weighted and summed to obtain the sum vector s = Σ_i u_i·v_i; the point that s points to is the central point of the key area.
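The weighted-offset selection of step 3.2 can be sketched with plain arrays; the candidate points, attention weights and image center below are hypothetical stand-ins for the feature-map-driven quantities of the method:

```python
import numpy as np

def key_region_center(points, weights, origin):
    """Weighted-offset center selection (step 3.2 sketch): form offset
    vectors v_i = a_i - o from the image center o to each candidate
    point a_i, weight them by attention values u_i, and sum them; the
    point the sum vector s points to is the key-area center."""
    offsets = np.asarray(points, float) - np.asarray(origin, float)   # v_i = a_i - o
    s = (np.asarray(weights, float)[:, None] * offsets).sum(axis=0)   # s = sum_i u_i * v_i
    return np.asarray(origin, float) + s
```

With uniform weights on symmetric boundary points the offsets cancel and the center stays put; concentrating the attention weight on one point pulls the key-area center toward it.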
Furthermore, in step 3.3 the other points p_ij on the key area are first obtained by translating the central point c of the key area:

p_ij = c + Δ_ij    (2)

where Δ_ij denotes the coordinates of the offset vector of point p_ij relative to the central point c;

then the pixel value m_ij of each point p_ij on the key area is obtained by bilinear interpolation, the specific formula being:

m_ij = (1−u)(1−v)·(m_ij)_00 + (1−u)v·(m_ij)_01 + u(1−v)·(m_ij)_10 + uv·(m_ij)_11    (3)

where (p_ij)_00, (p_ij)_01, (p_ij)_10, (p_ij)_11 are the coordinates of the four neighbourhood points of p_ij, (m_ij)_00, (m_ij)_01, (m_ij)_10, (m_ij)_11 are the pixel values of the corresponding four neighbourhood points, and (u, v) are the fractional offsets of p_ij from the neighbourhood point (p_ij)_00.
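The bilinear interpolation of step 3.3 reduces to a weighted sum over the four integer-grid neighbours of a fractional location; a minimal sketch:

```python
import numpy as np

def bilinear(img, y, x):
    """Bilinearly interpolate img at the fractional location (y, x)
    from its four integer neighbourhood points, as in step 3.3."""
    y0, x0 = int(np.floor(y)), int(np.floor(x))   # top-left neighbour
    y1, x1 = y0 + 1, x0 + 1
    wy, wx = y - y0, x - x0                       # fractional offsets (u, v)
    return ((1 - wy) * (1 - wx) * img[y0, x0]
            + (1 - wy) * wx * img[y0, x1]
            + wy * (1 - wx) * img[y1, x0]
            + wy * wx * img[y1, x1])
```

At the exact midpoint of a 2×2 patch the result is the mean of the four pixel values, and at an integer location it returns that pixel unchanged.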
In step 3.4, the local feature extraction network uses a Resnet-50 model pre-trained on the Kinetics video dataset, with the original fixed-size 7×7 average pooling layer of Resnet-50 replaced by an adaptive average pooling layer, so that the local feature extraction network can extract features from key areas of relatively small size.
In step 4, the global feature vector and the local feature vector are concatenated end to end; the dimension of the fused video-level RGB feature vector is the sum of the dimensions of the global and local feature vectors.
In step 5, the video frames are input one by one into a self-supervised monocular scene depth estimation model, which uses a lightweight U2net encoder-decoder combination for scene depth estimation. The scene depth estimation model is combined with a pose estimation model for self-supervised training, the pose estimation model using a lightweight U2net encoder. The self-supervised training computes errors by reconstructing images and trains the models: the pose estimation model estimates the relative pose transformation matrix T_{t→t′} between the target image and the source image, and the scene depth estimation model predicts the scene depth map D_t of the target image; the encoder weights of the scene depth estimation model are shared with the lightweight U2net encoder of the pose estimation model. Assuming the camera parameters are unchanged in every image, the reconstructed target image K_{t′→t} is computed from the source image via the target scene depth map D_t, the relative pose transformation matrix T_{t→t′} and the camera intrinsic matrix M:

K_{t′→t} = K_{t′}⟨proj(D_t, T_{t→t′}, M)⟩    (4)

where K_{t′→t} is the target image reconstructed from the source image, K_{t′} is the source image, proj() denotes projecting the scene depth map D_t onto the source image K_{t′}, M is the camera intrinsic matrix, D_t is the scene depth map of the target image, T_{t→t′} is the relative pose transformation matrix between the target and source images, and ⟨⟩ denotes the sampling operation.
The reconstruction error L_p is obtained as the L1 distance between the reconstructed target image K_{t′→t} and the actual target image K_t:

L_p = pe(K_t, K_{t′→t}) = ‖K_t − K_{t′→t}‖_1    (5)

where pe() is the photometric reconstruction error, i.e. the L1 distance in pixel space.
The reconstruction error L_p is minimized by gradient descent to optimize the scene depth model parameters. During optimization, data enhancement is performed on the source and target images, specifically by randomly cropping a region of 1/2 or 1/4 of the original size from the center of the original image as new training data. The N×M video frames obtained in step 1 are then input one by one into the optimized model to generate the N×M-frame scene depth maps.
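As a minimal sketch of the supervision signal in step 5 (the L1 photometric error only, not the full depth-and-pose warping pipeline):

```python
import numpy as np

def photometric_l1(target, reconstructed):
    """Mean L1 photometric reconstruction error L_p between the actual
    target frame and the frame reconstructed from the source view,
    as used to train the self-supervised depth model (step 5 sketch)."""
    return float(np.abs(np.asarray(target, float)
                        - np.asarray(reconstructed, float)).mean())
```

A perfect reconstruction gives zero error; gradient descent on this quantity (with respect to the depth and pose network parameters) drives the reconstructed view toward the real target frame.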
Moreover, the RGB features and scene depth features in step 7 are fused adaptively: the RGB input feature vector is normalized to mean 0 and standard deviation 1, and its mean and standard deviation are then modulated using the scene depth information:

f(T_rgb, T_d) = fc_s(T_d) · (T_rgb − μ(T_rgb)) / σ(T_rgb) + fc_b(T_d)    (6)

where f(T_rgb, T_d) is the final video-level feature vector, T_rgb and T_d are the RGB feature vector and the scene depth feature vector respectively, the parameters fc_s and fc_b are learned by fully connected layer networks, and μ(T_rgb), σ(T_rgb) are the mean and standard deviation of the RGB feature vector over its L dimensions.
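The adaptive fusion of step 7 can be sketched as a normalize-then-modulate operation; the fully connected layers fc_s and fc_b are replaced below by plain affine maps with hypothetical weights (w_s, b_s, w_b, b_b), purely for illustration:

```python
import numpy as np

def adaptive_fuse(t_rgb, t_d, w_s, b_s, w_b, b_b, eps=1e-6):
    """Step 7 sketch: normalize the RGB feature vector to zero mean and
    unit standard deviation, then rescale and shift it with a scale
    fc_s(T_d) and bias fc_b(T_d) predicted from the depth features.
    w_s, b_s, w_b, b_b are stand-ins for the learned fc layers."""
    normed = (t_rgb - t_rgb.mean()) / (t_rgb.std() + eps)
    scale = w_s @ t_d + b_s     # plays the role of fc_s(T_d)
    bias = w_b @ t_d + b_b      # plays the role of fc_b(T_d)
    return scale * normed + bias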
In step 8 a small sample classifier, i.e. a prototype network, is trained. A surveillance-video abnormal behavior dataset is used as training data, divided into a support set and a prediction set: N classes are randomly drawn from the dataset, each containing K samples, and the N×K samples are input as the support set; a batch of samples drawn from the remaining data of the N classes serves as the prediction set of the model. When predicting unknown abnormal behaviors, the input data take the same form as during training, and the model learns to infer the labels of the prediction set samples from the support set. The features of the support and prediction set samples are first L2-normalized; for a vector X:

X′ = X / ‖X‖_2 = X / √(Σ_i x_i²)    (7)
Secondly, a trainable encoding network is used, consisting of two fully connected layers with a ReLU activation between them; the first fully connected layer maps 4096-dimensional inputs to 4096-dimensional outputs, and the second maps 4096 dimensions to 1024. Then the mean of the K samples of each class in the support set is computed and taken as the prototype B_i of that class, and the cosine similarity between a prediction set sample A and each prototype B_i is calculated:

sim(A, B_i) = (A · B_i) / (‖A‖ ‖B_i‖)    (8)

All cosine similarities are normalized into probabilities by a softmax function; the type of abnormal behavior is judged from these probabilities, and the abnormal behavior class with the highest probability is taken as the final recognition result.
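The prototype-network decision of step 8 can be sketched end to end (L2 normalization, cosine similarity against per-class prototypes, softmax); the trainable encoding network is omitted here for brevity:

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Eq. (7): X' = X / ||X||_2."""
    x = np.asarray(x, float)
    return x / (np.linalg.norm(x) + eps)

def classify(query, prototypes):
    """Step 8 sketch: prototypes are per-class means of support-set
    features; the query is scored by cosine similarity (Eq. (8)) against
    each prototype, the similarities are softmax-normalized, and the
    class with the highest probability is returned."""
    q = l2_normalize(query)
    sims = np.array([q @ l2_normalize(p) for p in prototypes])
    probs = np.exp(sims - sims.max())
    probs /= probs.sum()
    return int(probs.argmax()), probs
```

Because both the query and the prototypes are L2-normalized first, the dot product directly equals the cosine similarity of equation (8).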
Compared with the prior art, the invention has the following advantages:
the global characteristics of the abnormal behaviors and the local characteristics of key areas containing the main bodies of the abnormal behaviors are fused, and the RGB characteristics and the scene depth characteristics containing the moving objects and the background information are fused, so that the accuracy and the calculation efficiency of the abnormal behavior identification are improved, and the robustness of the monitoring video with multiple moving objects and complex backgrounds is realized.
Drawings
FIG. 1 is a flow chart of an embodiment of the present invention.
Fig. 2 is a structural diagram of a spatiotemporal feature extraction module and a motion feature extraction module according to an embodiment of the present invention.
Fig. 3 is a schematic diagram of a weighted-offset-based critical area selection process according to an embodiment of the present invention.
Fig. 4 is a flowchart of scene depth map extraction according to an embodiment of the present invention.
FIG. 5 is a flow chart of a small sample classifier according to an embodiment of the present invention.
Detailed Description
The invention provides a small sample abnormal behavior identification method based on a key area and scene depth; the technical scheme of the invention is further explained below with reference to the drawings and embodiments.
As shown in fig. 1, the process of the embodiment of the present invention includes the following steps:
step 1, performing random sparse sampling on a video, dividing the video into N sections according to the number of frames, randomly sampling M frames in each section, and taking the total N multiplied by M frames as a representative of the video.
The video is extracted into continuous video frames with the ffmpeg software, the frames are counted via the os library and equally divided into N parts (N is taken as 4 in this embodiment), and M frames are randomly extracted from each part (M is taken as 2 in this embodiment). The N×M frames are further processed through the PIL library: if the width or height of a video frame is less than 224, the shorter side is resized to 224 and the longer side to 256. During training, the video frames are randomly cropped to 224×224 and randomly flipped vertically with 50% probability; when predicting abnormal behaviors in a video frame, only a center crop of size 224×224 is taken. The N×M frames are vectorized and then normalized to obtain the representative of the video.
And 2, performing feature extraction on the video frame generated in the step 1 by using a global feature extraction network to obtain a two-dimensional global feature map and a one-dimensional global feature vector.
Inputting the NxM video frame data obtained in the step 1 or the NxM scene depth data obtained in the step 5 into a Resnet-50 network for feature extraction, obtaining a global feature map from the output of the last convolutional layer of the Resnet-50 network, and obtaining a global feature vector from the input of an average pooling layer. Resnet-50 loads parameters pre-trained from the Kinetics video data set.
And 3, performing weighted offset-based key region selection on the two-dimensional global feature map extracted in the step 2 to obtain a key region containing abnormal behavior bodies in the video frame, and extracting a one-dimensional local feature vector of the key region by using a local feature extraction network.
And 3.1, performing space-time feature extraction and motion feature extraction on the two-dimensional global feature map extracted in the step 2 to generate a feature map with space-time information and object motion information.
And 3.1.1, performing channel mean normalization on the input video frame characteristic diagram, namely averaging a plurality of input channels to obtain single-channel output.
And 3.1.2, performing space-time feature extraction on the normalized feature map obtained in the step 3.1.1.
Firstly, data reconstruction is carried out to realize the dimension conversion. The feature maps of the continuous video frames are input in the form N×T×C×H×W, where N is the number of samples in a batch, T is the time dimension of the video frames, C is the channel dimension, H is the video frame height and W is the video frame width. The time dimension and the channel dimension of the video frame feature map are interchanged so that the subsequent convolution can handle the time dimension together with the two spatial dimensions. The result is then input into a three-dimensional convolution, which extracts the spatio-temporal information of the video frames; data reconstruction is performed again to restore the dimensions, i.e. the time and channel dimensions are interchanged once more; finally, normalization through a Sigmoid function yields the spatio-temporal feature map.
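The two data reconstructions amount to transposing the time and channel axes around the three-dimensional convolution; a numpy sketch with hypothetical sizes N = 1, T = 8, C = 16, H = W = 14:

```python
import numpy as np

x = np.zeros((1, 8, 16, 14, 14))                       # N x T x C x H x W
x_swapped = np.transpose(x, (0, 2, 1, 3, 4))           # N x C x T x H x W
# ... the three-dimensional convolution would operate here on (T, H, W) ...
x_restored = np.transpose(x_swapped, (0, 2, 1, 3, 4))  # swap back: N x T x C x H x W
```

After the first transpose the 3-D convolution sees time as a spatial-like axis alongside H and W; the second transpose restores the original layout for the following stages.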
And 3.1.3, extracting the motion characteristics of the normalized space-time characteristic diagram obtained in the step 3.1.1.
Firstly, the feature maps of the continuous video frames are separated to obtain a feature map for each individual frame; each frame's feature map is then input into a two-dimensional convolution to extract spatial features, and the difference between the two-dimensional convolution outputs of each frame and the adjacent next frame is computed:

X_out = K * X_{t+1} − X_t    (1)

where K represents the parameters of the two-dimensional convolution learned in training, and X_{t+1}, X_t denote the input feature maps of the (t+1)-th and t-th frames respectively;
and finally, connecting all the obtained difference values and carrying out normalization through a Sigmoid function to obtain a motion characteristic diagram.
And 3.1.4, adding the characteristic diagrams output in the steps 3.1.2 and 3.1.3 and the characteristic diagram generated in the step 3.1.1 by using a residual error structure to obtain a characteristic diagram with space-time information and motion information.
And 3.1.5, performing two-dimensional softmax operation on the feature map obtained in the step 3.1.4, namely inputting elements of each row and each column on the two-dimensional feature map into a softmax function, so that all the elements on the two-dimensional feature map are added to be 1.
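The two-dimensional softmax of step 3.1.5 simply feeds every element of the feature map through one softmax so that the whole map sums to 1; a minimal sketch:

```python
import numpy as np

def softmax2d(fmap):
    """Step 3.1.5 sketch: softmax over all H*W positions of a 2-D
    feature map, so the elements form a spatial attention distribution
    that sums to 1 (max subtracted for numerical stability)."""
    fmap = np.asarray(fmap, float)
    e = np.exp(fmap - fmap.max())
    return e / e.sum()
```

The resulting map is exactly the set of weights u_i used by the weighted-offset selection in step 3.2: larger activations receive larger shares of the unit mass.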
And 3.2, selecting the central point of the key area based on the weighted deviation.
Uniformly take L points {a_i | i = 1, 2, …, L} from the original image (the video frame obtained in step 1 or the scene depth map obtained in step 5). The central point o of the original image is pointed to each point a_i, giving the offset vectors v_i = a_i − o. With the elements u_i of the feature map extracted in step 3.1 as weights, all offset vectors are weighted and summed to obtain the sum vector s = Σ_i u_i·v_i; the point that s points to is the central point of the key area.
Step 3.2.1: as shown in Fig. 3(a), since the intercepted key area has a fixed size, its center point can only be selected within a limited range of the original image, namely a square region at the center of the original image whose side length equals the difference between the side length of the original image and the side length of the key area.
Step 3.2.2: as shown in Fig. 3(b), uniformly take L points a_1, a_2, …, a_L from the boundary of the square region selected in step 3.2.1. The number of points L equals the number of elements of the feature map with spatio-temporal information and motion information extracted in step 3.1, and the points correspond one-to-one to the elements of the feature map in left-to-right, top-to-bottom spatial order.

Step 3.2.3: as shown in Fig. 3(c), take the offset vector v_i pointing from the center point of the original image to each point a_i.

Step 3.2.4: as shown in Fig. 3(d), use the corresponding feature map element u_i as the weight of offset vector v_i, and sum the weighted offset vectors to obtain the sum vector V = Σ_i u_i·v_i. The point that V points to is the center point of the key area.
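The center-point selection just described can be sketched as follows. The way the L boundary points are placed and all sizes are illustrative assumptions, not the patent's exact layout:

```python
import numpy as np

def square_boundary_points(half, L):
    """L points spaced uniformly along the perimeter of a square of half-side
    `half` centered at the origin (an illustrative placement of the a_i)."""
    pts = []
    for s in np.linspace(0.0, 8.0 * half, L, endpoint=False):
        k, r = divmod(s, 2.0 * half)
        k = int(k)
        if k == 0:   pts.append((-half + r, -half))   # bottom edge
        elif k == 1: pts.append(( half, -half + r))   # right edge
        elif k == 2: pts.append(( half - r,  half))   # top edge
        else:        pts.append((-half,  half - r))   # left edge
    return np.array(pts)

def key_region_center(image_side, region_side, weights):
    """Weights u_i (the softmax-normalized feature map elements) weight the
    offset vectors v_i from the image center to boundary points a_i; the sum
    vector V points at the key-region center."""
    half = (image_side - region_side) / 2.0        # candidate square half-side
    offsets = square_boundary_points(half, len(weights))   # v_i
    v = (weights[:, None] * offsets).sum(axis=0)           # V = sum_i u_i * v_i
    return np.array([image_side / 2.0, image_side / 2.0]) + v

# with uniform weights the offsets cancel: the center stays at the image center
c = key_region_center(224.0, 112.0, np.full(4, 0.25))
```

When the feature map concentrates its weight on one side, the sum vector pulls the key-region center toward that side, which is the intended behavior of the weighted offset.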
Step 3.3: obtain the pixel values of the other points in the key area by bilinear interpolation.
Since the size of the key area is fixed, the whole key area is determined once its center point m_0 is determined: the other points m_ij on the key area are obtained by translating the center point m_0, i.e.:

m_ij = m_0 + Δm_ij    (2)

where Δm_ij denotes the coordinates of the offset vector of each point on the key area relative to the center point.
Since the coordinates of the center point of the key area obtained by the weighted offset in step 3.2 may contain decimals, the coordinates of the points on the key area are not integers and have no directly corresponding pixel values in the original image. A bilinear interpolation method is therefore adopted: the pixel value g(m_ij) of each point m_ij on the key area is obtained from the four nearest points of the original image adjacent to it, with the specific formula:

g(m_ij) = (1−u)(1−v)·(m_ij)_00 + u(1−v)·(m_ij)_01 + (1−u)v·(m_ij)_10 + uv·(m_ij)_11    (3)

where u and v are the fractional parts of the coordinates of m_ij relative to its four neighborhood points, and (m_ij)_00, (m_ij)_01, (m_ij)_10, (m_ij)_11 are the pixel values of the corresponding four neighborhood points. Because the coordinates of the points on the area are continuous rather than discrete, accurate key areas can be found for abnormal-behavior subjects at different positions.
And 3.4, inputting the key area into a local feature extraction network to obtain the local feature of the key area.
The local feature extraction network likewise uses a Resnet-50 model pre-trained on the Kinetics video data set. It differs from the global feature extraction network in that the original fixed-size 7 average pooling layer of Resnet-50 is replaced with an adaptive average pooling layer, so that the local feature extraction network can extract features from the relatively small key area.
And 4, fusing the global feature vector extracted in the step 2 and the local feature vector extracted in the step 3 to obtain a video-level RGB feature vector.
The local and global features are fused by concatenation: the global feature vector and the local feature vector are connected end to end, so the dimension of the fused video-level RGB feature vector is the sum of the global feature vector dimension and the local feature vector dimension.
And 5, processing the NxM frames generated in the step 1 by using the monocular scene depth estimation model to obtain a corresponding NxM frame scene depth map.
The video frames are input one by one into a self-supervised monocular scene depth estimation model, which performs scene depth estimation with a combination of a lightweight U2net model encoder and decoder. The scene depth estimation model is trained in a self-supervised manner together with another model, using a lightweight U2net encoder, that performs pose estimation. Data enhancement is applied to the input source and target images during training, which makes the extracted scene depth maps robust to noise while still effectively representing scene and target information.
In the self-supervised training, the model is trained on the error computed from image reconstruction. The specific process is as follows. The pose estimation model estimates the relative pose transformation matrix T_(t→t') between the target image and the source image; this model uses a lightweight U2net encoder. Compared with the general U2net model, each residual U block of the lightweight U2net has fewer input, intermediate and output channels, so the model occupies less space and computes faster. The scene depth estimation model then predicts the scene depth map D_t of the target image; it uses a lightweight U2net encoder-decoder combination whose encoder weights are shared with the lightweight U2net encoder of the pose estimation model. From the scene depth map D_t of the target image, the relative pose transformation matrix T_(t→t'), and the camera intrinsic matrix M (assuming the camera intrinsics are constant across images), the reconstructed target image K_(t'→t) is computed from the source image with the formula:
K_(t'→t) = K_(t')[proj(D_t, T_(t→t'), M)]    (4)
where K_(t'→t) is the target image reconstructed from the source image, K_(t') is the source image, proj() denotes projecting the scene depth map D_t onto the source image K_(t'), M is the camera intrinsic matrix, D_t is the scene depth map of the target image, T_(t→t') is the relative pose transformation matrix between the target image and the source image, and [] denotes the sampling operation.
The reconstruction error L_p is obtained by computing the L1 distance between the reconstructed target image K_(t'→t) and the actual target image K_t:

L_p = pe(K_t, K_(t'→t))    (5)

where pe() is the photometric reconstruction error, i.e., the L1 distance in pixel space.
The reconstruction error L_p is minimized with a gradient descent method to optimize the scene depth model parameters. During optimization, data enhancement is applied to the source and target images: a region of 1/2 or 1/4 the size of the original image is randomly cropped from the center of the original image (the source image or the target image) and used as new training data. The N×M video frames obtained in step 1 are then input one by one into the optimized model to generate N×M scene depth maps, which effectively represent the moving objects and the background in the monitored scene.
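The self-supervision signal above reduces to an L1 photometric distance between the reconstructed and actual target images; a minimal sketch, assuming the error is averaged over pixels:

```python
import numpy as np

def photometric_l1(k_reconstructed, k_target):
    """Sketch of the reconstruction error L_p: the L1 distance in pixel space
    between the reconstructed target image K_(t'->t) and the target image K_t,
    averaged over pixels (an assumed reduction)."""
    return np.abs(k_reconstructed - k_target).mean()

k_t = np.zeros((4, 4))           # toy target image
k_rec = np.full((4, 4), 0.5)     # toy reconstructed image
loss = photometric_l1(k_rec, k_t)
```

In practice the reconstruction K_(t'→t) comes from warping the source image through D_t, T_(t→t') and M as in formula (4); here it is a fixed toy array so the loss itself is easy to inspect.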
And 6, repeating the operations from the step 2 to the step 4 on the scene depth map extracted in the step 5 to obtain a video-level scene depth feature vector.
And 7, fusing the video level RGB feature vector extracted in the step 4 and the video level scene depth feature vector extracted in the step 6 to obtain a final video level feature vector.
The RGB features and the scene depth features are fused adaptively: the RGB input feature vector is standardized to mean 0 and standard deviation 1, and its mean and standard deviation are then changed using the scene depth information, computed as follows:
f(T_rgb, T_d) = fc_s(T_d) · (T_rgb − μ(T_rgb)) / σ(T_rgb) + fc_b(T_d)    (6)

where f(T_rgb, T_d) is the final video-level feature vector, T_rgb and T_d represent the RGB feature vector and the scene depth feature vector respectively, the parameters fc_s and fc_b are learned by fully connected networks, and μ(T_rgb), σ(T_rgb) represent the mean and standard deviation of the RGB feature vector over its L dimensions.
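The adaptive fusion described above can be sketched as follows. The matrices standing in for the fully connected layers fc_s and fc_b are random hypothetical weights, not trained parameters:

```python
import numpy as np

# Sketch of the adaptive RGB / scene-depth fusion: standardize the RGB feature
# vector to mean 0 and std 1, then re-scale and re-shift it with values derived
# from the scene depth feature through two linear maps (stand-ins for the
# trained fully connected layers fc_s and fc_b).
rng = np.random.default_rng(0)
L = 8
t_rgb = rng.normal(size=L)                               # RGB feature vector
t_d = rng.normal(size=L)                                 # scene depth feature vector

w_s, b_s = rng.normal(size=(L, L)), rng.normal(size=L)   # hypothetical fc_s weights
w_b, b_b = rng.normal(size=(L, L)), rng.normal(size=L)   # hypothetical fc_b weights

scale = w_s @ t_d + b_s                                  # fc_s(T_d)
shift = w_b @ t_d + b_b                                  # fc_b(T_d)
standardized = (t_rgb - t_rgb.mean()) / t_rgb.std()      # mean 0, std 1
fused = scale * standardized + shift                     # final video-level feature
```

The design lets the depth stream modulate the statistics of the RGB stream rather than simply being concatenated with it, so the fused vector keeps the RGB dimensionality.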
And 8, inputting the video-level feature vector obtained in the step 7 into a small sample classifier to obtain a final abnormal behavior identification result.
First, the small sample classifier, i.e., a prototype network, is trained with a surveillance video abnormal behavior data set as training data. The training data are divided into a support set and a prediction set: N classes are randomly drawn from the data set, and K samples of each class (N×K samples in total) are input as the support set; a batch of samples drawn from the remaining data of the N classes serves as the prediction set of the model. When predicting unknown abnormal behaviors, the input data take the same form as during training, and the model learns to judge the labels of the prediction set samples through the support set. The features of the support set and prediction set samples are first L2-normalized; the L2 normalization formula for a vector X is:
X_norm = X / ||X||_2    (7)
The features then pass through a trainable encoding network consisting of two fully connected layers with a Relu activation function between them; the first fully connected layer maps a 4096-dimensional input vector to a 4096-dimensional output, and the second maps a 4096-dimensional input to a 1024-dimensional output. The encoding network reduces the dimension of the feature vectors and further improves their representational power: the two fully connected layers increase the number of trainable parameters, and the activation function between them strengthens the nonlinear expressive capability. Then the mean of the K samples of each class in the support set is computed and taken as the prototype B_i of that class, and the cosine similarity between each prediction set sample A and each prototype B_i is calculated with the formula:
sim(A, B_i) = (A · B_i) / (||A|| ||B_i||)    (8)
Finally, the normalized probabilities of all cosine similarities are obtained through a softmax function; the type of abnormal behavior is judged from these probabilities, and the abnormal behavior with the largest probability is taken as the final recognition result.
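The prototype-network classification of step 8 can be sketched as below, assuming the encoding network has already produced the feature vectors; shapes and data are illustrative:

```python
import numpy as np

def l2_normalize(x):
    """L2 normalization, as in formula (7)."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def classify(support, query):
    """support: (N_classes, K, D) features; query: (D,) feature.
    Average each class's K support samples into a prototype B_i, score the
    query A by cosine similarity, and normalize the scores with softmax."""
    support = l2_normalize(support)
    prototypes = support.mean(axis=1)                      # B_i, one per class
    q = l2_normalize(query)
    sims = np.array([q @ b / (np.linalg.norm(b) + 1e-12) for b in prototypes])
    probs = np.exp(sims) / np.exp(sims).sum()              # softmax
    return int(probs.argmax()), probs

support = np.stack([np.tile([1.0, 0.0], (3, 1)),           # class 0: K=3 samples
                    np.tile([0.0, 1.0], (3, 1))])          # class 1: K=3 samples
label, probs = classify(support, np.array([0.9, 0.1]))     # query near class 0
```

Because classification reduces to a nearest-prototype comparison, new abnormal-behavior classes can be recognized from only K labeled examples without retraining the feature extractor.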
The specific embodiments described herein are merely illustrative of the spirit of the invention. Various modifications or additions may be made to the described embodiments or alternatives may be employed by those skilled in the art without departing from the spirit or ambit of the invention as defined in the appended claims.

Claims (10)

1. A small sample abnormal behavior identification method based on key areas and scene depths is characterized by comprising the following steps:
step 1, sparsely and randomly sampling a video: dividing the video into N sections according to the number of frames, randomly sampling M frames in each section, and taking the total N×M frames as a representative of the video;
step 2, performing feature extraction on the video frame generated in the step 1 by using a global feature extraction network to obtain a two-dimensional global feature map and a one-dimensional global feature vector;
step 3, performing weighted offset-based key region selection on the two-dimensional global feature map extracted in the step 2 to obtain a key region containing abnormal behavior bodies in the video frame, and extracting a one-dimensional local feature vector of the key region by using a local feature extraction network;
step 3.1, performing space-time feature extraction and motion feature extraction on the two-dimensional global feature map extracted in the step 2 to generate a feature map with space-time information and object motion information;
3.2, selecting the central point of the key area based on the weighted deviation;
3.3, obtaining pixel values of other points in the key area by utilizing bilinear interpolation;
step 3.4, inputting the key area into a local feature extraction network to obtain the local feature of the key area;
step 4, fusing the global feature vector extracted in the step 2 and the local feature vector extracted in the step 3 to obtain a video-level RGB feature vector;
connecting the global feature vector and the local feature vector end to end, wherein the dimension of the video-level RGB feature vector after fusion is the sum of the dimension of the global feature vector and the dimension of the local feature vector;
step 5, processing the NxM frames generated in the step 1 by using a monocular scene depth estimation model to obtain a corresponding NxM frame scene depth map;
step 6, repeating the operations from the step 2 to the step 4 on the scene depth map extracted in the step 5 to obtain a video-level scene depth feature vector;
step 7, fusing the video level RGB feature vector extracted in the step 4 and the video level scene depth feature vector extracted in the step 6 to obtain a final video level feature vector;
and 8, inputting the video-level feature vector obtained in the step 7 into a small sample classifier to obtain a final abnormal behavior identification result.
2. The method for identifying the abnormal behavior of the small sample based on the key area and the scene depth as claimed in claim 1, wherein: in the step 1, firstly, a video is extracted into continuous video frames through ffmpeg software, the number of the video frames is counted through an os library, the video frames are equally divided into N parts, and M frames are randomly extracted from each part; the nxm frames are then further processed through the PIL library: if the width or height of the video frame is less than a, the size of the shorter side is adjusted to a, and the size of the longer side is adjusted to b; when the video frame is used for training, random position cutting is carried out on the video frame, the cutting size is a multiplied by a, and random vertical turning is carried out with the probability of 50%; when abnormal behaviors in a video frame are predicted, only the center position of the video frame is cut, and the cutting size is a multiplied by a; and finally, vectorizing the N multiplied by M frames and then normalizing the N multiplied by M frames to be used as a representative of the video.
3. The method for identifying the abnormal behavior of the small sample based on the key area and the scene depth as claimed in claim 1, wherein: and 2, inputting the NxM video frame data obtained in the step 1 into a Resnet-50 network for feature extraction, obtaining a global feature map from the output of the last convolutional layer of the Resnet-50 network, obtaining a global feature vector from the input of an average pooling layer, and loading parameters obtained by pre-training on a Kinetics video data set by the Resnet-50 network.
4. The method for identifying the abnormal behavior of the small sample based on the key area and the scene depth as claimed in claim 3, characterized in that: step 3.1 comprises the following steps:
step 3.1.1, channel mean normalization is carried out on the input video frame feature map, namely, the input multiple channels are averaged to obtain single-channel output;
step 3.1.2, performing space-time feature extraction on the normalized feature map obtained in the step 3.1.1;
firstly, data reconstruction is carried out, the time dimension and the channel dimension of a video frame feature graph are exchanged through the data reconstruction, then three-dimensional convolution is input, the three-dimensional convolution network can extract the spatio-temporal information of a video frame, then the data reconstruction is carried out again to recover the dimension, namely, the time dimension and the channel dimension are exchanged again, and finally the normalization is carried out through a Sigmoid function to obtain a spatio-temporal feature graph;
step 3.1.3, extracting the motion characteristics of the normalized spatio-temporal characteristic diagram obtained in the step 3.1.1;
firstly, time dispersion is carried out, namely feature maps represented by continuous video frames are separated to obtain feature maps represented by each frame, then the feature maps represented by the frames are respectively input into a two-dimensional convolution to extract spatial features, and the difference of two-dimensional convolution output of each frame and an adjacent next frame is solved, namely:
X_out = K * X_(t+1) − X_t    (1)

where K represents the parameters learned by training the two-dimensional convolution, and X_(t+1), X_t represent the feature maps input at frame t+1 and frame t, respectively;
finally, connecting all the obtained difference values and normalizing the difference values through a Sigmoid function to obtain a motion characteristic graph;
step 3.1.4, adding the feature maps output in the step 3.1.2 and the step 3.1.3 and the feature map generated in the step 3.1.1 by using a residual error structure to obtain a feature map with space-time information and motion information;
and 3.1.5, performing two-dimensional softmax operation on the feature map obtained in the step 3.1.4, namely inputting elements of each row and each column on the two-dimensional feature map into a softmax function, so that all the elements on the two-dimensional feature map are added to be 1.
5. The method for identifying the abnormal behavior of the small sample based on the key area and the scene depth as claimed in claim 1, wherein: step 3.2 is to uniformly take L points a_1, a_2, …, a_L from the original image, take the offset vector v_i pointing from the center point of the original image to each point a_i, and, using the elements u_i of the feature map extracted in step 3.1 as weights, weight and sum all the offset vectors to obtain the sum vector V = Σ_i u_i·v_i; the point that V points to is the center point of the key area, comprising the following steps:
step 3.2.1, selecting a central point of the key area in a square area with the side length being the difference between the side length of the original image and the side length of the key area and being positioned at the center of the original image;
step 3.2.2, uniformly taking L points from the boundary of the square area selected in the step 3.2.1
Figure FDA0003783402150000035
The number L of the points is the same as the number of elements on the feature map with the spatio-temporal information and the motion information extracted in the step 3.1, and the points respectively correspond to the elements on the feature map from left to right and from top to bottom in the space;
step 3.2.3, the central point of the original picture is taken to point to each point a i Is an offset vector
Figure FDA0003783402150000036
Step 3.2.4, corresponding element u on the characteristic diagram i As weights for offset vectors, displacement vectors
Figure FDA0003783402150000037
Summing after weighting to obtain sum vector
Figure FDA0003783402150000038
Sum vector
Figure FDA0003783402150000039
The pointed point is the center point of the key area.
6. The method for identifying the abnormal behavior of the small sample based on the key area and the scene depth as claimed in claim 1, wherein: step 3.3 first translates the center point m_0 of the key area to obtain the other points m_ij on the key area, i.e.:

m_ij = m_0 + Δm_ij    (2)

where Δm_ij denotes the coordinates of the offset vector of each point on the key area relative to the center point;

then the pixel value g(m_ij) of each point m_ij on the key area is obtained by bilinear interpolation, with the specific formula:

g(m_ij) = (1−u)(1−v)·(m_ij)_00 + u(1−v)·(m_ij)_01 + (1−u)v·(m_ij)_10 + uv·(m_ij)_11    (3)

where u and v are the fractional parts of the coordinates of m_ij relative to its four neighborhood points, and (m_ij)_00, (m_ij)_01, (m_ij)_10, (m_ij)_11 are the pixel values of the corresponding four neighborhood points.
7. The method for identifying the abnormal behavior of the small sample based on the key area and the scene depth as claimed in claim 1, wherein: in step 3.4, the local feature extraction network uses a Resnet-50 model pre-trained on a Kinetics video data set, and changes an average pooling layer of the Resnet-50 with an original fixed step length of 7 into an adaptive average pooling layer so as to realize feature extraction of the local feature extraction network on a key area with a relatively small size.
8. The method for identifying the abnormal behavior of the small sample based on the key area and the scene depth as claimed in claim 1, wherein: in step 5, the video frames are input one by one into a self-supervised monocular scene depth estimation model, which performs scene depth estimation with a combination of a lightweight U2net model encoder and decoder; the scene depth estimation model is combined for self-supervised training with a pose estimation model that uses a lightweight U2net model encoder, the self-supervised training training the model on errors computed by reconstructing images; the specific process comprises estimating the relative pose transformation matrix T_(t→t') between the target image and the source image by using the pose estimation model, and predicting the scene depth map D_t of the target image by using the scene depth estimation model, whose encoder weights are shared with the lightweight U2net encoder of the pose estimation model; assuming that the camera intrinsics of each image are unchanged, the reconstructed target image K_(t'→t) is computed from the source image using the scene depth map D_t of the target image, the relative pose transformation matrix T_(t→t'), and the camera intrinsic matrix M, with the calculation formula:
K_(t'→t) = K_(t')[proj(D_t, T_(t→t'), M)]    (4)
where K_(t'→t) is the reconstructed target image computed from the source image, K_(t') is the source image, proj() denotes projecting the scene depth map D_t onto the source image K_(t'), M is the camera intrinsic matrix, D_t is the scene depth map of the target image, T_(t→t') is the relative pose transformation matrix between the target image and the source image, and [] denotes a sampling operation;
the reconstruction error L_p is obtained by calculating the L1 distance between the reconstructed target image K_(t'→t) and the actual target image K_t:

L_p = pe(K_t, K_(t'→t))    (5)

where pe() is the photometric reconstruction error, i.e., the L1 distance in pixel space;
the reconstruction error L_p is minimized with a gradient descent method to optimize the scene depth model parameters; during optimization, data enhancement is performed on the source and target images, specifically by randomly cropping from the center of the original image a region of 1/2 or 1/4 the size of the original image as new training data; the N×M video frames obtained in step 1 are then input one by one into the optimized model to generate the N×M-frame scene depth map.
9. The method for identifying the abnormal behavior of the small sample based on the key area and the scene depth as claimed in claim 1, wherein: in step 7, a self-adaptive fusion mode is adopted for the fusion mode of the RGB features and the scene depth features, the specific process is to standardize RGB input feature vectors to enable the mean value to be 0 and the standard deviation to be 1, then the mean value and the standard deviation are changed by using scene depth information, and the calculation mode is as follows:
f(T_rgb, T_d) = fc_s(T_d) · (T_rgb − μ(T_rgb)) / σ(T_rgb) + fc_b(T_d)    (6)

where f(T_rgb, T_d) is the final video-level feature vector, T_rgb and T_d represent the RGB feature vector and the scene depth feature vector respectively, the parameters fc_s and fc_b are learned by a fully connected network, and μ(T_rgb), σ(T_rgb) represent the mean and standard deviation of the RGB feature vector over its L dimensions.
10. The method for identifying the abnormal behavior of the small sample based on the key area and the scene depth as claimed in claim 1, wherein: in step 8, training a small sample classifier, namely a prototype network, using a monitoring video abnormal behavior data set as training data, dividing the training data into a support set and a prediction set, randomly extracting N classes in the data set, wherein each class comprises K samples, and N multiplied by K data in total are input as the support set; extracting a batch of samples from the residual data in the N classes to be used as a prediction set of the model; when unknown abnormal behaviors are predicted, the form of input data is the same as that during training, and the model learns to judge the label of a prediction set sample through a support set; the characteristics of the support set and prediction set samples are firstly normalized by L2, and the L2 normalization formula of a vector X is as follows:
X_norm = X / ||X||_2    (7)
the features then pass through a trainable encoding network consisting of two fully connected layers with a Relu activation function between them; the first fully connected layer has an input vector dimension of 4096 and an output vector dimension of 4096, and the second fully connected layer has an input vector dimension of 4096 and an output vector dimension of 1024; then the mean of the K samples of each class in the support set is computed and taken as the prototype B_i of each class, and the cosine similarity between the prediction set sample A and each prototype B_i is calculated with the formula:
sim(A, B_i) = (A · B_i) / (||A|| ||B_i||)    (8)
and obtaining the normalized probability of all cosine similarity through a softmax function, judging the type of the abnormal behavior according to the probability, and taking the abnormal behavior with the maximum corresponding probability as a final recognition result.
CN202210936032.7A 2022-08-05 2022-08-05 Small sample abnormal behavior identification method based on key region and scene depth Pending CN115439926A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210936032.7A CN115439926A (en) 2022-08-05 2022-08-05 Small sample abnormal behavior identification method based on key region and scene depth

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210936032.7A CN115439926A (en) 2022-08-05 2022-08-05 Small sample abnormal behavior identification method based on key region and scene depth

Publications (1)

Publication Number Publication Date
CN115439926A true CN115439926A (en) 2022-12-06

Family

ID=84243155

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210936032.7A Pending CN115439926A (en) 2022-08-05 2022-08-05 Small sample abnormal behavior identification method based on key region and scene depth

Country Status (1)

Country Link
CN (1) CN115439926A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117392615A (en) * 2023-12-12 2024-01-12 南昌理工学院 Anomaly identification method and system based on monitoring video
CN117392615B (en) * 2023-12-12 2024-03-15 南昌理工学院 Anomaly identification method and system based on monitoring video

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination