CN116386145B - Method for identifying abnormal behaviors of personnel in bank based on double cameras - Google Patents


Info

Publication number
CN116386145B
CN116386145B CN202310407090.5A
Authority
CN
China
Prior art keywords
pose
interaction
human body
features
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310407090.5A
Other languages
Chinese (zh)
Other versions
CN116386145A (en)
Inventor
缪仁亮
王冬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
ZHEJIANG FINANCIAL COLLEGE
Original Assignee
ZHEJIANG FINANCIAL COLLEGE
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by ZHEJIANG FINANCIAL COLLEGE filed Critical ZHEJIANG FINANCIAL COLLEGE
Priority to CN202310407090.5A priority Critical patent/CN116386145B/en
Publication of CN116386145A publication Critical patent/CN116386145A/en
Application granted granted Critical
Publication of CN116386145B publication Critical patent/CN116386145B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/52Surveillance or monitoring of activities, e.g. for recognising suspicious objects


Abstract

The application discloses a method for identifying abnormal behaviors of personnel in a bank based on two cameras installed in the bank at different angles, which photograph the personnel inside. A first camera acquires a first view of the image; this view is processed into a first-view picture and sent to a pose mask module to obtain pose mask features. A second camera acquires a second view of the image; this view is processed into a second-view picture and sent to a pose coding module to obtain pose coding features. The pose mask features and pose coding features are input into a pose interaction module to obtain pose interaction features, which are then input into a human action prediction module to detect human actions. The overall scheme identifies abnormal behaviors of personnel in the bank, completes the detection with high accuracy, and helps ensure the safety of people inside the bank.

Description

Method for identifying abnormal behaviors of personnel in bank based on double cameras
Technical Field
The application relates to the technical field of gesture recognition, in particular to a method for recognizing abnormal behaviors of personnel in a bank based on double cameras.
Background
Action recognition is a natural extension of image classification to the video domain. Current deep learning algorithms already exceed average human accuracy on image classification, but progress in action recognition has not been as remarkable as in image classification.
At present, action recognition is widely applied; for example, adding an action-recognition alarm to a bank monitoring area enables real-time monitoring. Image data captured by the cameras is analyzed automatically and promptly, and once an abnormal situation is found, the relevant personnel can be notified quickly, so that extreme situations are avoided.
In current abnormal-behavior detection in banks, the real-time performance and accuracy of the related algorithms are difficult to reconcile, and most bank anomaly detection is performed with a single camera, so the detection effect still needs improvement.
Disclosure of Invention
In view of these problems, the application provides a method for identifying abnormal behaviors of personnel in a bank based on two cameras, which detects abnormal behaviors in the bank with high accuracy, runs quickly, and effectively solves the problems described in the background.
In order to achieve the above purpose, the present application provides the following technical solutions:
a method for identifying abnormal behaviors of personnel in a bank based on two cameras, characterized by comprising the following steps:
two cameras, a first camera and a second camera, are arranged at different angles inside the bank;
shooting a person in a bank by using two cameras;
acquiring a first view angle of an image by using a first camera, processing the first view angle of the image to obtain a first view angle picture, and sending the first view angle picture to a pose mask module to obtain pose mask characteristics;
acquiring a second visual angle of the image by using a second camera, processing the second visual angle of the image to obtain a second visual angle picture, and sending the second visual angle picture to a pose coding module to obtain pose coding characteristics;
inputting the obtained pose mask features and the pose coding features into a pose interaction module for interaction to obtain pose interaction features;
the obtained pose interaction characteristics are input into a human body action prediction module, human body actions are detected, and abnormal behaviors of people in a bank are identified.
As a preferable technical scheme, the application also comprises a human body posture prediction module;
the human body posture prediction module is formed by deconvolution, acquires the pose interaction characteristics, recovers the pose interaction characteristics by deconvolution, aligns the pose interaction characteristics with joint characteristics of the human body posture to obtain skeleton characteristics of the human body posture, and detects the human body posture;
the human body action prediction module acquires the human body posture framework characteristics and fuses the human body posture framework characteristics and the pose interaction characteristics;
in the process of fusing the human body posture framework features and the pose interaction features, the human body posture framework features are subjected to convolution downsampling by means of convolution, the dimensions of the human body posture framework features and the pose interaction features are aligned, and matrix addition operation is utilized to carry out addition fusion on the human body posture framework features and the pose interaction features.
As a preferred technical scheme of the application, the pose masking module obtains a first view angle picture, and performs masking operation on the first view angle picture, wherein the masking operation process at least comprises convolution with a convolution kernel of 3×3;
the pose coding module acquires a second view angle picture and performs coding operation on the second view angle picture, wherein the coding operation process consists of a ResNet network, and after an image is coded, a dimension flattening unit Flatten is utilized for flattening to obtain pose coding characteristics;
taking the pose mask features as a query sequence Q in an interaction attention matrix;
the pose coding features are used as a keyword sequence K and a value sequence V in an interaction attention matrix;
respectively carrying out position coding on the query sequence and the keyword sequence;
and inputting the query sequence and the keyword sequence after the value sequence and the position code into a pose interaction module to realize interaction attention operation and obtain pose interaction characteristics.
As a preferred technical scheme of the application, the pose coding module adopts a residual module composed of ResNet18. After the second-view picture is obtained, its resolution is 1/256 of the resolution of the second view of the image, so the second-view picture is a low-resolution picture. The low-resolution picture is convolutionally encoded to obtain the low-resolution features of the whole second-view picture, which are flattened by the dimension flattening unit Flatten to obtain the pose coding features.
As a preferred technical scheme of the application, the pose mask module comprises two convolutions with 3×3 convolution kernels and a slicing unit;
the masking operation of the pose masking module comprises the following steps:
after the pose mask module obtains a first view angle picture, the resolution of the first view angle picture is 1/64 of the resolution of the first view angle of the image, and the obtained first view angle picture is a high resolution picture;
convolving and slicing the high-resolution picture with a 3×3 convolution and the slicing unit, respectively, to obtain coarse mask information and slice features;
the coarse mask information is flattened by Flatten to match the slice feature dimensions, and coarse-mask matching is performed on the slice features, realizing the slice-feature mask operation and obtaining intermediate mask features;
convolving the coarse mask information with the other 3×3 convolution to obtain fine mask information, flattening it with Flatten to match the size of the intermediate mask features, and performing fine-mask matching on the intermediate mask features, realizing the intermediate-mask-feature mask operation and obtaining the pose mask features;
the pose mask features are high-resolution features of the whole image after masking.
As a preferred technical scheme of the application, the pose interaction module comprises a spatially-aware interaction attention, Spatial-perception Multi-Head Cross-Attention (S-MHCA), and a Multilayer Perceptron (MLP);
the spatially-aware interaction focus comprises a spatially-aware unit F;
after acquiring the value sequence and the position-encoded query and keyword sequences, matrix calculation is performed on the query sequence Q and the keyword sequence K, and the spatial perception unit F is applied to the result to obtain the spatial perception feature N:
N = F(QK^T)
the spatially-aware interaction attention computes the value sequence V against the spatial perception feature N to obtain the spatially-aware interaction feature M:
M = softmax(N / √D_h) · V
where D_h is a constant, set to 256 in the present application. The spatially-aware interaction feature M is input into the multilayer perceptron, which processes the different channels of the feature to obtain the pose interaction features.
And the space perception interaction attention and the multi-layer perception machine are connected by adopting residual errors.
As a preferred technical solution of the present application, the spatial perception unit includes a convolution with a convolution kernel of 1×1;
after the matrix calculation is carried out on the query sequence Q and the keyword sequence K, matrix calculation characteristics are obtained, and the space perception unit carries out the following processing on the matrix calculation characteristics:
performing layer normalization (LayerNorm) on the matrix calculation features to obtain normalized information features;
applying dimension conversion to the normalized information features, convolving with a 1×1 kernel, then applying GELU activation and dimension conversion to obtain spatial perception features of the same size as the matrix calculation features;
the 1×1 convolution yields features of the same size as the second-view picture. Since the keyword sequence K and value sequence V both come from the low-resolution second-view picture, dimension conversion unfolds the space from one dimension to two, improving the spatial perception of the image.
As a preferred technical scheme of the application, during network training the human body posture prediction module computes a Mean Squared Error (MSE) loss between the obtained human-posture skeleton features and the skeleton ground truth to obtain Loss2;
the human action prediction module computes a Cross Entropy loss between the detected human action and the action ground truth to obtain Loss1;
backward gradient propagation is performed using the following formula to complete the training process:
Loss = α·Loss1 + β·Loss2
where α and β are values between 0 and 1.
Compared with the prior art, the application has the beneficial effects that:
1. The same user is detected by two cameras at different angles. One stream is processed into a high-resolution picture and masked by the pose mask module to extract the query sequence Q; masking removes background interference and speeds up image processing. The other stream is processed into a low-resolution picture and encoded by the pose coding module to obtain pose coding features, improving running speed. The pose mask features and pose coding features are then fed to the spatially-aware interaction attention we designed, improving the model's running speed while guaranteeing accuracy.
2. Existing Transformer models have been applied to human behavior prediction, but they cannot jointly process two images taken from different angles and struggle to run in real time. The method uses an interaction (Cross-Attention) mechanism: at the input of the pose interaction module, the same pose captured from different angles (the first-view and second-view pictures) is separately encoded and masked. The pose coding features effectively preserve the overall information, the pose mask module effectively attends to the pose of personnel in the bank, and both are jointly input to the Cross-Attention, improving action-recognition accuracy while guaranteeing running speed.
3. Dual-loss supervision predicts both the human posture and the behavior, making the bank's abnormal-behavior prediction more accurate. The human body posture prediction module serves as an intermediate supervision stage, effectively improving the accuracy of action prediction and making the training process more robust.
Drawings
FIG. 1 is a schematic flow chart of the method of the present application;
FIG. 2 is a schematic flow diagram of each processing module of the method of the present application;
FIG. 3 is a schematic diagram of a pose mask module according to the method of the present application;
FIG. 4 is a second schematic diagram of a pose mask module according to the method of the present application;
FIG. 5 is a schematic diagram of the spatially aware interaction of attention according to the method of the present application;
FIG. 6 is a diagram showing the result of the recognition of the motion of the method of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
Examples:
referring to fig. 1 to 6, the present application provides a technical solution: a recognition method of abnormal behaviors of personnel in a bank based on double cameras comprises two cameras, a first camera and a second camera, wherein the two cameras are arranged in the bank at different angles;
shooting a person in a bank by using two cameras;
acquiring a first view angle of an image by using a first camera, processing the first view angle of the image to obtain a first view angle picture, and sending the first view angle picture to a pose mask module to obtain pose mask characteristics;
acquiring a second visual angle of the image by using a second camera, processing the second visual angle of the image to obtain a second visual angle picture, and sending the second visual angle picture to a pose coding module to obtain pose coding characteristics;
inputting the obtained pose mask features and the pose coding features into a pose interaction module for interaction to obtain pose interaction features;
the obtained pose interaction characteristics are input into a human body action prediction module, human body actions are detected, and abnormal behaviors of people in a bank are identified.
In our method, the first view and the second view of the image are processed with 5×5 convolution kernels. A large convolution captures as much relevant image information as possible, since the kernel size determines the receptive field; different strides and padding operations are adopted to obtain first-view and second-view pictures of different sizes.
In the application, the image size of the first-view picture is [C, H/8, W/8] and that of the second-view picture is [C, H/16, W/16], where C is the number of image channels and W and H are the image width and height. The processed first-view and second-view pictures therefore differ in size, providing pictures of different sizes for subsequent encoding and masking, which improves the network's running speed while guaranteeing recognition accuracy.
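As a numerical check of how a single 5×5 convolution with different strides can produce the two picture sizes above, the standard convolution output-size formula can be evaluated. The input resolution 256 and the stride/padding values below are hypothetical choices for illustration; the patent does not state them:

```python
def conv_out_size(n, k, s, p):
    """Standard convolution output-size formula: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * p - k) // s + 1

H = W = 256  # hypothetical input resolution
# 5x5 kernel, stride 8, padding 2 -> first-view picture of size H/8
first = conv_out_size(H, k=5, s=8, p=2)
# same 5x5 kernel, stride 16, padding 2 -> second-view picture of size H/16
second = conv_out_size(H, k=5, s=16, p=2)
print(first, second)  # 32 16
```

With these assumed strides the two branches yield exactly H/8 and H/16, matching the [C, H/8, W/8] and [C, H/16, W/16] sizes stated above.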
In one embodiment of the application, the human body posture prediction system further comprises a human body posture prediction module;
the human body posture prediction module is formed by deconvolution, acquires the pose interaction characteristics, recovers the pose interaction characteristics by deconvolution, aligns the pose interaction characteristics with joint characteristics of the human body posture to obtain skeleton characteristics of the human body posture, and detects the human body posture;
the human body action prediction module acquires the human body posture framework characteristics and fuses the human body posture framework characteristics and the pose interaction characteristics;
in the process of fusing the human body posture framework features and the pose interaction features, the human body posture framework features are subjected to convolution downsampling by means of convolution, the dimensions of the human body posture framework features and the pose interaction features are aligned, and matrix addition operation is utilized to carry out addition fusion on the human body posture framework features and the pose interaction features.
Deconvolution is implemented with the Deconv module from "Simple Baselines for Human Pose Estimation and Tracking". Deconvolution restores the pose interaction features so that the image size matches the height and width of the original image; the result can then be compared with the ground truth in the dataset, and gradient back-propagation is performed after supervision.
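The size-restoring behaviour of such Deconv stages can be sketched with the transposed-convolution output-size formula. Kernel 4, stride 2, padding 1 are the values used in the Simple Baselines Deconv module (each stage doubles the spatial size); the starting size 16 and the number of stages are hypothetical:

```python
def deconv_out_size(n, k=4, s=2, p=1):
    """Transposed-convolution output size: (n - 1)*s - 2p + k (2x upsampling here)."""
    return (n - 1) * s - 2 * p + k

n = 16                 # hypothetical pose-interaction feature-map size
for _ in range(3):     # three Deconv stages, as in Simple Baselines
    n = deconv_out_size(n)
print(n)  # 128
```

Three stages take a 16×16 feature map back to 128×128, i.e. an 8x recovery of height and width toward the original image resolution.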
In the application, the output of the human body posture skeleton feature can further ensure the accuracy of motion prediction, double-layer supervision is realized, and compared with the previous supervision mode, the model can ensure more effective motion recognition of personnel in a bank in the training process, and the supervision of the human body posture skeleton feature is added to ensure the detection accuracy of abnormal behaviors in the bank.
In addition, in the network operation process, the dimension alignment of the human body posture framework features and the pose interaction features is completed by utilizing convolution with the convolution kernel size of 3×3, so that the dimension consistency of the human body posture framework features and the pose interaction features is ensured, the network can be trained normally, the human body posture framework features are used as intermediate supervision features to be added into the pose interaction features, and the effectiveness of pose interaction feature training is further improved.
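The fusion step described above (a 3×3 strided convolution aligns the skeleton features with the pose interaction features, then matrix addition fuses them) can be sketched minimally in NumPy. All sizes, the random weights, and the single-channel simplification are illustrative stand-ins, not the patent's actual dimensions:

```python
import numpy as np

def conv2d(x, w, stride=2, pad=1):
    """Naive single-channel convolution with stride and padding (illustrative only)."""
    x = np.pad(x, pad)
    k = w.shape[0]
    out_h = (x.shape[0] - k) // stride + 1
    out_w = (x.shape[1] - k) // stride + 1
    y = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # sum of the kernel applied to each strided window
            y[i, j] = np.sum(x[i*stride:i*stride+k, j*stride:j*stride+k] * w)
    return y

rng = np.random.default_rng(0)
skeleton = rng.random((32, 32))      # hypothetical human-posture skeleton feature map
interaction = rng.random((16, 16))   # hypothetical pose interaction feature map (half size)
w = rng.random((3, 3))               # 3x3 kernel used for the downsampling alignment

aligned = conv2d(skeleton, w, stride=2, pad=1)  # downsample 32x32 -> 16x16
fused = aligned + interaction                   # element-wise (matrix) addition fusion
print(fused.shape)  # (16, 16)
```

The stride-2 3×3 convolution with padding 1 halves each spatial dimension, so the two feature maps line up for the additive fusion.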
In one embodiment of the present application, the pose masking module obtains a first view image, and performs masking operation on the first view image, where the masking operation process at least includes convolution with a convolution kernel of 3×3;
the pose coding module acquires a second view angle picture and performs coding operation on the second view angle picture, wherein the coding operation process consists of a ResNet network, and after an image is coded, a dimension flattening unit Flatten is utilized for flattening to obtain pose coding characteristics;
taking the pose mask features as a query sequence Q in an interaction attention matrix;
the pose coding features are used as a keyword sequence K and a value sequence V in an interaction attention matrix;
respectively carrying out position coding on the query sequence and the keyword sequence;
and inputting the query sequence and the keyword sequence after the value sequence and the position code into a pose interaction module to realize interaction attention operation and obtain pose interaction characteristics.
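The Q/K/V assignment above (Q from the masked first view, K and V from the encoded second view, with positional encoding added to Q and K) is a standard cross-attention layout, sketched here in NumPy. The sequence lengths, feature width, and random positional encodings are hypothetical stand-ins:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q, k, v):
    """Cross-attention: queries from one view attend over keys/values of the other."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)       # similarity between the two views
    return softmax(scores, axis=-1) @ v  # weighted sum of the second view's values

rng = np.random.default_rng(0)
mask_feat = rng.normal(size=(64, 256))   # pose mask features  -> query sequence Q
code_feat = rng.normal(size=(16, 256))   # pose coding features -> K and V
pos_q = rng.normal(size=mask_feat.shape) * 0.02  # positional encoding (stand-in values)
pos_k = rng.normal(size=code_feat.shape) * 0.02

out = cross_attention(mask_feat + pos_q, code_feat + pos_k, code_feat)
print(out.shape)  # (64, 256)
```

Note that only Q and K receive positional encoding, matching the step above; V enters the attention unmodified.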
At present, most Transformer methods use the query sequence Q, keyword sequence K and value sequence V for self-attention, as in "Attention Is All You Need", ViT, and the recent action-prediction paper "AIM: Adapting Image Models for Efficient Video Action Recognition". This mode essentially ignores the correlation between different pictures, whereas the two cameras acquire images at different angles; with self-attention alone, it is difficult to guarantee the correlation between the two angles.
In an embodiment of the present application, the pose coding module uses a residual module composed of ResNet18. After the second-view picture is obtained, its resolution is 1/256 of the resolution of the second view of the image, so it is a low-resolution picture; convolutional encoding of this picture yields the low-resolution features of the whole second-view picture, namely the pose coding features.
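After the encoder, the Flatten step turns the [C, h, w] feature map into a token sequence for the attention module. A minimal sketch (the channel count and spatial size are hypothetical; note h·w = HW/256 is consistent with the 1/256 resolution stated above):

```python
import numpy as np

# Hypothetical output of the ResNet18 encoder on the second-view picture: [C, H/16, W/16]
C, h, w = 128, 16, 16
feat = np.random.default_rng(0).random((C, h, w))

# Flatten the spatial dimensions into a sequence of h*w tokens of width C
tokens = feat.reshape(C, h * w).T
print(tokens.shape)  # (256, 128)
```

Each of the 256 tokens corresponds to one spatial location of the low-resolution map, ready to serve as the keyword sequence K and value sequence V.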
In one embodiment of the application, the pose masking module comprises two convolutions with convolution kernels of 3×3 and a slicing unit, so as to realize slicing feature masking operation and intermediate masking feature masking operation;
after the pose mask module obtains a first view angle picture, the resolution of the first view angle picture is 1/64 of the resolution of the first view angle of the image, and the obtained first view angle picture is a high resolution picture;
convolving and slicing the high-resolution picture with a 3×3 convolution and the slicing unit, respectively, to obtain coarse mask information and slice features. The slicing unit is implemented with the dimension conversion Reshape: dimension conversion turns the high-resolution picture into several small matrices of equal size, which are physically image blocks of the same size, identical to the patch slicing operation of the Transformer module in the image field. The image blocks can be position-encoded to prevent positional confusion during training; position encoding belongs to the prior art and is not repeated here.
The coarse mask information is flattened through the flat and is consistent with the slice feature dimension, and the slice feature is subjected to coarse mask information matching, so that slice feature mask operation is realized, and intermediate mask features are obtained;
In fig. 3 and 4, the person is walking in a certain posture; the background information is gradually covered during pose masking, leaving the walking posture of the person, which is convenient to process. During coarse-mask matching, the coarse mask information is added to the slice features to realize the masking of the slice features, determining the image positions to focus on: the joint parts of the person are highlighted and the non-person parts are covered. Because a bank contains many machines, the acquired image includes background information that both affects the accuracy of judging the person's action and makes the computation more troublesome; covering the background information improves the accuracy of the operation.
The intermediate mask feature masking operation includes:
performing a convolution operation on the coarse mask information with the other 3×3 convolution to obtain fine mask information, flattening it with Flatten to match the size of the intermediate mask features, and performing fine-mask matching on the intermediate mask features to obtain the pose mask features;
the pose mask features are high-resolution features of the whole image after masking.
As shown in figs. 3 and 4, in the method of the present application the pose mask features are high-resolution features of the masked whole image, retaining the person region rather than the entire image. Since the pose mask module itself processes the high-resolution image, without masking it would always process the whole image, consuming too much memory, complicating the computation and affecting speed.
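The two-stage masking described above (coarse mask added to the slice features, then fine mask added to the intermediate mask features) can be sketched with additive matching in NumPy. The picture size, the 4×4 patch size, and the random stand-ins for the two 3×3 convolution outputs are all hypothetical:

```python
import numpy as np

def to_patches(x, p=4):
    """Reshape-based slicing: split a square map into flattened p x p patches."""
    n = x.shape[0] // p
    return x.reshape(n, p, n, p).transpose(0, 2, 1, 3).reshape(n * n, p * p)

rng = np.random.default_rng(1)
high_res = rng.normal(size=(32, 32))   # hypothetical first-view high-resolution picture

# Stage 1: coarse mask (stand-in for the first 3x3 conv output) matched to slice features
coarse = rng.normal(size=(32, 32))
slices = to_patches(high_res)          # slice features via Reshape (64 patches of 16)
intermediate = slices + to_patches(coarse)   # additive coarse-mask matching

# Stage 2: fine mask (stand-in for the second 3x3 conv on the coarse mask)
fine = rng.normal(size=(32, 32))
pose_mask_features = intermediate + to_patches(fine)  # additive fine-mask matching
print(pose_mask_features.shape)  # (64, 16)
```

The flattened patch sequence produced here is what later serves as the query sequence Q in the pose interaction module.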
In addition, in the method of the application, the pose coding module processes the whole low-resolution picture, guaranteeing the extraction of background information so that, as the keyword sequence K and value sequence V in the subsequent pose interaction module, it can effectively perceive the overall information distribution, improving the completeness of personnel detection in the bank. Adopting the lower-resolution second-view picture also effectively improves the processing speed of the whole picture.
In one embodiment of the application, the pose interaction module comprises a spatially-aware interaction attention, Spatial-perception Multi-Head Cross-Attention (S-MHCA), and a Multilayer Perceptron (MLP);
the spatially-aware interaction focus comprises a spatially-aware unit F;
after acquiring the value sequence and the position-encoded query and keyword sequences, matrix calculation is performed on the query sequence Q and the keyword sequence K, and the spatial perception unit F is applied to the result to obtain the spatial perception feature N:
N = F(QK^T)
the space perception interaction attention calculates a value sequence V and a space perception feature N to obtain a space perception interaction feature M;
M = softmax(N / √D_h) · V
where softmax is the activation function and D_h is the multi-head dimension, taken as 256 in the application. The spatially-aware interaction feature M is input into the multilayer perceptron, which processes the different channels of the feature to obtain the pose interaction features;
and the space perception interaction attention and the multi-layer perception machine are connected by adopting residual errors.
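Putting the two formulas together, S-MHCA computes N = F(QK^T) and then M = softmax(N / √D_h)·V. A minimal single-head NumPy sketch follows; the spatial perception unit F is reduced to the identity here (the real F applies LayerNorm, reshape, a 1×1 convolution and GELU), and the sequence lengths are hypothetical:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def spatial_perception(a):
    """Stand-in for the spatial perception unit F (identity for this sketch)."""
    return a

D_h = 256                        # constant from the text
rng = np.random.default_rng(2)
Q = rng.normal(size=(64, D_h))   # query sequence from the pose mask features
K = rng.normal(size=(16, D_h))   # keyword sequence from the pose coding features
V = rng.normal(size=(16, D_h))   # value sequence from the pose coding features

N = spatial_perception(Q @ K.T)               # N = F(QK^T)
M = softmax(N / np.sqrt(D_h), axis=-1) @ V    # M = softmax(N / sqrt(D_h)) V
print(M.shape)  # (64, 256)
```

Each query row of M is a convex combination of the second-view value tokens, weighted by the spatially-perceived similarity N.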
In the method, the high resolution of the first-view picture is exploited while masking removes background interference and increases attention on the person; the whole scene of the second-view image is retained to prevent recognition errors caused by excessive masking; and the training process is completed with an interaction attention mechanism, which is currently rare in this field. Under the spatially-aware interaction attention, the second-view and first-view images interact, and under the spatial perception unit, the spatial perception of the second-view image is improved.
Further, the spatial perception unit comprises a convolution with a convolution kernel of 1×1;
after the matrix calculation on the query sequence Q and the keyword sequence K yields the matrix calculation features, the spatial perception unit processes them as follows:
Layer Normalization (LayerNorm) is applied to the matrix calculation features to obtain standardized information features;
dimension conversion is applied to the standardized information features, convolution with a 1×1 kernel is performed, and GELU performs feature activation followed by dimension conversion back, yielding spatial perception features of the same size as the matrix calculation features;
the convolution with a 1×1 kernel produces features of the same size as the second-view picture. Because the keyword sequence K and the value sequence V both come from the second-view picture, whose resolution is low, the dimension conversion unfolds the space from one dimension to two, improving the spatial perception capability of the image.
As shown in fig. 5, in the method of the present application the resolution of the second-view image is low, so under the spatially aware interaction attention the formed keyword sequence K and value sequence V can be expanded into two-dimensional form, that is, the spatial dimensions H and W are unfolded; a convolution with a 1×1 kernel and stride 1 then processes the spatial features, and the activation function GELU activates the result, improving the spatial perception capability of the second-view image.
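The processing chain of the spatial perception unit can be sketched in numpy; in this single-channel toy version the 1×1 convolution collapses to a per-element scale and bias, and all sizes are illustrative:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def gelu(x):
    # tanh approximation of the GELU activation
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def spatial_unit(A, H, W, weight=1.0, bias=0.0):
    """A: (L_q, H*W) attention matrix; normalize, unfold, 1x1-convolve, refold."""
    x = layer_norm(A)                 # LayerNorm over the token axis
    x = x.reshape(A.shape[0], H, W)   # dimension conversion: 1D -> 2D (H, W)
    x = weight * x + bias             # 1x1 convolution, stride 1, one channel
    x = gelu(x)                       # GELU feature activation
    return x.reshape(A.shape)         # back to the size of the input matrix

A = np.arange(12.0).reshape(2, 6)
N = spatial_unit(A, H=2, W=3)
print(N.shape)                        # (2, 6): same size as the input
```

A real implementation would use a learned multi-channel 1×1 convolution; the point of the sketch is the unfold/convolve/refold order, which leaves the feature size unchanged.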
Further, in the network training process, the human body posture prediction module calculates the obtained human body posture skeleton features against the human body posture skeleton ground truth with a Mean Squared Error (MSE) loss to obtain Loss2;
the human body motion prediction module calculates the detected human body motion against the human body motion ground truth with a Cross Entropy loss to obtain Loss1;
reverse gradient propagation is then performed with the following formula to complete the training process:
Loss = αLoss1 + βLoss2
where α and β take values strictly between 0 and 1 (excluding 0 and 1); their sum is generally 1, although other values are possible. In the training and testing process, both α and β take the value 0.5.
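A small numpy illustration of the combined loss under the stated setting α = β = 0.5 (the toy predictions and ground truths are made up; a real implementation would backpropagate this loss through both modules):

```python
import numpy as np

def mse_loss(pred, target):                    # Loss2: pose skeleton regression
    return float(np.mean((pred - target) ** 2))

def cross_entropy_loss(logits, label):         # Loss1: action classification
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return float(-log_probs[label])

alpha, beta = 0.5, 0.5                         # values used in training/testing
skeleton_pred = np.array([0.2, 0.8, 0.5])      # toy skeleton features
skeleton_true = np.array([0.0, 1.0, 0.5])      # toy ground truth
action_logits = np.array([2.0, 0.1, -1.0])     # toy action scores

loss1 = cross_entropy_loss(action_logits, label=0)
loss2 = mse_loss(skeleton_pred, skeleton_true)
loss = alpha * loss1 + beta * loss2            # Loss = αLoss1 + βLoss2
print(loss > 0)                                # True
```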
In a laboratory environment with a GeForce RTX 2080 Ti graphics card, with both α and β set to 0.5, the method was tested on a self-built data set and an official action test data set. With 256×192 camera images, the overall model runs at 132 fps (average over one minute after the test stabilizes) with an accuracy of 91%. In addition, to show the effectiveness of our method on an official data set, the human body posture prediction module was evaluated on the COCO data set with 256×192 images, feeding both cameras the same view; the average precision (AP) reached 74.2, exceeding the result of "A Fast and Effective Transformer for Human Pose Estimation" and demonstrating the effectiveness of our method, with an average running speed of 157 fps (average over one minute after the test stabilizes). fps is the number of frames transmitted per second (Frames Per Second). Fig. 6 shows our final result. The user can designate a recognized action as abnormal; for example, if falling is designated as abnormal (a person lying on the bank floor can be considered abnormal), an alarm is triggered to remind background personnel that someone in the bank may have fainted, realizing the abnormal-behavior alarm.
The working principle of the application is as follows: shooting a person in a bank by using two cameras arranged in the bank at different angles; acquiring a first view angle of an image by using a first camera, processing the first view angle of the image to obtain a first view angle picture, and sending the first view angle picture to a pose mask module to obtain pose mask characteristics; acquiring a second visual angle of the image by using a second camera, processing the second visual angle of the image to obtain a second visual angle picture, and sending the second visual angle picture to a pose coding module to obtain pose coding characteristics; inputting the obtained pose mask features and the pose coding features into a pose interaction module for interaction to obtain pose interaction features; the obtained pose interaction characteristics are input into a human body action prediction module, human body actions are detected, and abnormal behaviors of people in a bank are identified.
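The working principle above can be sketched as a small pipeline; every function below is a hypothetical stand-in for the corresponding trained module, so the stub bodies only illustrate the data flow:

```python
# Pipeline sketch with stand-in components (all names hypothetical); each stub
# replaces a trained network from the method.
def preprocess(img):                return img
def pose_mask_module(pic):          return ("Q", pic)      # pose mask features
def pose_encode_module(pic):        return ("KV", pic)     # pose coding features
def pose_interaction_module(q, kv): return (q, kv)         # S-MHCA + MLP
def action_prediction_module(feat): return "fall"          # pretend a fall is seen

def detect_abnormal_behavior(view1, view2, abnormal_actions={"fall"}):
    pic1 = preprocess(view1)                  # high-resolution first view
    pic2 = preprocess(view2)                  # low-resolution second view
    q = pose_mask_module(pic1)                # -> query sequence Q
    kv = pose_encode_module(pic2)             # -> keyword K and value V
    feat = pose_interaction_module(q, kv)     # pose interaction features
    action = action_prediction_module(feat)   # detected human action
    return action, action in abnormal_actions

print(detect_abnormal_behavior("cam1_frame", "cam2_frame"))  # ('fall', True)
```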
In our method, both the first view and the second view of the image are processed with a convolution kernel of 5×5. A large convolution captures as much relevant image information as possible, because the kernel size determines the receptive field; different stride and padding settings are then used to obtain second-view pictures and first-view pictures of different sizes.
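The relation between kernel size, stride, padding, and output resolution follows the standard convolution formula; the stride and padding values below are illustrative, not taken from the application:

```python
def conv_out(size, kernel=5, stride=1, padding=0):
    # standard convolution output size: floor((n + 2p - k) / s) + 1
    return (size + 2 * padding - kernel) // stride + 1

# same 5x5 kernel, different stride/padding settings give different resolutions
print(conv_out(256, stride=1, padding=2))  # 256: "same" padding keeps resolution
print(conv_out(256, stride=2, padding=2))  # 128: halved
print(conv_out(256, stride=4, padding=2))  # 64: quartered
```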
The pose mask feature is a high-resolution feature obtained after masking the whole image: masking keeps only the person region rather than the whole image. Because a bank contains many machines, processing an image that still contains background information would be very slow. If the pose mask module did not mask the image, it would always process the full high-resolution image, with excessive memory consumption and complex calculation, which would affect speed.
The pose coding module always processes the whole picture, which ensures extraction of background information and allows the global information distribution to serve effectively as the keyword sequence K and the value sequence V in the subsequent pose interaction model, improving the completeness of person detection in the bank; the lower resolution of the second-view picture also speeds up processing of the whole picture.
In one embodiment of the present application, both α and β take the value 0.5, the running speed is 132 fps, and the accuracy exceeds 90%. Fig. 6 shows the final detection result of the present application; to demonstrate the performance of the method, the human body posture skeleton features are also shown.
The pose interaction module exploits the high resolution of the first-view picture, removes background interference, and increases attention on the person, while the whole scene acquired from the second-view image prevents recognition errors caused by over-masking, completing the training process. The spatially aware interaction attention in the pose interaction module realizes interaction between the second-view and first-view images, and the spatial perception unit improves the spatial perception capability of the second-view image.
The foregoing description of the preferred embodiments of the application is not intended to be limiting, but rather is intended to cover all modifications, equivalents, and alternatives falling within the spirit and principles of the application.

Claims (5)

1. A method for identifying abnormal behaviors of personnel in a bank based on double cameras is characterized by comprising the following steps of:
arranging two cameras, a first camera and a second camera, in the bank room at different angles;
shooting a person in a bank by using two cameras;
acquiring a first view angle of an image by using a first camera, processing the first view angle of the image to obtain a first view angle picture, and sending the first view angle picture to a pose mask module to obtain pose mask characteristics;
acquiring a second visual angle of the image by using a second camera, processing the second visual angle of the image to obtain a second visual angle picture, and sending the second visual angle picture to a pose coding module to obtain pose coding characteristics;
inputting the obtained pose mask features and the pose coding features into a pose interaction module for interaction to obtain pose interaction features;
inputting the obtained pose interaction characteristics into a human body action prediction module, detecting human body actions, and identifying abnormal behaviors of people in a bank;
the human body posture prediction system also comprises a human body posture prediction module;
the human body posture prediction module is formed by deconvolution, acquires the pose interaction characteristics, recovers the pose interaction characteristics by deconvolution, aligns the pose interaction characteristics with joint characteristics of the human body posture to obtain skeleton characteristics of the human body posture, and detects the human body posture;
the human body action prediction module acquires the human body posture framework characteristics and fuses the human body posture framework characteristics and the pose interaction characteristics;
in the process of fusing the human body posture framework features and the pose interaction features, the human body posture framework features are subjected to convolution downsampling by means of convolution, the dimensions of the human body posture framework features and the pose interaction features are aligned, and matrix addition operation is utilized to carry out addition fusion on the human body posture framework features and the pose interaction features;
the pose masking module acquires a first view image, and masking operation is carried out on the first view image, wherein the masking operation process at least comprises convolution with a convolution kernel of 3 multiplied by 3;
the pose coding module acquires a second view angle picture and performs coding operation on the second view angle picture, wherein the coding operation process consists of a ResNet network, and after an image is coded, a dimension flattening unit Flatten is utilized for flattening to obtain pose coding characteristics;
taking the pose mask features as a query sequence Q in an interaction attention matrix;
the pose coding features are used as a keyword sequence K and a value sequence V in an interaction attention matrix;
respectively carrying out position coding on the query sequence and the keyword sequence;
inputting the query sequence and the keyword sequence after the value sequence and the position code into a pose interaction module to realize interaction attention operation and obtain pose interaction characteristics;
the encoding operation process of the pose encoding module comprises the following steps:
after a residual error module formed by ResNet18 is adopted to obtain a second view angle picture, the resolution of the second view angle picture is 1/256 of the resolution of a second view angle of the image, the obtained second view angle picture is a low resolution picture, and convolutional encoding is carried out on the low resolution picture to obtain the low resolution characteristic of the whole image of the second view angle picture;
the pose mask module comprises two convolutions with convolution kernels of 3 multiplied by 3 and a slicing unit;
the pose masking module masking operation of the first view image comprises the following steps:
after the pose mask module obtains a first view angle picture, the resolution of the first view angle picture is 1/64 of the resolution of the first view angle of the image, and the obtained first view angle picture is a high resolution picture;
respectively carrying out convolution and slicing on the high-resolution picture by using a convolution with a convolution kernel of 3 multiplied by 3 and a slicing unit to respectively obtain coarse mask information and slice characteristics;
the coarse mask information is flattened through Flatten to be consistent with the slice feature dimension, and coarse mask information matching is performed on the slice features, realizing the slice feature mask operation and obtaining intermediate mask features;
a convolution operation with the other 3×3 kernel is performed on the coarse mask information to obtain fine mask information, the fine mask information is flattened through Flatten to be consistent with the size of the intermediate mask features, and fine mask information matching is performed on the intermediate mask features, realizing the intermediate mask feature mask operation and obtaining the pose mask features;
the pose interaction module comprises a spatially aware interaction attention, Spatial-perception Multi-Head Cross-Attention (S-MHCA), and a Multilayer Perceptron (MLP); the spatially aware interaction attention comprises a spatial perception unit F;
after obtaining a query sequence and a keyword sequence after value sequence and position coding, performing matrix calculation on the query sequence Q and the keyword sequence K, and performing space sensing through a space sensing unit F by using the following formula to obtain a space sensing feature N:
N=F(QK T )
the spatially aware interaction attention calculates the value sequence V and the spatial perception feature N with the following formula to obtain the spatially aware interaction feature M:
M = softmax(N/√D_h)·V
where softmax is the activation function and D_h is the dimension used in the multi-head operation;
and inputting the space interaction perception feature M into the multi-layer perception machine, and processing different channels of the space interaction perception feature and the space feature to obtain the pose interaction feature.
2. The method for identifying abnormal behaviors of personnel in a bank based on double cameras as claimed in claim 1, wherein the method comprises the following steps:
the pose mask features are high-resolution features of the whole image after masking, and the spatially aware interaction attention and the multilayer perceptron are connected by residual connections.
3. The method for identifying abnormal behaviors of personnel in a bank based on double cameras according to claim 1 or 2, wherein the method comprises the following steps:
the spatial perception unit comprises convolution with a convolution kernel of 1 x 1;
after the matrix calculation is carried out on the query sequence Q and the keyword sequence K, matrix calculation characteristics are obtained, and the following processing is carried out by utilizing the space perception unit:
performing Layer standardization processing on the matrix calculation characteristics by using Layer standardization Layer Norm to obtain standardized information characteristics;
performing dimension conversion on the standardized information features by using dimension conversion, performing convolution processing by convolution kernel of 1×1, and performing feature activation and dimension conversion by using Gelu to obtain space perception features with the same size as the matrix calculation features;
the convolution processing is performed by convolution with a convolution kernel of 1×1 to obtain features of the same size as the second-view picture, the keyword sequence K and the value sequence V both coming from the second-view picture.
4. A method for identifying abnormal behaviors of a person in a bank based on double cameras as claimed in claim 3, wherein:
in the network training process, the human body posture prediction module calculates the obtained human body posture skeleton characteristics and the human body posture skeleton true value by adopting a mean square error Loss Mean Squared Error Loss to obtain Loss2;
the human body motion prediction module calculates the detected human body motion and the human body motion true value by adopting cross entropy Loss Cross Entropy Loss to obtain Loss1;
reverse gradient return is performed by using the following formula to complete the training process
Loss = αLoss1 + βLoss2, where α and β are values between 0 and 1, excluding 0 and 1.
5. The method for identifying abnormal behaviors of personnel in a bank based on double cameras as claimed in claim 1, wherein the method comprises the following steps: the slicing unit is realized by adopting dimension conversion Reshape.
CN202310407090.5A 2023-04-17 2023-04-17 Method for identifying abnormal behaviors of personnel in bank based on double cameras Active CN116386145B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310407090.5A CN116386145B (en) 2023-04-17 2023-04-17 Method for identifying abnormal behaviors of personnel in bank based on double cameras


Publications (2)

Publication Number Publication Date
CN116386145A CN116386145A (en) 2023-07-04
CN116386145B true CN116386145B (en) 2023-11-03


Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA1306310C (en) * 1987-11-13 1992-08-11 Shreyaunsh R. Shah Distributed computer system
CN111523378A (en) * 2020-03-11 2020-08-11 浙江工业大学 Human behavior prediction method based on deep learning
CN112530437A (en) * 2020-11-18 2021-03-19 北京百度网讯科技有限公司 Semantic recognition method, device, equipment and storage medium
CN112733707A (en) * 2021-01-07 2021-04-30 浙江大学 Pedestrian re-identification method based on deep learning
CN113988086A (en) * 2021-09-29 2022-01-28 阿里巴巴达摩院(杭州)科技有限公司 Conversation processing method and device
CN114550305A (en) * 2022-03-04 2022-05-27 合肥工业大学 Human body posture estimation method and system based on Transformer
CN114627555A (en) * 2022-03-15 2022-06-14 淮阴工学院 Human body action recognition method, system and equipment based on shunt attention network
CN114817494A (en) * 2022-04-02 2022-07-29 华南理工大学 Knowledge type retrieval type dialogue method based on pre-training and attention interaction network
CN114898734A (en) * 2022-05-20 2022-08-12 北京百度网讯科技有限公司 Pre-training method and device based on speech synthesis model and electronic equipment




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant