CN116386145B - Method for identifying abnormal behaviors of personnel in bank based on double cameras - Google Patents


Info

Publication number
CN116386145B
CN116386145B CN202310407090.5A
Authority
CN
China
Prior art keywords
pose
interaction
human body
features
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310407090.5A
Other languages
Chinese (zh)
Other versions
CN116386145A (en)
Inventor
缪仁亮
王冬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
ZHEJIANG FINANCIAL COLLEGE
Original Assignee
ZHEJIANG FINANCIAL COLLEGE
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by ZHEJIANG FINANCIAL COLLEGE filed Critical ZHEJIANG FINANCIAL COLLEGE
Priority to CN202310407090.5A priority Critical patent/CN116386145B/en
Publication of CN116386145A publication Critical patent/CN116386145A/en
Application granted granted Critical
Publication of CN116386145B publication Critical patent/CN116386145B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/52Surveillance or monitoring of activities, e.g. for recognising suspicious objects


Abstract

The application discloses a method for identifying abnormal behaviors of personnel in a bank based on two cameras installed in the bank at different angles, which photograph the personnel inside. A first camera acquires a first view of the image; this view is processed into a first-view picture and sent to a pose mask module to obtain pose mask features. A second camera acquires a second view of the image; this view is processed into a second-view picture and sent to a pose coding module to obtain pose coding features. The pose mask features and pose coding features are input into a pose interaction module to obtain pose interaction features, which are then input into a human action prediction module to detect human actions. The overall scheme identifies abnormal behaviors of personnel in the bank, completes the detection with high accuracy, and helps ensure the safety of people inside the bank.

Description

Method for identifying abnormal behaviors of personnel in bank based on double cameras
Technical Field
The application relates to the technical field of gesture recognition, in particular to a method for recognizing abnormal behaviors of personnel in a bank based on double cameras.
Background
Action recognition is a natural extension of image classification to the video domain. Current deep learning algorithms already exceed average human accuracy on image classification, but progress in action recognition has not been as remarkable as in image classification.
At present, action recognition is widely applied; for example, adding an action-recognition alarm to a bank monitoring area enables real-time monitoring. Image data captured by the cameras is analyzed automatically and promptly, and once an abnormal situation is found, the relevant personnel can be notified quickly, so that extreme situations are avoided.
In current abnormal-behavior detection in banks, the real-time performance and accuracy of the related algorithms are difficult to reconcile, and most bank anomaly detection is performed with a single camera, so the detection effect still needs improvement.
Disclosure of Invention
In view of these problems, the application provides a method for identifying abnormal behaviors of personnel in a bank based on two cameras, which detects abnormal behaviors in the bank with high accuracy, runs quickly, and effectively solves the problems described in the background.
In order to achieve the above purpose, the present application provides the following technical solutions:
a method for identifying abnormal behaviors of personnel in a bank based on two cameras, characterized by comprising the following steps:
two cameras, a first camera and a second camera, are arranged at different angles inside the bank;
shooting a person in a bank by using two cameras;
acquiring a first view angle of an image by using a first camera, processing the first view angle of the image to obtain a first view angle picture, and sending the first view angle picture to a pose mask module to obtain pose mask characteristics;
acquiring a second visual angle of the image by using a second camera, processing the second visual angle of the image to obtain a second visual angle picture, and sending the second visual angle picture to a pose coding module to obtain pose coding characteristics;
inputting the obtained pose mask features and the pose coding features into a pose interaction module for interaction to obtain pose interaction features;
the obtained pose interaction characteristics are input into a human body action prediction module, human body actions are detected, and abnormal behaviors of people in a bank are identified.
As a preferable technical scheme, the application also comprises a human body posture prediction module;
the human body posture prediction module is formed by deconvolution, acquires the pose interaction characteristics, recovers the pose interaction characteristics by deconvolution, aligns the pose interaction characteristics with joint characteristics of the human body posture to obtain skeleton characteristics of the human body posture, and detects the human body posture;
the human body action prediction module acquires the human body posture framework characteristics and fuses the human body posture framework characteristics and the pose interaction characteristics;
in the process of fusing the human body posture framework features and the pose interaction features, the human body posture framework features are subjected to convolution downsampling by means of convolution, the dimensions of the human body posture framework features and the pose interaction features are aligned, and matrix addition operation is utilized to carry out addition fusion on the human body posture framework features and the pose interaction features.
As a preferred technical scheme of the application, the pose masking module obtains a first view angle picture, and performs masking operation on the first view angle picture, wherein the masking operation process at least comprises convolution with a convolution kernel of 3×3;
the pose coding module acquires a second view angle picture and performs coding operation on the second view angle picture, wherein the coding operation process consists of a ResNet network, and after an image is coded, a dimension flattening unit Flatten is utilized for flattening to obtain pose coding characteristics;
taking the pose mask features as a query sequence Q in an interaction attention matrix;
the pose coding features are used as a keyword sequence K and a value sequence V in an interaction attention matrix;
respectively carrying out position coding on the query sequence and the keyword sequence;
and inputting the query sequence and the keyword sequence after the value sequence and the position code into a pose interaction module to realize interaction attention operation and obtain pose interaction characteristics.
As a preferred technical scheme of the application, the pose coding module adopts a residual module composed of ResNet18. After the second-view picture is obtained, its resolution is 1/256 of the resolution of the second view of the image, so the second-view picture is a low-resolution picture. The low-resolution picture is convolutionally encoded to obtain the low-resolution features of the whole second-view picture, which are flattened by the dimension flattening unit Flatten to obtain the pose coding features.
As a preferred technical scheme of the application, the pose mask module comprises two convolutions with 3×3 convolution kernels and a slicing unit;
the masking operation of the pose masking module comprises the following steps:
after the pose mask module obtains a first view angle picture, the resolution of the first view angle picture is 1/64 of the resolution of the first view angle of the image, and the obtained first view angle picture is a high resolution picture;
convolving and slicing the high-resolution picture with a 3×3 convolution and the slicing unit, respectively, to obtain coarse mask information and slice features;
the coarse mask information is flattened by Flatten to match the slice feature dimensions, and coarse-mask matching is performed on the slice features, realizing the slice-feature mask operation and obtaining intermediate mask features;
convolving the coarse mask information with the other 3×3 convolution to obtain fine mask information, flattening it with Flatten to match the size of the intermediate mask features, and performing fine-mask matching on the intermediate mask features, realizing the intermediate-mask-feature mask operation and obtaining the pose mask features;
the pose mask features are high-resolution features of the whole image after masking.
As a preferred technical scheme of the application, the pose interaction module comprises a spatially-aware interaction attention, Spatial-perception Multi-Head Cross-Attention (S-MHCA), and a Multilayer Perceptron (MLP);
the spatially-aware interaction focus comprises a spatially-aware unit F;
after acquiring the value sequence and the position-encoded query and keyword sequences, matrix calculation is performed on the query sequence Q and the keyword sequence K, and the spatial perception unit F is applied to the result to obtain the spatial perception feature N:
N = F(QK^T)
the spatially-aware interaction attention computes the value sequence V against the spatial perception feature N to obtain the spatially-aware interaction feature M:
M = softmax(N / √D_h) · V
where D_h is a constant, set to 256 in the present application. The spatially-aware interaction feature M is input into the multilayer perceptron, which processes the different channels of the feature to obtain the pose interaction features.
And the space perception interaction attention and the multi-layer perception machine are connected by adopting residual errors.
As a preferred technical solution of the present application, the spatial perception unit includes a convolution with a convolution kernel of 1×1;
after the matrix calculation is carried out on the query sequence Q and the keyword sequence K, matrix calculation characteristics are obtained, and the space perception unit carries out the following processing on the matrix calculation characteristics:
performing layer normalization (LayerNorm) on the matrix calculation features to obtain normalized information features;
applying dimension conversion to the normalized information features, convolving with a 1×1 kernel, then applying GELU activation and dimension conversion to obtain spatial perception features of the same size as the matrix calculation features;
the 1×1 convolution yields features of the same size as the second-view picture. Since the keyword sequence K and value sequence V both come from the low-resolution second-view picture, dimension conversion unfolds the space from one dimension to two, improving the spatial perception of the image.
As a preferred technical scheme of the application, during network training the human body posture prediction module computes a Mean Squared Error (MSE) loss between the obtained human-posture skeleton features and the skeleton ground truth to obtain Loss2;
the human action prediction module computes a Cross Entropy loss between the detected human action and the action ground truth to obtain Loss1;
backward gradient propagation is performed using the following formula to complete the training process:
Loss = α·Loss1 + β·Loss2
where α and β are values between 0 and 1.
Compared with the prior art, the application has the beneficial effects that:
1. The same user is detected by two cameras at different angles. One stream is processed into a high-resolution picture and masked by the pose mask module to extract the query sequence Q; masking removes background interference and speeds up image processing. The other stream is processed into a low-resolution picture and encoded by the pose coding module to obtain pose coding features, improving running speed. The pose mask features and pose coding features are then fed to the spatially-aware interaction attention we designed, improving the model's running speed while guaranteeing accuracy.
2. Existing Transformer models have been applied to human behavior prediction, but they cannot jointly process two images taken from different angles and struggle to run in real time. The method uses an interaction (Cross-Attention) mechanism: at the input of the pose interaction module, the same pose captured from different angles (the first-view and second-view pictures) is separately encoded and masked. The pose coding features effectively preserve the overall information, the pose mask module effectively attends to the pose of personnel in the bank, and both are jointly input to the Cross-Attention, improving action-recognition accuracy while guaranteeing running speed.
3. Dual-loss supervision predicts both the human posture and the behavior, making the bank's abnormal-behavior prediction more accurate. The human body posture prediction module serves as an intermediate supervision stage, effectively improving the accuracy of action prediction and making the training process more robust.
Drawings
FIG. 1 is a schematic flow chart of the method of the present application;
FIG. 2 is a schematic flow diagram of each processing module of the method of the present application;
FIG. 3 is a schematic diagram of a pose mask module according to the method of the present application;
FIG. 4 is a second schematic diagram of a pose mask module according to the method of the present application;
FIG. 5 is a schematic diagram of the spatially aware interaction of attention according to the method of the present application;
FIG. 6 is a diagram showing the result of the recognition of the motion of the method of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
Examples:
referring to fig. 1 to 6, the present application provides a technical solution: a recognition method of abnormal behaviors of personnel in a bank based on double cameras comprises two cameras, a first camera and a second camera, wherein the two cameras are arranged in the bank at different angles;
shooting a person in a bank by using two cameras;
acquiring a first view angle of an image by using a first camera, processing the first view angle of the image to obtain a first view angle picture, and sending the first view angle picture to a pose mask module to obtain pose mask characteristics;
acquiring a second visual angle of the image by using a second camera, processing the second visual angle of the image to obtain a second visual angle picture, and sending the second visual angle picture to a pose coding module to obtain pose coding characteristics;
inputting the obtained pose mask features and the pose coding features into a pose interaction module for interaction to obtain pose interaction features;
the obtained pose interaction characteristics are input into a human body action prediction module, human body actions are detected, and abnormal behaviors of people in a bank are identified.
In our method, the first view and the second view of the image are processed with 5×5 convolution kernels. A large convolution captures as much relevant image information as possible, since the kernel size determines the receptive field; different strides and padding operations are adopted to obtain first-view and second-view pictures of different sizes.
In the application, the image size of the first-view picture is [C, H/8, W/8] and that of the second-view picture is [C, H/16, W/16], where C is the number of image channels and W and H are the image width and height. The processed first-view and second-view pictures therefore differ in size, providing pictures of different sizes for subsequent encoding and masking, which improves the network's running speed while guaranteeing recognition accuracy.
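As a numerical check of how a single 5×5 convolution with different strides can produce the two picture sizes above, the standard convolution output-size formula can be evaluated. The input resolution 256 and the stride/padding values below are hypothetical choices for illustration; the patent does not state them:

```python
def conv_out_size(n, k, s, p):
    """Standard convolution output-size formula: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * p - k) // s + 1

H = W = 256  # hypothetical input resolution
# 5x5 kernel, stride 8, padding 2 -> first-view picture of size H/8
first = conv_out_size(H, k=5, s=8, p=2)
# same 5x5 kernel, stride 16, padding 2 -> second-view picture of size H/16
second = conv_out_size(H, k=5, s=16, p=2)
print(first, second)  # 32 16
```

With these assumed strides the two branches yield exactly H/8 and H/16, matching the [C, H/8, W/8] and [C, H/16, W/16] sizes stated above.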
In one embodiment of the application, the human body posture prediction system further comprises a human body posture prediction module;
the human body posture prediction module is formed by deconvolution, acquires the pose interaction characteristics, recovers the pose interaction characteristics by deconvolution, aligns the pose interaction characteristics with joint characteristics of the human body posture to obtain skeleton characteristics of the human body posture, and detects the human body posture;
the human body action prediction module acquires the human body posture framework characteristics and fuses the human body posture framework characteristics and the pose interaction characteristics;
in the process of fusing the human body posture framework features and the pose interaction features, the human body posture framework features are subjected to convolution downsampling by means of convolution, the dimensions of the human body posture framework features and the pose interaction features are aligned, and matrix addition operation is utilized to carry out addition fusion on the human body posture framework features and the pose interaction features.
Deconvolution is implemented with the Deconv module from "Simple Baselines for Human Pose Estimation and Tracking". Deconvolution restores the pose interaction features so that the image size matches the height and width of the original image; the result can then be compared with the ground truth in the dataset, and gradient back-propagation is performed after supervision.
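The size-restoring behaviour of such Deconv stages can be sketched with the transposed-convolution output-size formula. Kernel 4, stride 2, padding 1 are the values used in the Simple Baselines Deconv module (each stage doubles the spatial size); the starting size 16 and the number of stages are hypothetical:

```python
def deconv_out_size(n, k=4, s=2, p=1):
    """Transposed-convolution output size: (n - 1)*s - 2p + k (2x upsampling here)."""
    return (n - 1) * s - 2 * p + k

n = 16                 # hypothetical pose-interaction feature-map size
for _ in range(3):     # three Deconv stages, as in Simple Baselines
    n = deconv_out_size(n)
print(n)  # 128
```

Three stages take a 16×16 feature map back to 128×128, i.e. an 8x recovery of height and width toward the original image resolution.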
In the application, the output of the human body posture skeleton feature can further ensure the accuracy of motion prediction, double-layer supervision is realized, and compared with the previous supervision mode, the model can ensure more effective motion recognition of personnel in a bank in the training process, and the supervision of the human body posture skeleton feature is added to ensure the detection accuracy of abnormal behaviors in the bank.
In addition, in the network operation process, the dimension alignment of the human body posture framework features and the pose interaction features is completed by utilizing convolution with the convolution kernel size of 3×3, so that the dimension consistency of the human body posture framework features and the pose interaction features is ensured, the network can be trained normally, the human body posture framework features are used as intermediate supervision features to be added into the pose interaction features, and the effectiveness of pose interaction feature training is further improved.
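The fusion step described above (a 3×3 strided convolution aligns the skeleton features with the pose interaction features, then matrix addition fuses them) can be sketched minimally in NumPy. All sizes, the random weights, and the single-channel simplification are illustrative stand-ins, not the patent's actual dimensions:

```python
import numpy as np

def conv2d(x, w, stride=2, pad=1):
    """Naive single-channel convolution with stride and padding (illustrative only)."""
    x = np.pad(x, pad)
    k = w.shape[0]
    out_h = (x.shape[0] - k) // stride + 1
    out_w = (x.shape[1] - k) // stride + 1
    y = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # sum of the kernel applied to each strided window
            y[i, j] = np.sum(x[i*stride:i*stride+k, j*stride:j*stride+k] * w)
    return y

rng = np.random.default_rng(0)
skeleton = rng.random((32, 32))      # hypothetical human-posture skeleton feature map
interaction = rng.random((16, 16))   # hypothetical pose interaction feature map (half size)
w = rng.random((3, 3))               # 3x3 kernel used for the downsampling alignment

aligned = conv2d(skeleton, w, stride=2, pad=1)  # downsample 32x32 -> 16x16
fused = aligned + interaction                   # element-wise (matrix) addition fusion
print(fused.shape)  # (16, 16)
```

The stride-2 3×3 convolution with padding 1 halves each spatial dimension, so the two feature maps line up for the additive fusion.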
In one embodiment of the present application, the pose masking module obtains a first view image, and performs masking operation on the first view image, where the masking operation process at least includes convolution with a convolution kernel of 3×3;
the pose coding module acquires a second view angle picture and performs coding operation on the second view angle picture, wherein the coding operation process consists of a ResNet network, and after an image is coded, a dimension flattening unit Flatten is utilized for flattening to obtain pose coding characteristics;
taking the pose mask features as a query sequence Q in an interaction attention matrix;
the pose coding features are used as a keyword sequence K and a value sequence V in an interaction attention matrix;
respectively carrying out position coding on the query sequence and the keyword sequence;
and inputting the query sequence and the keyword sequence after the value sequence and the position code into a pose interaction module to realize interaction attention operation and obtain pose interaction characteristics.
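The Q/K/V assignment above (Q from the masked first view, K and V from the encoded second view, with positional encoding added to Q and K) is a standard cross-attention layout, sketched here in NumPy. The sequence lengths, feature width, and random positional encodings are hypothetical stand-ins:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q, k, v):
    """Cross-attention: queries from one view attend over keys/values of the other."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)       # similarity between the two views
    return softmax(scores, axis=-1) @ v  # weighted sum of the second view's values

rng = np.random.default_rng(0)
mask_feat = rng.normal(size=(64, 256))   # pose mask features  -> query sequence Q
code_feat = rng.normal(size=(16, 256))   # pose coding features -> K and V
pos_q = rng.normal(size=mask_feat.shape) * 0.02  # positional encoding (stand-in values)
pos_k = rng.normal(size=code_feat.shape) * 0.02

out = cross_attention(mask_feat + pos_q, code_feat + pos_k, code_feat)
print(out.shape)  # (64, 256)
```

Note that only Q and K receive positional encoding, matching the step above; V enters the attention unmodified.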
At present, most Transformer methods use the query sequence Q, keyword sequence K and value sequence V for self-attention, as in "Attention Is All You Need", ViT, and the recent action-prediction paper "AIM: Adapting Image Models for Efficient Video Action Recognition". This mode essentially ignores the correlation between different pictures, whereas the two cameras acquire images at different angles; with self-attention alone, it is difficult to guarantee the correlation between the two angles.
In an embodiment of the present application, the pose coding module uses a residual module composed of ResNet18. After the second-view picture is obtained, its resolution is 1/256 of the resolution of the second view of the image, so it is a low-resolution picture; convolutional encoding of this picture yields the low-resolution features of the whole second-view picture, namely the pose coding features.
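After the encoder, the Flatten step turns the [C, h, w] feature map into a token sequence for the attention module. A minimal sketch (the channel count and spatial size are hypothetical; note h·w = HW/256 is consistent with the 1/256 resolution stated above):

```python
import numpy as np

# Hypothetical output of the ResNet18 encoder on the second-view picture: [C, H/16, W/16]
C, h, w = 128, 16, 16
feat = np.random.default_rng(0).random((C, h, w))

# Flatten the spatial dimensions into a sequence of h*w tokens of width C
tokens = feat.reshape(C, h * w).T
print(tokens.shape)  # (256, 128)
```

Each of the 256 tokens corresponds to one spatial location of the low-resolution map, ready to serve as the keyword sequence K and value sequence V.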
In one embodiment of the application, the pose masking module comprises two convolutions with convolution kernels of 3×3 and a slicing unit, so as to realize slicing feature masking operation and intermediate masking feature masking operation;
after the pose mask module obtains a first view angle picture, the resolution of the first view angle picture is 1/64 of the resolution of the first view angle of the image, and the obtained first view angle picture is a high resolution picture;
convolving and slicing the high-resolution picture with a 3×3 convolution and the slicing unit, respectively, to obtain coarse mask information and slice features. The slicing unit is implemented with the dimension conversion Reshape: dimension conversion turns the high-resolution picture into several small matrices of equal size, which are physically image blocks of the same size, identical to the patch slicing operation of the Transformer module in the image field. The image blocks can be position-encoded to prevent positional confusion during training; position encoding belongs to the prior art and is not repeated here.
The coarse mask information is flattened through the flat and is consistent with the slice feature dimension, and the slice feature is subjected to coarse mask information matching, so that slice feature mask operation is realized, and intermediate mask features are obtained;
In fig. 3 and 4, the person is walking in a certain posture; the background information is gradually covered during pose masking, leaving the walking posture of the person, which is convenient to process. During coarse-mask matching, the coarse mask information is added to the slice features to realize the masking of the slice features, determining the image positions to focus on: the joint parts of the person are highlighted and the non-person parts are covered. Because a bank contains many machines, the acquired image includes background information that both affects the accuracy of judging the person's action and makes the computation more troublesome; covering the background information improves the accuracy of the operation.
The intermediate mask feature masking operation includes:
performing a convolution operation on the coarse mask information with the other 3×3 convolution to obtain fine mask information, flattening it with Flatten to match the size of the intermediate mask features, and performing fine-mask matching on the intermediate mask features to obtain the pose mask features;
the pose mask features are high-resolution features of the whole image after masking.
As shown in figs. 3 and 4, in the method of the present application the pose mask features are high-resolution features of the masked whole image, retaining the person region rather than the entire image. Since the pose mask module itself processes the high-resolution image, without masking it would always process the whole image, consuming too much memory, complicating the computation and affecting speed.
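The two-stage masking described above (coarse mask added to the slice features, then fine mask added to the intermediate mask features) can be sketched with additive matching in NumPy. The picture size, the 4×4 patch size, and the random stand-ins for the two 3×3 convolution outputs are all hypothetical:

```python
import numpy as np

def to_patches(x, p=4):
    """Reshape-based slicing: split a square map into flattened p x p patches."""
    n = x.shape[0] // p
    return x.reshape(n, p, n, p).transpose(0, 2, 1, 3).reshape(n * n, p * p)

rng = np.random.default_rng(1)
high_res = rng.normal(size=(32, 32))   # hypothetical first-view high-resolution picture

# Stage 1: coarse mask (stand-in for the first 3x3 conv output) matched to slice features
coarse = rng.normal(size=(32, 32))
slices = to_patches(high_res)          # slice features via Reshape (64 patches of 16)
intermediate = slices + to_patches(coarse)   # additive coarse-mask matching

# Stage 2: fine mask (stand-in for the second 3x3 conv on the coarse mask)
fine = rng.normal(size=(32, 32))
pose_mask_features = intermediate + to_patches(fine)  # additive fine-mask matching
print(pose_mask_features.shape)  # (64, 16)
```

The flattened patch sequence produced here is what later serves as the query sequence Q in the pose interaction module.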
In addition, in the method of the application, the pose coding module processes the whole low-resolution picture, guaranteeing the extraction of background information so that, as the keyword sequence K and value sequence V in the subsequent pose interaction module, it can effectively perceive the overall information distribution, improving the completeness of personnel detection in the bank. Adopting the lower-resolution second-view picture also effectively improves the processing speed of the whole picture.
In one embodiment of the application, the pose interaction module comprises a spatially-aware interaction attention, Spatial-perception Multi-Head Cross-Attention (S-MHCA), and a Multilayer Perceptron (MLP);
the spatially-aware interaction focus comprises a spatially-aware unit F;
after acquiring the value sequence and the position-encoded query and keyword sequences, matrix calculation is performed on the query sequence Q and the keyword sequence K, and the spatial perception unit F is applied to the result to obtain the spatial perception feature N:
N = F(QK^T)
the space perception interaction attention calculates a value sequence V and a space perception feature N to obtain a space perception interaction feature M;
M = softmax(N / √D_h) · V
where softmax is the activation function and D_h is the multi-head dimension, taken as 256 in the application. The spatially-aware interaction feature M is input into the multilayer perceptron, which processes the different channels of the feature to obtain the pose interaction features;
and the space perception interaction attention and the multi-layer perception machine are connected by adopting residual errors.
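Putting the two formulas together, S-MHCA computes N = F(QK^T) and then M = softmax(N / √D_h)·V. A minimal single-head NumPy sketch follows; the spatial perception unit F is reduced to the identity here (the real F applies LayerNorm, reshape, a 1×1 convolution and GELU), and the sequence lengths are hypothetical:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def spatial_perception(a):
    """Stand-in for the spatial perception unit F (identity for this sketch)."""
    return a

D_h = 256                        # constant from the text
rng = np.random.default_rng(2)
Q = rng.normal(size=(64, D_h))   # query sequence from the pose mask features
K = rng.normal(size=(16, D_h))   # keyword sequence from the pose coding features
V = rng.normal(size=(16, D_h))   # value sequence from the pose coding features

N = spatial_perception(Q @ K.T)               # N = F(QK^T)
M = softmax(N / np.sqrt(D_h), axis=-1) @ V    # M = softmax(N / sqrt(D_h)) V
print(M.shape)  # (64, 256)
```

Each query row of M is a convex combination of the second-view value tokens, weighted by the spatially-perceived similarity N.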
In the method, the high resolution of the first-view picture is exploited while masking removes background interference and increases attention on the person; the whole scene of the second-view image is retained to prevent recognition errors caused by excessive masking; and the training process is completed with an interaction attention mechanism, which is currently rare in this field. Under the spatially-aware interaction attention, the second-view and first-view images interact, and under the spatial perception unit, the spatial perception of the second-view image is improved.
Further, the spatial perception unit comprises a convolution with a convolution kernel of 1×1;
after the matrix calculation on the query sequence Q and the keyword sequence K yields the matrix calculation features, the spatial perception unit processes them as follows:
Layer Normalization (LayerNorm) is applied to the matrix calculation features to obtain standardized information features;
dimension conversion is applied to the standardized information features, convolution with a 1×1 kernel is performed, and GELU performs feature activation followed by dimension conversion back, yielding spatial perception features of the same size as the matrix calculation features;
the convolution with a 1×1 kernel produces features of the same size as the second-view picture. Because the keyword sequence K and the value sequence V both come from the second-view picture, whose resolution is low, the dimension conversion unfolds the space from one dimension to two, improving the spatial perception capability of the image.
As shown in fig. 5, in the method of the present application the resolution of the second-view image is low, so under the spatially aware interaction attention the formed keyword sequence K and value sequence V can be expanded into two-dimensional form, that is, the spatial dimensions H and W are unfolded; a convolution with a 1×1 kernel and stride 1 then processes the spatial features, and the activation function GELU activates the result, improving the spatial perception capability of the second-view image.
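The processing chain of the spatial perception unit can be sketched in numpy; in this single-channel toy version the 1×1 convolution collapses to a per-element scale and bias, and all sizes are illustrative:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def gelu(x):
    # tanh approximation of the GELU activation
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def spatial_unit(A, H, W, weight=1.0, bias=0.0):
    """A: (L_q, H*W) attention matrix; normalize, unfold, 1x1-convolve, refold."""
    x = layer_norm(A)                 # LayerNorm over the token axis
    x = x.reshape(A.shape[0], H, W)   # dimension conversion: 1D -> 2D (H, W)
    x = weight * x + bias             # 1x1 convolution, stride 1, one channel
    x = gelu(x)                       # GELU feature activation
    return x.reshape(A.shape)         # back to the size of the input matrix

A = np.arange(12.0).reshape(2, 6)
N = spatial_unit(A, H=2, W=3)
print(N.shape)                        # (2, 6): same size as the input
```

A real implementation would use a learned multi-channel 1×1 convolution; the point of the sketch is the unfold/convolve/refold order, which leaves the feature size unchanged.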
Further, in the network training process, the human body posture prediction module calculates the obtained human body posture skeleton features against the human body posture skeleton ground truth with a Mean Squared Error (MSE) loss to obtain Loss2;
the human body motion prediction module calculates the detected human body motion against the human body motion ground truth with a Cross Entropy loss to obtain Loss1;
reverse gradient propagation is then performed with the following formula to complete the training process:
Loss = αLoss1 + βLoss2
where α and β take values strictly between 0 and 1 (excluding 0 and 1); their sum is generally 1, although other values are possible. In the training and testing process, both α and β take the value 0.5.
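A small numpy illustration of the combined loss under the stated setting α = β = 0.5 (the toy predictions and ground truths are made up; a real implementation would backpropagate this loss through both modules):

```python
import numpy as np

def mse_loss(pred, target):                    # Loss2: pose skeleton regression
    return float(np.mean((pred - target) ** 2))

def cross_entropy_loss(logits, label):         # Loss1: action classification
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return float(-log_probs[label])

alpha, beta = 0.5, 0.5                         # values used in training/testing
skeleton_pred = np.array([0.2, 0.8, 0.5])      # toy skeleton features
skeleton_true = np.array([0.0, 1.0, 0.5])      # toy ground truth
action_logits = np.array([2.0, 0.1, -1.0])     # toy action scores

loss1 = cross_entropy_loss(action_logits, label=0)
loss2 = mse_loss(skeleton_pred, skeleton_true)
loss = alpha * loss1 + beta * loss2            # Loss = αLoss1 + βLoss2
print(loss > 0)                                # True
```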
In a laboratory environment with a GeForce RTX 2080 Ti graphics card, with both α and β set to 0.5, the method was tested on a self-built data set and an official action test data set. With 256×192 camera images, the overall model runs at 132 fps (average over one minute after the test stabilizes) with an accuracy of 91%. In addition, to show the effectiveness of our method on an official data set, the human body posture prediction module was evaluated on the COCO data set with 256×192 images, feeding both cameras the same view; the average precision (AP) reached 74.2, exceeding the result of "A Fast and Effective Transformer for Human Pose Estimation" and demonstrating the effectiveness of our method, with an average running speed of 157 fps (average over one minute after the test stabilizes). fps is the number of frames transmitted per second (Frames Per Second). Fig. 6 shows our final result. The user can designate a recognized action as abnormal; for example, if falling is designated as abnormal (a person lying on the bank floor can be considered abnormal), an alarm is triggered to remind background personnel that someone in the bank may have fainted, realizing the abnormal-behavior alarm.
The working principle of the application is as follows: shooting a person in a bank by using two cameras arranged in the bank at different angles; acquiring a first view angle of an image by using a first camera, processing the first view angle of the image to obtain a first view angle picture, and sending the first view angle picture to a pose mask module to obtain pose mask characteristics; acquiring a second visual angle of the image by using a second camera, processing the second visual angle of the image to obtain a second visual angle picture, and sending the second visual angle picture to a pose coding module to obtain pose coding characteristics; inputting the obtained pose mask features and the pose coding features into a pose interaction module for interaction to obtain pose interaction features; the obtained pose interaction characteristics are input into a human body action prediction module, human body actions are detected, and abnormal behaviors of people in a bank are identified.
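The working principle above can be sketched as a small pipeline; every function below is a hypothetical stand-in for the corresponding trained module, so the stub bodies only illustrate the data flow:

```python
# Pipeline sketch with stand-in components (all names hypothetical); each stub
# replaces a trained network from the method.
def preprocess(img):                return img
def pose_mask_module(pic):          return ("Q", pic)      # pose mask features
def pose_encode_module(pic):        return ("KV", pic)     # pose coding features
def pose_interaction_module(q, kv): return (q, kv)         # S-MHCA + MLP
def action_prediction_module(feat): return "fall"          # pretend a fall is seen

def detect_abnormal_behavior(view1, view2, abnormal_actions={"fall"}):
    pic1 = preprocess(view1)                  # high-resolution first view
    pic2 = preprocess(view2)                  # low-resolution second view
    q = pose_mask_module(pic1)                # -> query sequence Q
    kv = pose_encode_module(pic2)             # -> keyword K and value V
    feat = pose_interaction_module(q, kv)     # pose interaction features
    action = action_prediction_module(feat)   # detected human action
    return action, action in abnormal_actions

print(detect_abnormal_behavior("cam1_frame", "cam2_frame"))  # ('fall', True)
```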
In our method, both the first view and the second view of the image are processed with a convolution kernel of 5×5. A large convolution captures as much relevant image information as possible, because the kernel size determines the receptive field; different stride and padding settings are then used to obtain second-view pictures and first-view pictures of different sizes.
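The relation between kernel size, stride, padding, and output resolution follows the standard convolution formula; the stride and padding values below are illustrative, not taken from the application:

```python
def conv_out(size, kernel=5, stride=1, padding=0):
    # standard convolution output size: floor((n + 2p - k) / s) + 1
    return (size + 2 * padding - kernel) // stride + 1

# same 5x5 kernel, different stride/padding settings give different resolutions
print(conv_out(256, stride=1, padding=2))  # 256: "same" padding keeps resolution
print(conv_out(256, stride=2, padding=2))  # 128: halved
print(conv_out(256, stride=4, padding=2))  # 64: quartered
```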
The pose mask feature is a high-resolution feature obtained after masking the whole image: masking keeps only the person region rather than the whole image. Because a bank contains many machines, processing an image that still contains background information would be very slow. If the pose mask module did not mask the image, it would always process the full high-resolution image, with excessive memory consumption and complex calculation, which would affect speed.
The pose coding module always processes the whole picture, which ensures extraction of background information and allows the global information distribution to serve effectively as the keyword sequence K and the value sequence V in the subsequent pose interaction model, improving the completeness of person detection in the bank; the lower resolution of the second-view picture also speeds up processing of the whole picture.
In one embodiment of the present application, both α and β take the value 0.5, the running speed is 132 fps, and the accuracy exceeds 90%. Fig. 6 shows the final detection result of the present application; to demonstrate the performance of the method, the human body posture skeleton features are also shown.
The pose interaction module exploits the high resolution of the first-view picture, removes background interference, and increases attention on the person, while the whole scene acquired from the second-view image prevents recognition errors caused by over-masking, completing the training process. The spatially aware interaction attention in the pose interaction module realizes interaction between the second-view and first-view images, and the spatial perception unit improves the spatial perception capability of the second-view image.
The foregoing description of the preferred embodiments of the application is not intended to be limiting, but rather is intended to cover all modifications, equivalents, and alternatives falling within the spirit and principles of the application.

Claims (5)

1. A method for identifying abnormal behaviors of personnel in a bank based on double cameras is characterized by comprising the following steps of:
arranging two cameras, a first camera and a second camera, in the bank room at different angles;
shooting a person in a bank by using two cameras;
acquiring a first view angle of an image by using a first camera, processing the first view angle of the image to obtain a first view angle picture, and sending the first view angle picture to a pose mask module to obtain pose mask characteristics;
acquiring a second visual angle of the image by using a second camera, processing the second visual angle of the image to obtain a second visual angle picture, and sending the second visual angle picture to a pose coding module to obtain pose coding characteristics;
inputting the obtained pose mask features and the pose coding features into a pose interaction module for interaction to obtain pose interaction features;
inputting the obtained pose interaction characteristics into a human body action prediction module, detecting human body actions, and identifying abnormal behaviors of people in a bank;
the human body posture prediction system also comprises a human body posture prediction module;
the human body posture prediction module is formed by deconvolution, acquires the pose interaction characteristics, recovers the pose interaction characteristics by deconvolution, aligns the pose interaction characteristics with joint characteristics of the human body posture to obtain skeleton characteristics of the human body posture, and detects the human body posture;
the human body action prediction module acquires the human body posture framework characteristics and fuses the human body posture framework characteristics and the pose interaction characteristics;
in the process of fusing the human body posture framework features and the pose interaction features, the human body posture framework features are subjected to convolution downsampling by means of convolution, the dimensions of the human body posture framework features and the pose interaction features are aligned, and matrix addition operation is utilized to carry out addition fusion on the human body posture framework features and the pose interaction features;
the pose masking module acquires a first view image, and masking operation is carried out on the first view image, wherein the masking operation process at least comprises convolution with a convolution kernel of 3 multiplied by 3;
the pose coding module acquires a second view angle picture and performs coding operation on the second view angle picture, wherein the coding operation process consists of a ResNet network, and after an image is coded, a dimension flattening unit Flatten is utilized for flattening to obtain pose coding characteristics;
taking the pose mask features as a query sequence Q in an interaction attention matrix;
the pose coding features are used as a keyword sequence K and a value sequence V in an interaction attention matrix;
respectively carrying out position coding on the query sequence and the keyword sequence;
inputting the query sequence and the keyword sequence after the value sequence and the position code into a pose interaction module to realize interaction attention operation and obtain pose interaction characteristics;
the encoding operation process of the pose encoding module comprises the following steps:
after a residual error module formed by ResNet18 is adopted to obtain a second view angle picture, the resolution of the second view angle picture is 1/256 of the resolution of a second view angle of the image, the obtained second view angle picture is a low resolution picture, and convolutional encoding is carried out on the low resolution picture to obtain the low resolution characteristic of the whole image of the second view angle picture;
the pose mask module comprises two convolutions with convolution kernels of 3 multiplied by 3 and a slicing unit;
the pose masking module masking operation of the first view image comprises the following steps:
after the pose mask module obtains a first view angle picture, the resolution of the first view angle picture is 1/64 of the resolution of the first view angle of the image, and the obtained first view angle picture is a high resolution picture;
respectively carrying out convolution and slicing on the high-resolution picture by using a convolution with a convolution kernel of 3 multiplied by 3 and a slicing unit to respectively obtain coarse mask information and slice characteristics;
the coarse mask information is flattened through Flatten to be consistent with the slice feature dimension, and coarse mask information matching is performed on the slice features, realizing the slice feature mask operation and obtaining intermediate mask features;
a convolution operation with the other 3×3 kernel is performed on the coarse mask information to obtain fine mask information, the fine mask information is flattened through Flatten to be consistent with the size of the intermediate mask features, and fine mask information matching is performed on the intermediate mask features, realizing the intermediate mask feature mask operation and obtaining the pose mask features;
the pose interaction module comprises a spatially aware interaction attention, Spatial-perception Multi-Head Cross-Attention (S-MHCA), and a Multilayer Perceptron (MLP); the spatially aware interaction attention comprises a spatial perception unit F;
after obtaining a query sequence and a keyword sequence after value sequence and position coding, performing matrix calculation on the query sequence Q and the keyword sequence K, and performing space sensing through a space sensing unit F by using the following formula to obtain a space sensing feature N:
N=F(QK T )
the spatially aware interaction attention calculates the value sequence V and the spatial perception feature N with the following formula to obtain the spatially aware interaction feature M:
M = softmax(N/√D_h)·V
where softmax is the activation function and D_h is the dimension used in the multi-head operation;
and inputting the space interaction perception feature M into the multi-layer perception machine, and processing different channels of the space interaction perception feature and the space feature to obtain the pose interaction feature.
2. The method for identifying abnormal behaviors of personnel in a bank based on double cameras as claimed in claim 1, wherein the method comprises the following steps:
the pose mask features are high-resolution features of the whole image after masking, and the spatially aware interaction attention and the multilayer perceptron are connected by residual connections.
3. The method for identifying abnormal behaviors of personnel in a bank based on double cameras according to claim 1 or 2, wherein the method comprises the following steps:
the spatial perception unit comprises convolution with a convolution kernel of 1 x 1;
after the matrix calculation is carried out on the query sequence Q and the keyword sequence K, matrix calculation characteristics are obtained, and the following processing is carried out by utilizing the space perception unit:
performing Layer standardization processing on the matrix calculation characteristics by using Layer standardization Layer Norm to obtain standardized information characteristics;
performing dimension conversion on the standardized information features by using dimension conversion, performing convolution processing by convolution kernel of 1×1, and performing feature activation and dimension conversion by using Gelu to obtain space perception features with the same size as the matrix calculation features;
the convolution processing is performed by convolution with a convolution kernel of 1×1 to obtain features of the same size as the second-view picture, the keyword sequence K and the value sequence V both coming from the second-view picture.
4. A method for identifying abnormal behaviors of a person in a bank based on double cameras as claimed in claim 3, wherein:
in the network training process, the human body posture prediction module calculates the obtained human body posture skeleton characteristics and the human body posture skeleton true value by adopting a mean square error Loss Mean Squared Error Loss to obtain Loss2;
the human body motion prediction module calculates the detected human body motion and the human body motion true value by adopting cross entropy Loss Cross Entropy Loss to obtain Loss1;
reverse gradient return is performed by using the following formula to complete the training process
Loss = αLoss1 + βLoss2, where α and β are values between 0 and 1, excluding 0 and 1.
5. The method for identifying abnormal behaviors of personnel in a bank based on double cameras as claimed in claim 1, wherein the method comprises the following steps: the slicing unit is realized by adopting dimension conversion Reshape.
CN202310407090.5A 2023-04-17 2023-04-17 Method for identifying abnormal behaviors of personnel in bank based on double cameras Active CN116386145B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310407090.5A CN116386145B (en) 2023-04-17 2023-04-17 Method for identifying abnormal behaviors of personnel in bank based on double cameras


Publications (2)

Publication Number Publication Date
CN116386145A CN116386145A (en) 2023-07-04
CN116386145B true CN116386145B (en) 2023-11-03


Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA1306310C (en) * 1987-11-13 1992-08-11 Shreyaunsh R. Shah Distributed computer system
CN111523378A (en) * 2020-03-11 2020-08-11 浙江工业大学 Human behavior prediction method based on deep learning
CN112530437A (en) * 2020-11-18 2021-03-19 北京百度网讯科技有限公司 Semantic recognition method, device, equipment and storage medium
CN112733707A (en) * 2021-01-07 2021-04-30 浙江大学 Pedestrian re-identification method based on deep learning
CN113988086A (en) * 2021-09-29 2022-01-28 阿里巴巴达摩院(杭州)科技有限公司 Conversation processing method and device
CN114550305A (en) * 2022-03-04 2022-05-27 合肥工业大学 Human body posture estimation method and system based on Transformer
CN114627555A (en) * 2022-03-15 2022-06-14 淮阴工学院 Human body action recognition method, system and equipment based on shunt attention network
CN114817494A (en) * 2022-04-02 2022-07-29 华南理工大学 Knowledge type retrieval type dialogue method based on pre-training and attention interaction network
CN114898734A (en) * 2022-05-20 2022-08-12 北京百度网讯科技有限公司 Pre-training method and device based on speech synthesis model and electronic equipment




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant