CN116343330A - Abnormal behavior identification method for infrared-visible light image fusion - Google Patents

Abnormal behavior identification method for infrared-visible light image fusion

Info

Publication number
CN116343330A
Authority
CN
China
Prior art keywords
image
visible light
infrared
data
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310211094.6A
Other languages
Chinese (zh)
Inventor
常荣
唐立军
党军朋
张毅
韩兆武
杨扬
易亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yuxi Power Supply Bureau of Yunnan Power Grid Co Ltd
Original Assignee
Yuxi Power Supply Bureau of Yunnan Power Grid Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yuxi Power Supply Bureau of Yunnan Power Grid Co Ltd filed Critical Yuxi Power Supply Bureau of Yunnan Power Grid Co Ltd
Priority to CN202310211094.6A priority Critical patent/CN116343330A/en
Publication of CN116343330A publication Critical patent/CN116343330A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/803Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of input or preprocessed data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/52Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Human Computer Interaction (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Medical Informatics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the technical field of image recognition processing, in particular to an abnormal behavior recognition method for infrared-visible light image fusion. First, image enhancement processing is performed on the infrared-visible light images; the fused images are classified and labeled as training samples, a target detection model is constructed, and the feature vectors corresponding to the fused image information are input into the target detection model to obtain a recognition result. The infrared and visible light fusion video stream is then input into a 3D neural network, and features are computed over the time and space dimensions of the video data. A 3D convolutional neural network extracts features from the data related to the human joint points, and abnormal behaviors are detected from the posture information obtained by extracting the human skeleton and the target position information obtained by view-angle transformation. The design of the invention improves training speed and reduces training time; the method adapts well to the data and, in particular, achieves better results when little calibrated data is available.

Description

Abnormal behavior identification method for infrared-visible light image fusion
Technical Field
The invention relates to the technical field of image recognition processing, in particular to an abnormal behavior recognition method for infrared-visible light image fusion.
Background
The infrared-visible light system combines visible light and infrared technologies to realize all-day, all-weather monitoring. Monitoring transmission is realized through various transmission means such as networks, wireless transmission or optical cables, so that higher-level departments can grasp the on-site situation intuitively and in real time, and can remotely operate front-end cameras from thousands of kilometres away for focused observation. The system can also be used in the military, public security, fire protection, oil fields, forest fire prevention, traffic management, the power grid industry and other important places that require all-day, all-weather monitoring. However, in existing systems, when facing harsh environments such as fog, insufficient illumination and severe weather, the monitored video image is severely disturbed and affected, so that the final imaging quality drops, the target recognition rate decreases, and the monitoring system may even fail to work, which affects operational stability. Therefore, research on multi-feature infrared-visible light multi-source image enhancement technology, which provides a better monitoring video effect for remote monitoring personnel, is currently an important subject in the industry.
The main purpose of image enhancement is to solve the problems of complex background and low illumination by using a convolutional neural network, and mainly extracts image feature points to repeatedly perform feature enhancement through convolution, so that the required target difference features are maximized, the recognition accuracy is improved, and a better monitoring video effect is provided for remote monitoring personnel.
Human behavior recognition and deep learning theory are research hotspots in the field of intelligent video analysis. They have received wide attention in academic and engineering circles in recent years and form the theoretical foundation of intelligent video analysis and understanding, video monitoring, human-computer interaction and related fields. In recent years, deep learning algorithms, which have attracted broad attention, have been successfully applied in various fields such as speech recognition and pattern recognition. Deep learning theory has achieved remarkable results in still image feature extraction and has gradually been generalized to video behavior recognition with time series. How to further improve the accuracy of human behavior recognition in video images in low-light environments is the technical problem to be solved by the invention.
Therefore, traditional behavior recognition methods and human behavior recognition methods based on deep learning are analyzed and summarized. Traditional methods place high requirements on the environment or shooting conditions of the video, and their feature extraction methods are designed manually and a priori. Behavior recognition methods based on deep learning do not need manually designed feature extraction like traditional methods; training and learning can be performed directly on video data to obtain the most effective characterization. In view of this, we propose an abnormal behavior recognition method for infrared-visible image fusion.
Disclosure of Invention
The invention aims to provide an abnormal behavior identification method for infrared-visible light image fusion, which is used for solving the problems in the background technology.
In order to solve the above technical problems, one of the purposes of the present invention is to provide an abnormal behavior identification method for infrared-visible light image fusion, which comprises the following steps:
s1, carrying out image enhancement processing on infrared-visible light images;
s2, inputting the visible light and infrared images after image enhancement into the generator of a Fusion-GAN network, wherein the generator of the Fusion-GAN captures the data distribution and the discriminator estimates the probability that a sample comes from the training data rather than from the generator; the generator and the discriminator are then trained adversarially, and the discriminator of the Fusion-GAN takes the fused image and the visible light image as input in order to distinguish them; the convolutions of the generator and the discriminator are replaced by depthwise separable convolutions and processed with a MobileNet-v3 architecture, which reduces the amount of calculation, and the fused image is output; the output fused image is input into the discriminator to independently adjust the fused image information and obtain a result; during the adversarial learning of the generator and the discriminator, the fused image is continuously optimized, and after the loss function reaches balance, the image with the best effect is retained;
s3, classifying and labeling the targets in the fused images, carrying out normalization processing according to the category coordinate information, inputting them together with the fused images into a YOLOv5 network, performing HLV color transformation on the fused images, and splicing the images with Mosaic data enhancement to serve as training samples; an improved feature pyramid model named AF-FPN is used, in which an adaptive attention module (AAM) and a feature enhancement module (FEM) reduce information loss and enhance the representation capability of the feature pyramid during feature map generation, so that the detection performance of the YOLOv5 network on multi-scale targets is improved while real-time detection is maintained; a target detection model is constructed, and the feature vectors corresponding to the fused image information are input into the target detection model to obtain a recognition result;
s4, after the Fusion of the infrared and visible light images of the improved Fusion-GAN network is completed, inputting an infrared and visible light Fusion video stream into a 3D neural network, and performing feature calculation on the time dimension and the space dimension of video data;
s5, dividing the input video into two independent data streams: a low resolution data stream and an original resolution data stream; the two data streams alternately comprise a convolution layer, a regular layer and an extraction layer, and the two data streams are finally combined into two full-connection layers for subsequent feature recognition;
and S6, performing feature extraction on the related data of the human body joint point by using a 3D convolutional neural network, and detecting abnormal behaviors according to the posture information obtained by extracting the human body skeleton and the target position information obtained by visual angle transformation.
As a further improvement of the present technical solution, before the S1 infrared-visible light images are all subjected to image enhancement processing, the method further includes a step of creating a data set:
continuously acquiring image data through a camera, transmitting the acquired image data to the processing end, extracting it frame by frame, and obtaining image information; the input images are subjected to simple translation, scaling, color change, cropping and Gaussian blur, which does not affect the category of the image and effectively alleviates the problems of insufficient samples and poor sample quality;
the data are trained with the improved YOLOv5 method; the number of training epochs is set to 500, since the model loss tends to stabilise once training reaches about 450 epochs; candidate initial learning rates of 0.001, 0.0005, 0.0001 and 0.00001 are compared, and the value at which the model converges fastest is chosen as the initial learning rate; the training momentum is chosen as 0.9, and the training batch size (batch_size) is set to 2;
in order to make the model converge faster and more accurately, a cross entropy function is adopted as the loss function; a snapshot of the current state is taken every 200 iterations, training is repeated after modifying the Batch-Size several times, and the final convergence accuracy is optimal when the Batch-Size is set to 50.
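For reference, a minimal sketch of such an augmentation pipeline and training setup is given below; the library choice (torchvision/PyTorch) and all concrete parameter values are illustrative assumptions and are not taken from this disclosure.

```python
# Hedged sketch of the data-set augmentation described above: simple translation,
# scaling, colour change, cropping and Gaussian blur. Sizes and magnitudes are
# assumed values for illustration only.
import torchvision.transforms as T

augment = T.Compose([
    T.Resize((640, 640)),                              # assumed working resolution
    T.RandomAffine(degrees=0,
                   translate=(0.1, 0.1),               # simple translation
                   scale=(0.8, 1.2)),                  # scaling
    T.ColorJitter(brightness=0.3, contrast=0.3,
                  saturation=0.3),                     # colour change
    T.RandomCrop((608, 608)),                          # cropping
    T.GaussianBlur(kernel_size=5, sigma=(0.1, 2.0)),   # Gaussian blur
    T.ToTensor(),
])

# Training setup mirroring the text (momentum 0.9, batch size 2, cross-entropy
# loss); the optimizer choice (SGD) is an assumption.
# optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
# criterion = torch.nn.CrossEntropyLoss()
```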
As a further improvement of the present technical solution, the image enhancement processing in S1 further includes:
after the image feature points are extracted, feature enhancement is repeatedly carried out through an algorithm and training of a deep convolutional neural network, so that the required target difference features are maximized, the recognition accuracy is improved, and a Norm normalization layer is added after a convolutional layer to improve the distinction between a main body and other parts;
wherein the deep convolutional layer neural network comprises 5 convolutional layers (conv), 3 pooling layers (pool), 2 LRN layers (norm), 2 random sampling layers (drop), 3 full connectivity layers (fc), and 1 softmax classification regression layer; the convolution layer (conv) and the pooling layer (pool) alternate, the pooling layer (pool) being max-pooling.
As a further improvement of the present technical solution, the convolution layer parameters are as follows: the blob shapes of conv1, conv2, conv3, conv4 and conv5 are [1, 96, 55, 55], [1, 256, 27, 27], [1, 384, 13, 13] and [1, 256, 13, 13], and the strides are 4, 2, 1 and 1, respectively;
the pooling layer parameters are: pool1: [1, 96, 27, 27], pool2: [1, 256, 13, 13], pool5: [1, 256, 6, 6];
the calculation formula of the convolution is:
$$x_j^l = \mathrm{ReLU}\Big(\sum_{i \in M_j} x_i^{l-1} * k_{ij}^l + b_j^l\Big) \quad (1);$$

in formula (1), $M_j$ is the set of input feature maps, $x_j^l$ is the j-th output of the current layer $l$, $k_{ij}^l$ is the convolution kernel that convolves the input feature map $x_i^{l-1}$, $b_j^l$ is the bias, and ReLU denotes the activation function;
the calculation formula of the output dimension of the convolution layer is as follows:
$$N_2 = (N_1 - F_1 + 2P)/S + 1 \quad (2);$$

in formula (2), the input picture size is $N_1 \times N_1$, the convolution kernel size is $F_1 \times F_1$, the stride is $S$, and $P$ is the number of padding pixels, i.e. the expansion width; the output picture size is $N_2 \times N_2$;
The output dimension calculation formula of pool pooling layer is as follows:
$$N_3 = (N_1 - F_2)/S + 1 \quad (3);$$

in formula (3), the kernel size of the pooling layer is $F_2$.
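As a quick check of formulas (2) and (3), a small Python helper is sketched below; the 227×227 crop size used in the example is an assumption (the stated 11×11 kernel with stride 4 then reproduces the 55×55 conv1 blob and the 27×27 pool1 blob).

```python
# Output-size helpers implementing formula (2) for convolution layers and
# formula (3) for pooling layers.
def conv_output_size(n1: int, f1: int, p: int, s: int) -> int:
    """N2 = (N1 - F1 + 2P) / S + 1."""
    return (n1 - f1 + 2 * p) // s + 1

def pool_output_size(n1: int, f2: int, s: int) -> int:
    """N3 = (N1 - F2) / S + 1."""
    return (n1 - f2) // s + 1

# Example consistent with the blob shapes listed above (assumed 227x227 crop):
assert conv_output_size(227, 11, 0, 4) == 55   # conv1: 11x11 kernel, stride 4
assert pool_output_size(55, 3, 2) == 27        # pool1: 3x3 max-pooling, stride 2
```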
As a further improvement of the present technical solution, in S2, the loss function set by the generator is:
$$L_G = \frac{1}{HW}\left(\left\|I_f - I_r\right\|_F^2 + \zeta\left\|\nabla I_f - \nabla I_v\right\|_F^2\right) \quad (4);$$

in formula (4), H and W denote the height and width of the input image, $I_f$, $I_r$ and $I_v$ denote the fused, infrared and visible light images, $\|\cdot\|_F$ denotes the matrix (Frobenius) norm, $\nabla$ denotes the gradient operator, and $\zeta$ is a positive parameter controlling the trade-off between the two terms;
the loss function set by the discriminator is as follows:
$$L_D = \mathbb{E}\left[\left(D_{\theta_D}(I_v) - b\right)^2\right] + \mathbb{E}\left[\left(D_{\theta_D}(I_f) - a\right)^2\right] \quad (5);$$

in formula (5), a and b denote the labels of the fused image $I_f$ and the visible light image $I_v$, respectively, and $D_{\theta_D}(I_f)$ and $D_{\theta_D}(I_v)$ are the classification results of the two images.
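A hedged PyTorch sketch of losses (4) and (5) is given below for single-channel images; the gradient operator is approximated with a Laplacian kernel, and the reduction and weighting choices are assumptions rather than the exact implementation of this disclosure.

```python
import torch
import torch.nn.functional as F

def _gradient(img):
    # Laplacian-style gradient operator (an assumption; any image gradient
    # fits the nabla operator described in formula (4)). img: (N, 1, H, W).
    kernel = torch.tensor([[0., 1., 0.], [1., -4., 1.], [0., 1., 0.]],
                          device=img.device).view(1, 1, 3, 3)
    return F.conv2d(img, kernel, padding=1)

def generator_loss(fused, infrared, visible, zeta=5.0):
    """Formula (4): intensity term plus zeta-weighted gradient term over HW."""
    h, w = fused.shape[-2:]
    intensity = torch.norm(fused - infrared, p='fro') ** 2
    grad_diff = torch.norm(_gradient(fused) - _gradient(visible), p='fro') ** 2
    return (intensity + zeta * grad_diff) / (h * w)

def discriminator_loss(d_fused, d_visible, a=0.0, b=1.0):
    """Formula (5): push D(I_v) toward label b and D(I_f) toward label a."""
    return torch.mean((d_visible - b) ** 2) + torch.mean((d_fused - a) ** 2)
```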
As a further improvement of the present technical solution, the S3 further includes:
the labeling categories include wearing a safety helmet, not wearing a safety helmet, wearing a reflective garment and not wearing a reflective garment;
performing HLV color transformation on the fused images, and splicing the images by adopting Mosaic data enhancement to serve as training samples; the learning rate is set to 0.001, the batch size to 16, and a gradient descent method is used to optimize the loss function; the model is evaluated with precision, recall and F1 score, computed from the categories calibrated for the model and the categories detected by the algorithm, which are divided into: true positives TP (True Positive), false positives FP (False Positive), true negatives TN (True Negative) and false negatives FN (False Negative);
the accuracy, recall and F1-score formulas are as follows:
$$P = \frac{TP}{TP + FP} \quad (6);$$

$$R = \frac{TP}{TP + FN} \quad (7);$$

$$F1 = \frac{2PR}{P + R} \quad (8);$$
in formula (8), P and R are the precision (Precision) and recall (Recall) calculated in formulas (6) and (7), respectively;
and testing the trained model, and inputting the feature vector corresponding to the fused image information into the target detection model to obtain a final recognition result.
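A minimal implementation of the evaluation metrics (6)-(8) from the per-class TP/FP/FN counts might look as follows; the example counts are made up for illustration.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    p = tp / (tp + fp) if tp + fp else 0.0       # formula (6)
    r = tp / (tp + fn) if tp + fn else 0.0       # formula (7)
    f1 = 2 * p * r / (p + r) if p + r else 0.0   # formula (8)
    return p, r, f1

# e.g. 90 correct helmet detections, 5 false alarms, 10 misses:
print(precision_recall_f1(90, 5, 10))  # -> (0.947..., 0.9, 0.923...)
```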
As a further improvement of the present technical solution, in S4, feature calculation is performed in a time dimension and a space dimension of the video data;
the first layer of the convolutional neural network consists of hard-coded convolution kernels, covering gray data, gradients in the x and y directions and optical flow in the x and y directions; the network comprises 3 convolution layers, 2 downsampling layers and 1 fully connected layer;
and 3DCNN is used in the video block with the fixed length, and a multi-resolution convolutional neural network is used for extracting video features.
As a further improvement of the present technical solution, the S4 further includes: unsupervised behavior recognition using an automatic encoder, where an AutoEncoder is used to learn a function $h_{W,b}(z)$ such that $h_{W,b}(z) \approx z$; an identity-like mapping is thus obtained, so that the output of the model is almost equal to the input;
expanding independent subspace analysis to three-dimensional video data, and modeling a video block by using an unsupervised learning algorithm; firstly, an ISA algorithm is used on a small input block, then a learned network and an input image of a larger block are convolved, and responses obtained in the convolution process are combined together to serve as input of a next layer; the resulting description method is applied to video data.
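A small PyTorch sketch of the auto-encoder objective described above (learn $h_{W,b}(z) \approx z$) is shown below; the layer sizes and optimizer are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, in_dim=4096, code_dim=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, code_dim), nn.ReLU())
        self.decoder = nn.Linear(code_dim, in_dim)

    def forward(self, z):
        return self.decoder(self.encoder(z))   # h_{W,b}(z)

model = AutoEncoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
z = torch.randn(8, 4096)                       # stand-in for flattened video features
loss = nn.functional.mse_loss(model(z), z)     # drive the output toward the input
loss.backward()
optimizer.step()
```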
As a further improvement of the present technical solution, the S5 further includes: the static frame data stream uses single frame data, the dynamic data stream between frames uses optical flow data, and each data stream uses a deep convolutional neural network for feature extraction.
As a further improvement of the present technical solution, the S6 further includes:
performing posture estimation on the human body in the fused video by using the 3DCNN network structure to obtain the skeleton points of the human body; a plurality of key skeleton points of the human body are output in real time through the 3DCNN network structure; the coordinates of the skeleton points of the plurality of parts in the image are recorded as $(x_i, y_i)$, where the subscript i denotes the joint point of the i-th part;
using $D_{body}$ to represent the length of the human torso, where $x_1, x_8, x_{11}, y_1, y_8, y_{11}$ denote the coordinates of the neck and the left and right waist skeleton points, respectively; the feature points obtained from the fused image through the 3DCNN are input into an SVM network for classification into the unsafe behaviors of falling, climbing or pushing, and the final recognition result is obtained.
The third object of the present invention is to provide an abnormal behavior recognition platform device, which includes a processor, a memory, and a computer program stored in the memory and running on the processor, wherein the processor is used for implementing the steps of the abnormal behavior recognition method of the infrared-visible light image fusion when executing the computer program.
A fourth object of the present invention is to provide a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the above-described abnormal behavior recognition method of infrared-visible light image fusion.
Compared with the prior art, the invention has the beneficial effects that:
1. In the abnormal behavior recognition method for infrared-visible light image fusion, for image enhancement the CaffeNet network model is fine-tuned and the preprocessing step of traditional image recognition algorithms is removed; parameters are greatly reduced through the sparse connections and weight sharing of the convolution layers, which improves training speed and shortens training time; the algorithm is highly extensible, and the accuracy of image recognition can be improved by increasing the categories and number of sample pictures and by further optimizing the fine-tuned network structure model, so as to meet the accuracy requirements of image content recognition and extraction under low illumination;
2. In the abnormal behavior recognition method based on infrared-visible light image fusion, the deep network can learn features from the data without supervision; when there are enough training samples, the features learned through the deep network often carry certain semantic information and are better suited to recognizing targets and behaviors; the method adapts well to the data and, in particular, achieves better results when little calibrated data is available.
Drawings
FIG. 1 is an exemplary overall process flow diagram of the present invention;
FIG. 2 is a diagram of an exemplary ReLU function in the invention;
FIG. 3 is a schematic diagram of an exemplary deep convolutional neural network of the present invention;
FIG. 4 is a graph of exemplary test results in accordance with the present invention;
FIG. 5 is a diagram of an exemplary 3DCNN architecture in the present invention;
FIG. 6 is a block diagram of an exemplary multi-resolution convolutional neural network of the present invention;
FIG. 7 is a block diagram of an exemplary back propagation algorithm in accordance with the present invention;
FIG. 8 is a diagram of an exemplary ISA-3D architecture in accordance with the present invention;
FIG. 9 is a schematic diagram of an exemplary acquisition of skeletal points of a human body in accordance with the present invention;
FIG. 10 is a table of results of an exemplary automatic encoder of the present invention on the KTH, UCF Sports and Hollywood2 databases;
fig. 11 is a block diagram of an exemplary electronic computer platform according to the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Example 1
As shown in fig. 1 to 10, the present embodiment provides an abnormal behavior recognition method for infrared-visible light image fusion, which includes the following steps:
s1, carrying out image enhancement processing on infrared-visible light images;
s2, inputting the visible light and infrared images after image enhancement into the generator of a Fusion-GAN network, wherein the generator of the Fusion-GAN captures the data distribution and the discriminator estimates the probability that a sample comes from the training data rather than from the generator; the generator and the discriminator are then trained adversarially, and the discriminator of the Fusion-GAN takes the fused image and the visible light image as input in order to distinguish them; the convolutions of the generator and the discriminator are replaced by depthwise separable convolutions and processed with a MobileNet-v3 architecture, which reduces the amount of calculation, and the fused image is output; the output fused image is input into the discriminator to independently adjust the fused image information and obtain a result; during the adversarial learning of the generator and the discriminator, the fused image is continuously optimized, and after the loss function reaches balance, the image with the best effect is retained;
s3, classifying and labeling the targets in the fused images, carrying out normalization processing according to the category coordinate information, inputting them together with the fused images into a YOLOv5 network, performing HLV color transformation on the fused images, and splicing the images with Mosaic data enhancement as training samples; an improved feature pyramid model named AF-FPN is proposed, in which an adaptive attention module (AAM) and a feature enhancement module (FEM) reduce information loss and enhance the representation capability during feature map generation, so that the detection performance of the YOLOv5 network on multi-scale targets is improved while real-time detection is maintained; a target detection model is constructed, and the feature vectors corresponding to the fused image information are input into the target detection model to obtain a recognition result;
s4, after the Fusion of the infrared and visible light images of the improved Fusion-GAN network is completed, inputting an infrared and visible light Fusion video stream into a 3D neural network, and performing feature calculation on the time dimension and the space dimension of video data;
s5, dividing the input video into two independent data streams: a low resolution data stream and an original resolution data stream; the two data streams alternately comprise a convolution layer, a regular layer and an extraction layer, and the two data streams are finally combined into two full-connection layers for subsequent feature recognition;
and S6, performing feature extraction on the related data of the human body joint point by using a 3D convolutional neural network, and detecting abnormal behaviors according to the posture information obtained by extracting the human body skeleton and the target position information obtained by visual angle transformation.
In this embodiment, the multi-feature infrared-visible multi-source image enhancement process is as follows:
After extracting the image feature points, feature enhancement is repeatedly performed through the algorithm and training of a deep convolutional neural network. Before model training, this embodiment uses Python crawler technology to collect 993 pictures of 10 classes, which are divided into a test set of 200 pictures and a training set of 793 pictures. The convolutional neural network can take images directly as input without complex preprocessing operations; because of hardware limitations, this embodiment only unifies the resolution of the images, transforming them to 256×256, randomly extracts 20 images from each of the 10 classes as the test set, and places the remaining images in the training set. The mean-computation file provided by Caffe is used to subtract the mean from the images before training; this calculation reduces the similarity between image data and thus greatly improves training accuracy and speed.
The present invention, for image enhancement: in the convolutional layers, the size of the convolution kernel affects how the image features are abstracted. In general, a larger convolution kernel gives a better abstraction effect but requires more training parameters, whereas several stacked smaller kernels extract finer features with fewer parameters at the cost of needing more layers to achieve the same effect. In the structure of this embodiment, the first convolution layer uses an 11×11 kernel; although such a large kernel achieves a good abstraction effect, the processing is relatively coarse, so a Norm normalization layer is added after Conv to improve the distinction between the main body and the other parts.
Typically, the convolution layer and the ReLU layer occur in pairs. The expression of the canonical ReLU activation function is y = max(0, x): when the input x > 0, the output is x itself; when the input is less than or equal to 0, the output is 0. In convolutional neural networks it is customary to replace earlier activation functions such as tanh and sigmoid with the ReLU excitation function. As shown in fig. 2, the derivative of the ReLU function is constant for x > 0, whereas the derivatives of the tanh and sigmoid functions are not; the ReLU function therefore avoids the shrinking derivatives of tanh and sigmoid as they approach their saturation ends, which slow convergence through BP back-propagation of the error when training the neural network. ReLU has the advantages of fast convergence and simple gradient computation, exhibits sparsity after training, reduces data redundancy and enhances the expression of region-specific features.
The pooling layer is also called a spatial downsampling layer, and in the convolutional neural network, the pooling layer generally obtains new features after integrating feature points in a small neighborhood by using pooling after image convolution after the convolutional layer. Typically, convolution and pooling exist in the form of Conv-Pool, reducing the redundancy of information caused after convolution. The pool layer is also called a downsampling layer, so that the purpose of reducing the dimension can be achieved, the dimension of the feature vector output by the previous convolution layer is reduced, and the overfitting can be reduced.
This embodiment adopts max-pooling to reduce image noise and to mitigate the overfitting caused by the convolution output being too sensitive to input errors.
The max-pooling adopted in this embodiment first ensures that the position and rotation of a feature do not matter: a valid feature obtained after convolution can be extracted regardless of where it appears, which is a desirable property. In addition, max-pooling greatly reduces the number of model parameters in this embodiment, and for the norm layer following the pool layer the number of neurons is greatly reduced.
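The effect of the ReLU activation and 2×2 max-pooling described above can be illustrated with a few lines of numpy; the feature-map values are arbitrary.

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)          # y = max(0, x)

def max_pool_2x2(x):
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

feature_map = np.array([[-1.0,  2.0,  0.5, -3.0],
                        [ 4.0, -0.5,  1.0,  2.0],
                        [-2.0,  3.0, -1.0,  0.0],
                        [ 1.0,  0.0,  2.5, -4.0]])
print(max_pool_2x2(relu(feature_map)))   # [[4.  2. ] [3.  2.5]]
```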
The invention fine-tunes the CaffeNet network model and generates an optimal recognition network model for the data set. Some advantages of the deep convolutional neural network can be summarized as follows: the preprocessing step of traditional image recognition algorithms is eliminated, parameters are greatly reduced through the sparse connections and weight sharing of the convolution layers, training speed is improved and training time is shortened. The drawback of this embodiment is that the hardware environment is limited and the samples are few, but the algorithm is highly extensible; the accuracy of image recognition can be improved by increasing the categories and number of sample pictures and by further optimizing the fine-tuned network structure model of this embodiment, so as to meet the accuracy requirements of image content recognition and extraction under low illumination.
Further, the algorithm and training based on the deep convolutional neural network are as follows:
as shown in fig. 3, the deep convolutional layer neural network in this embodiment comprises 5 convolutional layers (conv), 3 pooling layers (pool), 2 LRN layers (norm), 2 random sampling layers (drop), 3 full connectivity layers (fc) and 1 softmax classification regression layer; the convolution layer (conv) and the pooling layer (pool) alternate, the pooling layer (pool) being max-pooling.
Wherein the convolution layer parameters are respectively: the blob types of conv1, conv2, conv3, conv4 and conv5 are respectively [1, 96, 55, 55], [1, 256, 27, 27], [1, 384, 13, 13] and [1, 256, 13, 13], and the steps are respectively 4, 2, 1 and 1;
the pool layer parameters were: pool1: [1, 96, 27, 27], pool2: [1, 256, 13, 13], pool5: [1, 256,6,6]; the calculation formula of the convolution is as follows:
$$x_j^l = \mathrm{ReLU}\Big(\sum_{i \in M_j} x_i^{l-1} * k_{ij}^l + b_j^l\Big) \quad (1);$$

in formula (1), $M_j$ is the set of input feature maps, $x_j^l$ is the j-th output of the current layer $l$, $k_{ij}^l$ is the convolution kernel that convolves the input feature map $x_i^{l-1}$, $b_j^l$ is the bias, and ReLU denotes the activation function;
the calculation formula of the output dimension of the convolution layer is as follows:
$$N_2 = (N_1 - F_1 + 2P)/S + 1 \quad (2);$$

in formula (2), the input picture size is $N_1 \times N_1$, the convolution kernel size is $F_1 \times F_1$, the stride is $S$, and $P$ is the number of padding pixels, i.e. the expansion width; the output picture size is $N_2 \times N_2$;
The output dimension calculation formula of pool pooling layer is as follows:
$$N_3 = (N_1 - F_2)/S + 1 \quad (3);$$

in formula (3), the kernel size of the pooling layer is $F_2$.
In this embodiment, the specific network parameters are set for our own data set. Fig. 4 shows 1000 iterations; every 50 iterations, the trained network is tested on the test set and the loss value and accuracy are output. Every 200 iterations, a snapshot of the current state is taken. After modifying the Batch-Size several times and retraining, the final convergence accuracy is optimal when the Batch-Size is set to 50, and the average recognition rate of the model on the images reaches its highest value of 92.50%.
Analysis shows that too small a Batch-Size causes excessive oscillation of the recognition rate. Adjusting the Batch-Size can improve recognition accuracy because, with a smaller data set, the descent direction becomes more accurate, training oscillation is reduced, CPU utilization is improved, and large matrix multiplications are computed more efficiently. Because the final convergence accuracy falls into different local extrema, the optimum in final convergence accuracy is reached once the batch_size increases to a certain value.
In this embodiment, the abnormal behavior recognition process of the infrared-visible light image fusion is as follows:
Step 1, inputting the visible light and infrared images after image enhancement into the generator of the Fusion-GAN network, replacing the convolutions of the generator and the discriminator with depthwise separable convolutions and processing them with a MobileNet-v3 architecture, which reduces the amount of calculation, and outputting the fused image; the output fused image is then input into the discriminator to independently adjust the fused image information and obtain a result;
wherein, the loss function of the setting generator is:
$$L_G = \frac{1}{HW}\left(\left\|I_f - I_r\right\|_F^2 + \zeta\left\|\nabla I_f - \nabla I_v\right\|_F^2\right) \quad (4);$$

in formula (4), H and W denote the height and width of the input image, $I_f$, $I_r$ and $I_v$ denote the fused, infrared and visible light images, $\|\cdot\|_F$ denotes the matrix (Frobenius) norm, $\nabla$ denotes the gradient operator, and $\zeta$ is a positive parameter controlling the trade-off between the two terms;
the loss function set by the discriminator is as follows:

$$L_D = \mathbb{E}\left[\left(D_{\theta_D}(I_v) - b\right)^2\right] + \mathbb{E}\left[\left(D_{\theta_D}(I_f) - a\right)^2\right] \quad (5);$$

in formula (5), a and b denote the labels of the fused image $I_f$ and the visible light image $I_v$, respectively, and $D_{\theta_D}(I_f)$ and $D_{\theta_D}(I_v)$ are the classification results of the two images;
and in the process of countermeasure learning of the generator and the discriminator, the fusion image is continuously optimized, and after the loss function reaches balance, the image with the best effect is reserved.
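Step 1 replaces the standard convolutions of the generator and discriminator with depthwise separable convolutions in the spirit of MobileNet-v3; a minimal PyTorch sketch of one such block is given below, with channel counts, kernel size and activation as assumptions.

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=3):
        super().__init__()
        # depthwise: one filter per input channel (groups=in_ch)
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size,
                                   padding=kernel_size // 2, groups=in_ch)
        # pointwise: 1x1 convolution mixes channels, keeping the cost low
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        self.act = nn.Hardswish()    # activation used in MobileNet-v3

    def forward(self, x):
        return self.act(self.pointwise(self.depthwise(x)))
```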
Step 2, annotating the fused images with the LabelImg annotation software; the annotation categories include wearing a safety helmet, not wearing a safety helmet, wearing a reflective garment, not wearing a reflective garment, etc., and the annotations are stored in xml format; the category coordinate information in the xml files is normalized and stored in txt files holding the category coordinates; the txt files and the fused images are input into the YOLOv5 network, HLV color transformation is performed on the fused images, and Mosaic data enhancement is used to splice the images as training samples; an improved feature pyramid model named AF-FPN is proposed, in which an adaptive attention module (AAM) and a feature enhancement module (FEM) reduce information loss and enhance the representation capability of the feature pyramid during feature map generation, so that the detection performance of the YOLOv5 network on multi-scale targets is improved while real-time detection is maintained, and the target detection model is constructed;
the learning rate is set to 0.001, the batch size to 16, and a gradient descent method is used to optimize the loss function; the model is evaluated with precision, recall and F1 score, computed from the categories calibrated for the model and the categories detected by the algorithm, which are divided into the following 4 categories: true positives (True Positive, TP), false positives (False Positive, FP), true negatives (True Negative, TN) and false negatives (False Negative, FN); the Precision, Recall and F1-Score formulas are as follows:
$$P = \frac{TP}{TP + FP} \quad (6);$$

$$R = \frac{TP}{TP + FN} \quad (7);$$

$$F1 = \frac{2PR}{P + R} \quad (8);$$
in formula (8), P and R are the precision (Precision) and recall (Recall) calculated in formulas (6) and (7), respectively;
and testing the trained model, and inputting the feature vector corresponding to the fused image information into the target detection model to obtain a final recognition result.
Step 3, after the Fusion-GAN network has fused the infrared and visible light images, the infrared and visible light fusion video stream is input into the 3D neural network; as shown in fig. 5, 3DCNN extends the traditional CNN with temporal information, and feature calculation is performed in the time and space dimensions of the video data;
the first layer of the convolutional neural network consists of hard-coded convolution kernels covering gray data, gradients in the x and y directions and optical flow in the x and y directions; the network comprises 3 convolution layers, 2 downsampling layers and 1 fully connected layer. 3DCNN is applied within fixed-length video blocks, and a multi-resolution convolutional neural network is used to extract video features. The input video is split into two independent data streams: a low-resolution data stream and an original-resolution data stream; both streams alternate convolution layers, regularization layers and extraction layers, and the two streams are finally merged into two fully connected layers for subsequent feature recognition, as shown in the structure diagram of fig. 6. A convolutional neural network with two data streams is likewise used for video behavior recognition: the video is separated into a static frame data stream and an inter-frame dynamic data stream, where the static frame stream uses single-frame data, the inter-frame dynamic stream uses optical flow data, and each stream uses a deep convolutional neural network for feature extraction. Finally, an SVM is used to recognize the action from the obtained features. Only the data related to the joint points of the human posture are used for feature extraction by the deep convolutional network; a statistical method then converts the whole video into one feature vector, and an SVM is used to train and recognize the final classification model.
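A hedged sketch of the two-stream idea in this step (a low-resolution stream plus an original-resolution stream, merged into fully connected layers) is shown below; the layer widths and the downsampling factor are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def stream():
    # alternating convolution / normalisation / pooling, as described above
    return nn.Sequential(
        nn.Conv2d(3, 32, 5, stride=2), nn.BatchNorm2d(32), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(32, 64, 3), nn.BatchNorm2d(64), nn.ReLU(), nn.MaxPool2d(2),
        nn.AdaptiveAvgPool2d(1), nn.Flatten())

class TwoStreamNet(nn.Module):
    def __init__(self, num_classes=4):
        super().__init__()
        self.low_res = stream()      # downsampled copy of the frame
        self.full_res = stream()     # original-resolution frame
        self.fc = nn.Sequential(nn.Linear(128, 64), nn.ReLU(),
                                nn.Linear(64, num_classes))   # two FC layers

    def forward(self, frame):
        low = F.interpolate(frame, scale_factor=0.5, mode='bilinear',
                            align_corners=False)
        feats = torch.cat([self.low_res(low), self.full_res(frame)], dim=1)
        return self.fc(feats)
```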
Performing feature extraction on the related data of the human body joint point by using a 3D convolutional neural network, and detecting abnormal behaviors according to gesture information obtained by extracting a human body skeleton and target position information obtained by visual angle transformation;
the design 3DCNN consists of 8 convolutional layers, 5 pooled layers and 2 fully-connected layers, including a softmax function, the input size of the network is 3 x 16 x 112, the size of the convolutional kernel is set to 3 x 3, the step length is 1 multiplied by 1, the input fusion video stream is subjected to convolution calculation, after calculation, the characteristic image is pooled, the size of a pooling kernel is 2 multiplied by 2, the step length is 2 multiplied by 2, and 4098 output is performed in total. Setting the training learning rate as 0.001, training times as 100 batches, and stopping training when the loss function is minimum to obtain the optimal model.
And estimating the posture of the human body in the fused video by using a 3DCNN network structure to obtain skeleton points of the human body. As shown in fig. 9, 18 key skeletal points of eyes, arms, knees, etc. of a human body are output in real time through a 3DCNN network structure.
The coordinates of the skeleton points of the 18 parts in the image are recorded as $(x_i, y_i)$, where the subscript i denotes the joint point of the i-th part; $D_{body}$ is used to represent the length of the human torso, where $x_1, x_8, x_{11}, y_1, y_8, y_{11}$ denote the coordinates of the neck and the left and right waist skeleton points, respectively. The feature points obtained from the fused image through the 3DCNN are input into an SVM network for classification into unsafe behaviors such as falling, climbing and pushing, and the final recognition result is obtained.
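The final classification step can be sketched as follows: the 18 skeleton points are normalised by the torso length $D_{body}$ (computed here from the neck and the left/right waist points, assumed to be indices 1, 8 and 11 as in the text) and fed to an SVM. The exact index convention and feature layout are assumptions.

```python
import numpy as np
from sklearn.svm import SVC

def torso_length(pts):                           # pts: (18, 2) array of (x_i, y_i)
    neck, l_waist, r_waist = pts[1], pts[8], pts[11]
    return np.linalg.norm(neck - (l_waist + r_waist) / 2.0)   # D_body

def skeleton_feature(pts):
    d_body = torso_length(pts)
    return ((pts - pts[1]) / d_body).ravel()     # neck-centred, scale-normalised

# Features and labels come from the 3DCNN pose estimates of the fused video.
clf = SVC(kernel='rbf')
# clf.fit(train_features, train_labels)          # labels: falling / climbing / pushing
# pred = clf.predict([skeleton_feature(detected_points)])
```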
In addition, in this step an automatic encoder can also be used for unsupervised behavior recognition. The automatic encoder is an unsupervised learning algorithm that uses a back-propagation algorithm to make the target value equal to the input value, as shown in fig. 7. An AutoEncoder learns a function $h_{W,b}(z)$ such that $h_{W,b}(z) \approx z$; an identity-like mapping is obtained so that the output of the model is almost equal to the input. Independent subspace analysis is extended to three-dimensional video data, and an unsupervised learning algorithm is used to model the video blocks. The method first applies the ISA algorithm to small input blocks, then convolves the learned network with the input image of larger blocks, and combines the responses obtained during convolution as the input of the next layer, as shown in fig. 8. The resulting description method was applied to video data and tested on three well-known behavior recognition libraries; table 1 in fig. 10 shows the results on the KTH, UCF Sports and Hollywood2 databases. The ISA algorithm achieves better performance on the Hollywood2 data set with its complex environment, nearly 10% higher than the spatio-temporal interest point algorithm.
In addition, the invention recognizes abnormal human behaviors: for the problem of behavior recognition of target persons under low-illumination conditions, infrared and visible light image fusion is combined with behavior recognition; a 3D convolutional neural network extracts features from the data related to human joint points, abnormal behaviors are detected from the posture information obtained by extracting the human skeleton and the target position information obtained by view-angle transformation, and a human motion feature model library of violation behaviors is formed; after the model library is established, any action in the on-site construction video that matches the model library is a violation action. Based on infrared-visible light image fusion, the following violation detection is to be realized under low illumination: climbing detection, personnel identification, area intrusion detection, safety belt detection, insulator detection, safety helmet detection, etc. The target value of recognition precision is ≥ 95%, the recall target value is ≥ 90%, and the speed (FPS) target value is 30. Since the deep network can learn features from data without supervision, in a way that matches the mechanism of human perception, the features learned by the deep network often carry certain semantic information when there are enough training samples and are better suited to recognizing targets and behaviors. The method adapts well to the data and, in particular, achieves better results when little calibrated data is available; convolutional neural networks have achieved excellent results in image recognition.
As shown in fig. 11, the present embodiment also provides an abnormal behavior recognition platform apparatus, which includes a processor, a memory, and a computer program stored in the memory and running on the processor.
The processor comprises one or more than one processing core, the processor is connected with the memory through a bus, the memory is used for storing program instructions, and the processor realizes the steps of the abnormal behavior identification method for infrared-visible light image fusion when executing the program instructions in the memory.
Alternatively, the memory may be implemented by any type or combination of volatile or nonvolatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk.
In addition, the invention also provides a computer readable storage medium, wherein the computer readable storage medium stores a computer program, and the computer program realizes the steps of the abnormal behavior identification method for infrared-visible light image fusion when being executed by a processor.
Optionally, the present invention also provides a computer program product containing instructions which, when run on a computer, cause the computer to perform the steps of the method for identifying abnormal behavior of infrared-visible light image fusion of the above aspects.
It will be appreciated by those of ordinary skill in the art that the processes for implementing all or part of the steps of the above embodiments may be implemented by hardware, or may be implemented by a program for instructing the relevant hardware, and the program may be stored in a computer readable storage medium, where the above storage medium may be a read-only memory, a magnetic disk or optical disk, etc.
The foregoing has shown and described the basic principles, principal features and advantages of the invention. It will be understood by those skilled in the art that the present invention is not limited to the above-described embodiments, and that the above-described embodiments and descriptions are only preferred embodiments of the present invention, and are not intended to limit the invention, and that various changes and modifications may be made therein without departing from the spirit and scope of the invention as claimed. The scope of the invention is defined by the appended claims and equivalents thereof.

Claims (10)

1. The abnormal behavior identification method for infrared-visible light image fusion is characterized by comprising the following steps of:
s1, carrying out image enhancement processing on infrared-visible light images;
s2, inputting the visible light and infrared images after image enhancement into the generator of a Fusion-GAN network, replacing the convolutions of the generator and the discriminator with depthwise separable convolutions and processing them with a MobileNet-v3 architecture, which reduces the amount of calculation, and outputting the fused image; inputting the output fused image into the discriminator to independently adjust the fused image information and obtain a result; during the adversarial learning of the generator and the discriminator, the fused image is continuously optimized, and after the loss function reaches balance, the image with the best effect is retained;
s3, classifying and labeling the targets in the fused images, carrying out normalization processing according to the category coordinate information, inputting them together with the fused images into a YOLOv5 network, performing HLV color transformation on the fused images, and splicing the images with Mosaic data enhancement to serve as training samples; an improved feature pyramid model named AF-FPN is used, in which the adaptive attention module and the feature enhancement module reduce information loss and enhance the representation capability of the feature pyramid during feature map generation, so that the detection performance of the YOLOv5 network on multi-scale targets is improved while real-time detection is maintained; a target detection model is constructed, and the feature vectors corresponding to the fused image information are input into the target detection model to obtain a recognition result;
s4, after the Fusion of the infrared and visible light images of the improved Fusion-GAN network is completed, inputting an infrared and visible light Fusion video stream into a 3D neural network, and performing feature calculation on the time dimension and the space dimension of video data;
s5, dividing the input video into two independent data streams: a low resolution data stream and an original resolution data stream; the two data streams alternately comprise a convolution layer, a regular layer and an extraction layer, and the two data streams are finally combined into two full-connection layers for subsequent feature recognition;
and S6, performing feature extraction on the related data of the human body joint point by using a 3D convolutional neural network, and detecting abnormal behaviors according to the posture information obtained by extracting the human body skeleton and the target position information obtained by visual angle transformation.
2. The method for identifying abnormal behavior of infrared-visible light image fusion according to claim 1, further comprising the step of creating a data set before the S1 infrared-visible light images are each subjected to image enhancement processing:
continuously acquiring the acquired image data through a camera, transmitting the acquired image data to a processing end, extracting frame by frame, and acquiring image information; performing simple translation, scaling, color change, clipping and Gaussian blur on an input image;
training the data by adopting an improved YOLOv5 method, setting the number of training epochs to 500 and the initial learning rate to 0.001; the training momentum is chosen as 0.9, and the training batch size (batch_size) is set to 2;
the loss function adopts a cross entropy function to promote the model to be converged more rapidly and accurately.
3. The abnormal behavior recognition method of infrared-visible light image fusion according to claim 1, wherein the image enhancement processing in S1 further comprises:
after the image feature points are extracted, feature enhancement is repeatedly carried out through an algorithm and training of a deep convolutional neural network, so that the required target difference features are maximized, the recognition accuracy is improved, and a Norm normalization layer is added after a convolutional layer to improve the distinction between a main body and other parts;
wherein the deep convolutional layer neural network comprises 5 convolutional layers, 3 pooling layers, 2 LRN layers, 2 random sampling layers, 3 fully connected layers, and 1 softmax classification regression layer; the convolution layer and the pooling layer alternate, and the pooling layer is max-pooling.
4. The method for identifying abnormal behavior of infrared-visible light image fusion according to claim 3, wherein the convolution layer parameters are respectively: the blob types of conv1, conv2, conv3, conv4 and conv5 are respectively [1, 96, 55, 55], [1, 256, 27, 27], [1, 384, 13, 13] and [1, 256, 13, 13], and the steps are respectively 4, 2, 1 and 1;
the pooling layer parameters are: pool1: [1, 96, 27, 27], pool2: [1, 256, 13, 13], pool5: [1, 256, 6, 6];
the calculation formula of the convolution is:
$$x_j^l = \mathrm{ReLU}\Big(\sum_{i \in M_j} x_i^{l-1} * k_{ij}^l + b_j^l\Big) \quad (1);$$

in formula (1), $M_j$ is the set of input feature maps, $x_j^l$ is the j-th output of the current layer $l$, $k_{ij}^l$ is the convolution kernel that convolves the input feature map $x_i^{l-1}$, $b_j^l$ is the bias, and ReLU denotes the activation function;
the calculation formula of the output dimension of the convolution layer is as follows:
$$N_2 = (N_1 - F_1 + 2P)/S + 1 \quad (2);$$

in formula (2), the input picture size is $N_1 \times N_1$, the convolution kernel size is $F_1 \times F_1$, the stride is $S$, and $P$ is the number of padding pixels, i.e. the expansion width; the output picture size is $N_2 \times N_2$;
the output dimension calculation formula of the pooling layer is as follows:
N_3 = (N_1 - F_2)/S + 1   (3);
in formula (3), the kernel size of the pooling layer is F_2.
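The output-size formulas (2) and (3) can be illustrated with a small helper; the example numbers are chosen to match the conv1/pool1 blob shapes recited above.

def conv_out_size(n1: int, f1: int, p: int, s: int) -> int:
    """Formula (2): N2 = (N1 - F1 + 2P) / S + 1."""
    return (n1 - f1 + 2 * p) // s + 1

def pool_out_size(n1: int, f2: int, s: int) -> int:
    """Formula (3): N3 = (N1 - F2) / S + 1."""
    return (n1 - f2) // s + 1

# Example: conv_out_size(227, 11, 0, 4) -> 55, pool_out_size(55, 3, 2) -> 27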
5. The method for identifying abnormal behavior of infrared-visible light image fusion according to claim 1, wherein in S2, the loss function set by the generator is:
L_G = (1/(HW)) ( ||I_v - I_r||_F^2 + ζ ||∇I_v - ∇I_f||_F^2 )   (4);
in the formula (4), H and W represent the height and width of the input image respectively, I_v, I_f and I_r denote the fused, visible light and infrared images, ||·||_F represents the matrix norm, ∇ represents the gradient operator, and ζ is a positive parameter controlling the trade-off between the two terms;
the loss function set by the discriminator is as follows:
L_D = (D_θD(I_v) - a)^2 + (D_θD(I_f) - b)^2   (5);
in the formula (5), a and b respectively represent the labels of the fused image I_v and the visible light image I_f, and D_θD(I_v) and D_θD(I_f) are the classification results of the discriminator for the two images.
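A hedged sketch of losses (4) and (5) follows; the exact loss in the patent is reconstructed only approximately here, assuming a FusionGAN-style content term (intensity difference plus a ζ-weighted gradient difference) and a least-squares discriminator term. Tensor names and the default ζ value are placeholders.

import torch

def gradient(img):
    # simple finite-difference gradient operator
    dx = img[..., :, 1:] - img[..., :, :-1]
    dy = img[..., 1:, :] - img[..., :-1, :]
    return dx, dy

def generator_content_loss(I_fused, I_ir, I_vis, zeta=1.2):
    # formula (4): intensity term against the infrared image, gradient term against the visible image
    h, w = I_fused.shape[-2:]
    intensity = torch.norm(I_fused - I_ir, p="fro") ** 2
    gx_f, gy_f = gradient(I_fused)
    gx_v, gy_v = gradient(I_vis)
    grad = torch.norm(gx_f - gx_v, p="fro") ** 2 + torch.norm(gy_f - gy_v, p="fro") ** 2
    return (intensity + zeta * grad) / (h * w)

def discriminator_loss(d_fused, d_vis, a=0.0, b=1.0):
    # formula (5): push the discriminator's score on the fused image towards label a
    # and its score on the visible image towards label b
    return torch.mean((d_fused - a) ** 2) + torch.mean((d_vis - b) ** 2)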
6. The abnormal behavior recognition method of infrared-visible light image fusion according to claim 1, wherein S3 further comprises:
the labeling categories comprise wearing a safety helmet, not wearing a safety helmet, wearing a reflective garment and not wearing a reflective garment;
performing HLV color transformation on the fused images, and splicing the images with Mosaic data enhancement to serve as training samples; setting the learning rate to 0.001 and the batch size to 16, and optimizing the loss function by a gradient descent method; the model is evaluated with accuracy, recall and F1 score, computed according to the calibrated category and the category detected by the algorithm, and the results are divided into: true positives TP, false positives FP, true negatives TN and false negatives FN;
the accuracy, recall and F1-score formulas are as follows:
Precision = TP/(TP + FP)   (6)
Recall = TP/(TP + FN)   (7)
F1 = 2 × P × R/(P + R)   (8)
in the formula (8), P and R are the precision Precision and the recall Recall calculated by the formulas (6) and (7) respectively;
and testing the trained model, and inputting the feature vector corresponding to the fused image information into the target detection model to obtain a final recognition result.
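Formulas (6)-(8) follow directly from the confusion-matrix counts; a straightforward implementation might look like the following (example numbers are illustrative only).

def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn) if tp + fn else 0.0

def f1_score(tp: int, fp: int, fn: int) -> float:
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if p + r else 0.0

# Example: f1_score(tp=90, fp=10, fn=20) ≈ 0.857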
7. The method for identifying abnormal behavior of infrared-visible light image fusion according to claim 1, wherein in S4, feature computation is performed in a time dimension and a space dimension of video data;
the first layer of the convolutional neural network is a hard-wired layer of fixed convolution kernels that produces grayscale data, gradients in the x and y directions and optical flow in the x and y directions; the network further comprises 3 convolutional layers, 2 downsampling layers and 1 fully connected layer;
and a 3DCNN is applied to fixed-length video blocks, and a multi-resolution convolutional neural network is used to extract video features.
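A minimal sketch of such a fixed-length 3D convolutional feature extractor is given below, assuming five hard-wired input channels (grayscale, x/y gradients, x/y optical flow); channel counts and the 16-frame clip length are assumptions.

import torch
import torch.nn as nn

class C3DFeatureNet(nn.Module):
    def __init__(self, hardwired_channels=5, num_features=128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(hardwired_channels, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),                                   # downsampling layer 1
            nn.Conv3d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),                                   # downsampling layer 2
            nn.Conv3d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        self.fc = nn.Linear(128, num_features)                 # single fully connected layer

    def forward(self, clip):                                   # clip: (B, 5, T, H, W)
        return self.fc(self.features(clip).flatten(1))

# Example: feats = C3DFeatureNet()(torch.randn(1, 5, 16, 112, 112))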
8. The method for identifying abnormal behavior according to claim 7, wherein S4 further comprises: performing unsupervised behavior recognition using an auto-encoder, using the AutoEncoder to learn a function h_{W,b}(z) such that h_{W,b}(z) ≈ z, i.e. an approximate identity function is obtained, so that the output of the model is almost equal to the input;
expanding independent subspace analysis (ISA) to three-dimensional video data, and modeling the video blocks with an unsupervised learning algorithm; first, the ISA algorithm is applied to small input blocks, then the learned network is convolved with larger input blocks, and the responses obtained by the convolution are combined as the input of the next layer; the resulting description method is applied to the video data.
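A small sketch of the auto-encoder stage described above (learning h_{W,b}(z) ≈ z) follows; the hidden size, the Adam optimizer and the MSE reconstruction loss are assumptions.

import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, input_dim=1024, hidden_dim=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        self.decoder = nn.Linear(hidden_dim, input_dim)

    def forward(self, z):
        return self.decoder(self.encoder(z))    # h_{W,b}(z) ≈ z after training

model = AutoEncoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
z = torch.randn(8, 1024)                        # flattened video blocks
loss = nn.functional.mse_loss(model(z), z)      # drive the output towards the input
loss.backward()
optimizer.step()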
9. The method for identifying abnormal behavior of infrared-visible light image fusion according to claim 1, wherein S5 further comprises: the static frame data stream uses single frame data, the dynamic data stream between frames uses optical flow data, and each data stream uses a deep convolutional neural network for feature extraction.
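For illustration, the two data streams of this claim could be prepared as follows, assuming OpenCV's Farneback dense optical flow; the frame arrays and parameter values are placeholders.

import cv2
import numpy as np

def build_stream_inputs(prev_frame: np.ndarray, frame: np.ndarray):
    # static stream: the single grayscale frame itself
    static_input = frame.astype(np.float32) / 255.0
    # dynamic stream: dense optical flow between consecutive frames (2 channels: dx, dy)
    flow = cv2.calcOpticalFlowFarneback(prev_frame, frame, None,
                                        pyr_scale=0.5, levels=3, winsize=15,
                                        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
    return static_input, flow

# Each returned array would then be fed to its own deep convolutional network
# for feature extraction, as described in the claim.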
10. The method for identifying abnormal behavior of infrared-visible light image fusion according to claim 1, wherein S6 further comprises:
performing posture estimation on the human body in the fused video by using a 3DCNN network structure to obtain the skeleton points of the human body; outputting a plurality of key skeleton points of the human body in real time through the 3DCNN network structure; the coordinates of the skeleton points of the plurality of parts in the image are respectively recorded as (x_i, y_i), where the subscript i denotes the joint point of the i-th part;
using D_body to represent the length of the human torso, where (x_1, y_1), (x_8, y_8) and (x_11, y_11) respectively represent the coordinates of the neck, left waist and right waist skeleton points; and inputting the feature points obtained from the fused image through the 3DCNN into an SVM network for classification into falling, climbing or charging unsafe behaviors, so as to obtain the final recognition result.
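A hedged sketch of the D_body computation and the SVM classification step follows; the keypoint indexing (1 = neck, 8/11 = left/right waist) mirrors the coordinates named in the claim, while the feature layout and SVM kernel are assumptions.

import numpy as np
from sklearn.svm import SVC

def torso_length(kpts: np.ndarray) -> float:
    """kpts: (N, 2) array of (x_i, y_i) skeleton coordinates."""
    neck = kpts[1]
    waist_mid = (kpts[8] + kpts[11]) / 2.0           # midpoint of the left and right waist points
    return float(np.linalg.norm(neck - waist_mid))   # D_body

# features: per-frame vectors built from the 3DCNN outputs plus D_body; labels:
# 0 = falling, 1 = climbing, 2 = charging (the unsafe behaviors named in the claim)
clf = SVC(kernel="rbf")
# clf.fit(train_features, train_labels)
# prediction = clf.predict(test_features)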
CN202310211094.6A 2023-03-07 2023-03-07 Abnormal behavior identification method for infrared-visible light image fusion Pending CN116343330A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310211094.6A CN116343330A (en) 2023-03-07 2023-03-07 Abnormal behavior identification method for infrared-visible light image fusion

Publications (1)

Publication Number Publication Date
CN116343330A true CN116343330A (en) 2023-06-27

Family

ID=86878303

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116634110A (en) * 2023-07-24 2023-08-22 清华大学 Night intelligent culture monitoring system based on semantic coding and decoding
CN116863286A (en) * 2023-07-24 2023-10-10 中国海洋大学 Double-flow target detection method and model building method thereof
CN116634110B (en) * 2023-07-24 2023-10-13 清华大学 Night intelligent culture monitoring system based on semantic coding and decoding
CN116863286B (en) * 2023-07-24 2024-02-02 中国海洋大学 Double-flow target detection method and model building method thereof
CN116881830A (en) * 2023-07-26 2023-10-13 中国信息通信研究院 Self-adaptive detection method and system based on artificial intelligence
CN116704267A (en) * 2023-08-01 2023-09-05 成都斐正能达科技有限责任公司 Deep learning 3D printing defect detection method based on improved YOLOX algorithm
CN116704267B (en) * 2023-08-01 2023-10-27 成都斐正能达科技有限责任公司 Deep learning 3D printing defect detection method based on improved YOLOX algorithm
CN116994295A (en) * 2023-09-27 2023-11-03 华侨大学 Wild animal category identification method based on gray sample self-adaptive selection gate
CN116994295B (en) * 2023-09-27 2024-02-02 华侨大学 Wild animal category identification method based on gray sample self-adaptive selection gate
CN117893413A (en) * 2024-03-15 2024-04-16 博创联动科技股份有限公司 Vehicle-mounted terminal man-machine interaction method based on image enhancement

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination