CN115761888A

CN115761888A - Tower crane operator abnormal behavior detection method based on NL-C3D model

Info

Publication number: CN115761888A
Application number: CN202211462437.8A
Authority: CN
Inventors: 邓珍荣; 李志宏; 蓝如师; 杨睿
Original assignee: Guilin University of Electronic Technology
Current assignee: Guilin University of Electronic Technology
Priority date: 2022-11-22
Filing date: 2022-11-22
Publication date: 2023-03-07

Abstract

The invention discloses a tower crane operator abnormal behavior detection method based on an NL-C3D model, comprising the following steps: 1) collecting a monitoring video data set of tower crane operators during operation; 2) splitting the video data into image frames and then cropping the image frames to a uniform size; 3) fusing a non-local module into a C3D network to obtain an NL-C3D network model; 4) feeding the image frame data set from step 2) into the NL-C3D network model in the order of training set, verification set and test set for training and verification, and then obtaining the final result with a softmax classifier. The method improves detection accuracy and allows finer-grained detection.

Description

Tower crane operator abnormal behavior detection method based on NL-C3D model

Technical Field

The invention belongs to the field of behavior identification in computer vision, and relates to an abnormal behavior identification and detection method, in particular to a tower crane operator abnormal behavior detection method based on an NL-C3D model.

Background

With the rapid development of video monitoring technology and its wide application across industries, the volume of monitoring video data has grown rapidly and abnormal behavior detection has become an important research task. In particular, abnormal behavior detection has become a difficult research problem in the many fields that involve security and protection.

Conventional abnormal behavior detection methods include those based on hand-crafted features, which record characteristic motion patterns using low-level trajectory features, the Histogram of Optical Flow (HOF), the Histogram of Oriented Gradients (HOG) and the like. Such methods are not suited to the complex scenes of surveillance video, because hand-crafted features are not sufficient to characterize behavior. New methods based on deep learning also continue to emerge. Some are based on recurrent neural networks, such as the high-precision analysis model proposed by Yeung et al., which is built on an RNN and trained by reinforcement learning, and the long short-term memory (LSTM) network proposed by Escorcia et al.; these are very inefficient on long videos, and the basic features they extract do not support joint training. There are also two-stage detection methods, which pre-select candidate regions from the video and then classify them. These methods suffer from time-consuming and inefficient region pre-selection, and a staged approach may find only a locally optimal solution, with no guarantee of a globally optimal one. When processing video, these networks focus mainly on the current frame and analyze the preceding and following frames insufficiently, even though those frames provide the context information that is essential for following the continuous movement of people in the video.

Disclosure of Invention

The invention aims to provide a tower crane operator abnormal behavior detection method based on an NL-C3D model, aiming at the problems in the prior art. The method strengthens the context modeling capability of the conventional convolutional neural network by optimizing the network structure and integrating global features, and has better performance in behavior identification and detection in videos.

The technical scheme for realizing the purpose of the invention is as follows:

a tower crane operator abnormal behavior detection method based on an NL-C3D model comprises the following steps:

1) Collecting a monitoring video data set related to the operation process of tower crane operators, and dividing the video data set into a training set, a verification set and a test set;

2) Splitting the video data into image frames, then cropping the image frames so that all images have the same size, and keeping a corresponding number of image frame samples; specifically, the imported video frames are adjusted to shape [10, 16, 112, 112, 3], where 16 is frame_length, meaning that each training sample consists of 16 frames, 112 is the crop_size of the images, meaning that the video frames are cropped to 112 × 112 pixels, and 3 is the number of input channels;

3) Fusing a non-local module in the C3D network to obtain an NL-C3D network model;

4) Feeding the image frame data set from step 2) into the NL-C3D network model in the order of training set, verification set and test set for training and verification, and then obtaining the final result with a softmax classifier.

The video data set in step 1) is collected as follows: operation videos are recorded by a camera at a resolution greater than 320 × 240 pixels and a frame rate greater than 25 frames per second; one frame is then extracted from the data set every four frames to obtain the frame images; for videos from which this sampling interval cannot produce the sixteen frames required as the temporal width of the network input, the sampling step is reduced manually until at least sixteen frames are obtained; and the data set is divided into a training set, a verification set and a test set according to the ratio of 6.
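
The sampling rule described above can be illustrated with a minimal sketch. It assumes that "one frame every four frames" means keeping every fourth frame index, and that the manual step reduction can be modelled by lowering the step one unit at a time; the function name `sample_frame_indices` and its parameters are hypothetical and do not come from the patent.

```python
def sample_frame_indices(total_frames, step=4, min_frames=16):
    """Return the indices of the frames kept when sampling one frame every `step` frames.

    If the video is too short to yield `min_frames` frames at the requested
    step, the step is reduced by one at a time until enough frames are kept
    or the step reaches 1 (an assumed reading of the manual adjustment).
    """
    if total_frames <= 0:
        return []
    while step > 1 and len(range(0, total_frames, step)) < min_frames:
        step -= 1
    return list(range(0, total_frames, step))


# A 100-frame video sampled every 4 frames keeps 25 frames.
long_clip = sample_frame_indices(100)
# A 40-frame video would keep only 10 frames at step 4, so the step shrinks to 2.
short_clip = sample_frame_indices(40)
print(len(long_clip), len(short_clip))
```

A video shorter than sixteen frames cannot satisfy the requirement at any step; in this sketch it simply returns every frame.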

Cropping the image frames in step 2) and keeping a corresponding number of image frame samples specifically comprises the following: during input processing, in order to improve the robustness and accuracy of the model, the image frames are first randomly cropped to 112 × 112 pixels; the starting position of the frames to be fed to the network is then determined within the extracted video frames, and a sliding window selects sixteen frames from that position as the network input, so that the selected clip has size 3 × 16 × 112 × 112; at the same time, data augmentation is performed by random flipping and by mean subtraction along the three RGB channels of the image frames; finally, the images are annotated with the image annotation software LabelImg, the abnormal behaviors being labeled "calling", "smoking", "playing with a mobile phone" and "dozing".
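
The cropping, window selection and augmentation described above can be sketched as follows. This is an illustration under assumptions, not the patented implementation: frames are plain nested lists in height × width × channel order, the flip is horizontal with probability 0.5, and the per-channel mean values are supplied by the caller. The name `random_clip` and all its parameters are hypothetical.

```python
import random


def random_clip(frames, clip_len=16, crop=112, mean=(0.0, 0.0, 0.0), rng=random):
    """Select `clip_len` consecutive frames, crop them and augment them.

    frames: list of frames, each a height x width list of (R, G, B) pixels.
    Returns a list of `clip_len` frames of size crop x crop x 3. The patent
    text states the clip size as 3 x 16 x 112 x 112 (channels first); the
    output here keeps the channels last for readability.
    """
    if len(frames) < clip_len:
        raise ValueError("not enough frames for one clip")
    height, width = len(frames[0]), len(frames[0][0])
    start = rng.randint(0, len(frames) - clip_len)  # start of the temporal window
    top = rng.randint(0, height - crop)             # random spatial crop offset
    left = rng.randint(0, width - crop)
    flip = rng.random() < 0.5                       # assumed: horizontal flip, p = 0.5
    clip = []
    for frame in frames[start:start + clip_len]:
        rows = [row[left:left + crop] for row in frame[top:top + crop]]
        if flip:
            rows = [row[::-1] for row in rows]
        # Subtract the per-channel mean from every pixel.
        clip.append([[[px[c] - mean[c] for c in range(3)] for px in row] for row in rows])
    return clip


# Small demonstration: 20 constant-colour frames of 8 x 10 pixels, cropped to 4 x 4.
demo_frames = [[[(10.0, 20.0, 30.0) for _ in range(10)] for _ in range(8)] for _ in range(20)]
demo_clip = random_clip(demo_frames, clip_len=16, crop=4,
                        mean=(10.0, 20.0, 30.0), rng=random.Random(0))
print(len(demo_clip), len(demo_clip[0]), len(demo_clip[0][0]), len(demo_clip[0][0][0]))
```

With the values in the text, `clip_len=16` and `crop=112` would be used on frames of at least 112 × 112 pixels.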

The step of fusing the non-local module in the C3D network in step 3) to obtain the NL-C3D network model includes:

3.1) The original C3D network model is built mainly from 3D convolution and 3D pooling, and consists of 8 3D convolutional layers with 64, 128, 256 and 512 channels, 5 3D pooling layers, 2 fully connected layers and a softmax classifier; when the convolutional part is fused with the non-local neural network, in contrast to the original C3D network model, the non-local network is combined with the convolutional layers as a whole through residual connections, so that every convolutional layer is fused with a non-local neural network module;

3.2) The input X of the C3D convolution module has shape (T, H, W, C), where T is the number of frames, H is the height of a video frame, W is the width of a video frame and C is the number of channels; after the non-local neural network module is fused into the C3D network model, the input X is fed into the θ, φ and g convolution modules, each of which is a 1 × 1 convolution with a stride of 1, and the outputs of these convolution modules are then reshaped;

3.3) The reshaped outputs of θ and φ are combined by matrix multiplication to obtain a (C, C) matrix, which is then normalized by Softmax; the normalized result is further matrix-multiplied with the reshaped output of the g branch;

3.4) The result obtained in step 3.3) is reshaped and then input into a g convolution module, and the output is finally added to the input X through a residual connection, which gives the C3D network model fused with the non-local neural network module, namely the NL-C3D network model.
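
The fusion in steps 3.1) to 3.4) can be illustrated with a small numerical sketch. It uses the position-wise form of the non-local operation, in which the response at each position is a similarity-weighted average of the features at all positions, followed by a residual addition. As simplifying assumptions, the θ, φ, g and output convolutions are all taken to be identity maps, the similarity is the exponential of a dot product, and the feature map is flattened into a list of N feature vectors. The function name `non_local` is hypothetical.

```python
import math


def non_local(x):
    """Apply a simplified non-local operation with a residual connection.

    x: list of N feature vectors, one per space-time position.
    Returns a list of the same shape in which each output is the
    similarity-weighted average of all positions plus the original input.
    """
    out = []
    for xi in x:
        # Similarity of position i to every position j: exp(x_i . x_j).
        scores = [sum(a * b for a, b in zip(xi, xj)) for xj in x]
        peak = max(scores)  # subtracted for numerical stability; cancels after normalization
        weights = [math.exp(s - peak) for s in scores]
        norm = sum(weights)  # normalization factor
        yi = [sum(w * xj[c] for w, xj in zip(weights, x)) / norm
              for c in range(len(xi))]
        out.append([y + v for y, v in zip(yi, xi)])  # residual connection
    return out


# When every position carries the same feature, the weighted average equals
# that feature, so the residual output is exactly twice the input.
uniform = non_local([[2.0, 3.0], [2.0, 3.0]])
mixed = non_local([[1.0, 0.0], [0.0, 1.0]])
print(uniform, mixed)
```

Because the output has the same shape as the input, such a block can be placed after a convolutional layer without changing the layers around it.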

The step of importing the image frames into the NL-C3D network model for training and checking in the step 4) comprises the following steps:

4.1) Video frames of size 3 × 16 × 112 × 112 are passed into the NL-C3D network model, whose layers are, in order, a 64-channel convolutional layer, a 128-channel convolutional layer, two 256-channel convolutional layers, two 512-channel convolutional layers and two further 512-channel convolutional layers, each of these groups being followed by a pooling layer; these are followed by two 2096-dimensional fully connected layers and a softmax layer, and the final output has dimensions [10, n], where n is the number of categories in the data set used for training;

4.2) After a non-local neural network is fused into the convolutional layers, the NL-C3D network model enhances the local features of the target by compressing channel features and aggregating global spatial features. First, the similarity between the pixel at the current position and every pixel in the feature map is computed; the features are then summed with these similarities as weights, which enriches the feature information of the region and so improves the global features. The non-local operation computes the response at a position as a weighted sum over the features at all positions of the feature map, as shown in formula (1):

$$y_i = \frac{1}{C(x)} \sum_{\forall j} f(x_i, x_j)\, g(x_j) \qquad (1)$$

where x and y denote the input features and output features respectively, corresponding to feature maps of images and video, and have the same dimensions; i is the index of the current position of a feature point, and j is the index of any other feature point in the feature map; the function $f(x_i, x_j)$ describes the degree of correlation between $x_i$ and $x_j$, so the smaller the value of f, the smaller the influence of j on i; $g(x_j)$ is a linear function that provides the feature of the map at position j; C(x) is a normalization parameter; and $f(x_i, x_j)$ is a Gaussian function, as shown in formula (2):

$$f(x_i, x_j) = e^{x_i^{T} x_j} \qquad (2)$$

The normalization factor C(x) is given by formula (3):

$$C(x) = \sum_{\forall j} f(x_i, x_j) \qquad (3)$$

From formula (1), because the non-local operation considers the relationship between the current position and all positions in the feature map, it can effectively capture dependencies between multiple positions of the video frames. The convolutional layers and the non-local neural network are connected through a residual structure. In a concrete implementation the non-local operation is converted into matrix multiplications and convolution operations; after these operations and reshapes, the output feature Z has the same dimensions as the input X, so the module can be added directly to each convolution module of the network without modifying the network. The convolution part with the non-local neural network is defined as shown in formula (4):

$$Z_i = W_z Y_i + X_i \qquad (4)$$

where $Y_i$ is obtained from formula (1), $W_z$ is the weight matrix, and "$+ X_i$" denotes the residual connection; with this residual connection the spatio-temporal features of the video are obtained without disturbing the original parameters and initialization method of the model;

4.3) For convolution, three-dimensional convolution is used, which convolves across adjacent frames and so extracts information from both the spatial and the temporal dimensions of the video, preserving both spatial and temporal information; compared with 2D convolution, the input gains a depth dimension and the convolution kernel gains a corresponding dimension; for multi-channel input the input size is 3 × 16 × 112 × 112;

4.4) To prevent overfitting, a Dropout layer is introduced into each layer of the NL-C3D network model; it randomly removes nodes of the neural network together with the connections attached to them, which reduces the complexity of the network; with a Dropout rate of ρ, each node is retained with probability 1 − ρ;

4.5) The loss function measures how well the network structure has been trained on the data set and serves as the reference standard: the larger its value, the larger the error; the model uses the cross-entropy loss function, as shown in formula (5):

$$H(p, q) = -\sum_{x} p(x) \log q(x) \qquad (5)$$

which expresses how difficult it is to represent the probability distribution p by the probability distribution q, where p is the probability distribution of the correct answer and q is the predicted distribution; the smaller the cross entropy, the smaller the difference between the two probability distributions;

On this basis, the probability of each class is obtained with the Softmax function, as shown in formula (6):

where S denotes the classification probability scores over the M categories, the scores of the individual categories being $S_1, S_2, \ldots, S_M$; in the calculation, the exponential of one category's score is divided by the sum of the exponentials of all category scores; the category with the largest probability, and hence the smallest loss, is taken as the actual category, which gives the classification result.
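
The classification and loss computation in step 4.5) can be illustrated with a short sketch. It assumes the standard softmax, in which the exponential of each raw score is divided by the sum of the exponentials over all M categories, and the cross entropy of formula (5) evaluated against a one-hot target. The raw scores below are made-up numbers for illustration, and the small `eps` term is an added safeguard against taking the logarithm of zero.

```python
import math


def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    peak = max(scores)  # subtracted for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(p, q, eps=1e-12):
    """H(p, q) = -sum p(x) log q(x), with eps guarding against log(0)."""
    return -sum(pi * math.log(qi + eps) for pi, qi in zip(p, q))


CLASSES = ["calling", "smoking", "playing with a mobile phone", "dozing"]
raw_scores = [2.0, 0.5, 0.1, -1.0]  # hypothetical network outputs for one clip
probs = softmax(raw_scores)
target = [1.0, 0.0, 0.0, 0.0]       # one-hot label: the clip shows "calling"
loss = cross_entropy(target, probs)
predicted = CLASSES[probs.index(max(probs))]
print(predicted, loss)
```

The predicted class is the one with the largest probability, and the loss shrinks towards zero as the probability assigned to the correct class approaches 1.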

Compared with the prior art, the technical scheme has the following advantages:

1. the method of the technical scheme can accurately detect the human behavior in the video;

2. the technical scheme uses a non-local neural network that compresses channel features and aggregates global spatial features, so as to enhance local features and improve detection accuracy;

3. to prevent overfitting, the newly introduced Dropout layer randomly removes some nodes of the neural network, together with all connections attached to these nodes, to reduce the complexity of the network.

The method is an improvement of the C3D model: a non-local neural network model is fused into the 3D convolution part, which addresses the long-range dependence between video frames, strengthens the understanding of feature information and improves detection accuracy; and a Dropout layer is added to each layer, which reduces the amount of computation, prevents overfitting and improves detection speed.
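
The Dropout behavior referred to above, with rate ρ and keep probability 1 − ρ, can be sketched as follows. The rescaling of kept activations by 1 / (1 − ρ), known as inverted dropout, is a common convention assumed here and is not stated in the patent; the function name `dropout` is hypothetical.

```python
import random


def dropout(activations, rho, rng=random):
    """Zero each activation with probability rho and keep it with probability 1 - rho.

    Kept activations are divided by (1 - rho) so that the expected value of
    each output matches its input (inverted dropout, an assumed convention).
    """
    if not 0.0 <= rho < 1.0:
        raise ValueError("rho must be in [0, 1)")
    keep = 1.0 - rho
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


dropped = dropout([1.0] * 1000, rho=0.5, rng=random.Random(0))
unchanged = dropout([1.0, 2.0, 3.0], rho=0.0)
print(sum(1 for v in dropped if v == 0.0), unchanged)
```

Dropout of this kind is applied during training only; at inference time all nodes are kept.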

Drawings

FIG. 1 is a flow chart of an embodiment;

FIG. 2 is a schematic diagram of a non-local neural network according to an embodiment;

FIG. 3 is a schematic diagram of a NL-C3D network in an embodiment.

Detailed Description

The invention is described in further detail below with reference to the following figures and specific examples, but the invention is not limited thereto.

The embodiment is as follows:

referring to fig. 1, the method for detecting abnormal behaviors of tower crane operators based on the NL-C3D model includes the following steps: 1) Collecting a monitoring video data set related to the operation process of tower crane operators, and dividing the video data set into a training set, a verification set and a test set;

2) Splitting the video data into image frames, then cropping the image frames so that all images have the same size, and keeping a corresponding number of image frame samples; specifically, the imported video frames are adjusted to shape [10, 16, 112, 112, 3], where 16 is frame_length, meaning that each training sample consists of 16 frames, 112 is the crop_size of the images, meaning that the video frames are cropped to 112 × 112 pixels, and 3 is the number of input channels;

3) Fusing a non-local module in a C3D network, as shown in FIG. 2, to obtain an NL-C3D network model, as shown in FIG. 3;

4) Feeding the image frame data set from step 2) into the NL-C3D network model in the order of training set, verification set and test set for training and verification, and then obtaining the final result with a softmax classifier.

The video data set in step 1) is collected as follows: operation videos are recorded by a camera at a resolution of 320 × 240 pixels and a frame rate greater than 25 frames per second; one frame is then extracted from the data set every four frames to obtain the frame images, giving 1620 pictures after extraction; and the data set is divided into a training set, a verification set and a test set according to the ratio of 6.

Cropping the image frames in step 2) and keeping a corresponding number of image frame samples specifically comprises the following: during input processing, in order to improve the robustness and accuracy of the model, the image frames are first randomly cropped to 112 × 112 pixels; the starting position of the frames to be fed to the network is then determined within the extracted video frames, and a sliding window selects sixteen frames from that position as the network input, so that the selected clip has size 3 × 16 × 112 × 112; at the same time, data augmentation is performed by random flipping and by mean subtraction along the three RGB channels of the image frames; finally, the images are annotated with the image annotation software LabelImg, the abnormal behaviors being labeled "calling", "smoking", "playing with a mobile phone" and "dozing".

The step 3) of fusing the non-local module in the C3D network to obtain the NL-C3D network model includes:

3.1) The original C3D network model is built mainly from 3D convolution and 3D pooling, and consists of 8 3D convolutional layers with 64, 128, 256 and 512 channels, 5 3D pooling layers, 2 fully connected layers and a softmax classifier; when the convolutional part is fused with the non-local neural network, in contrast to the original C3D network model, the non-local network is combined with the convolutional layers as a whole through residual connections, so that every convolutional layer is fused with a non-local neural network module;

3.2) The input X of the C3D convolution module has shape (T, H, W, C), where T is the number of frames, H is the height of a video frame, W is the width of a video frame and C is the number of channels; after the non-local neural network module is fused into the C3D network model, the input X is fed into the θ, φ and g convolution modules, each of which is a 1 × 1 convolution with a stride of 1, and the outputs of these convolution modules are then reshaped;

3.3) The reshaped outputs of θ and φ are combined by matrix multiplication to obtain a (C, C) matrix, which is then normalized by Softmax; the normalized result is further matrix-multiplied with the reshaped output of the g branch;
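
For orientation, the tensor sizes implied by the layer configuration in step 3.1) can be traced with a short sketch. The pooling kernel sizes and padding are not given in the text; the values below are assumptions taken from a commonly used C3D configuration, in which the first pooling layer leaves the temporal dimension unchanged and the last one pads the spatial dimensions. The convolutions are assumed to preserve the spatial and temporal size, and all names are hypothetical.

```python
def pooled(size, kernel, stride, pad=0):
    """Output length of a pooling window along one dimension (floor mode)."""
    return (size + 2 * pad - kernel) // stride + 1


# (output channels, pool kernel (t, h, w), pool padding (t, h, w)).
# The pooling settings are assumptions; stride is taken equal to the kernel size.
STAGES = [
    (64, (1, 2, 2), (0, 0, 0)),
    (128, (2, 2, 2), (0, 0, 0)),
    (256, (2, 2, 2), (0, 0, 0)),
    (512, (2, 2, 2), (0, 0, 0)),
    (512, (2, 2, 2), (0, 1, 1)),
]


def trace(shape=(3, 16, 112, 112)):
    """Return the (channels, frames, height, width) shape after each pooling stage."""
    dims = list(shape[1:])
    history = []
    for channels, kernel, pad in STAGES:
        dims = [pooled(d, k, k, p) for d, k, p in zip(dims, kernel, pad)]
        history.append((channels, *dims))
    return history


shapes = trace()
for stage_shape in shapes:
    print(stage_shape)
```

Under these assumptions the feature map entering the fully connected layers has 512 × 1 × 4 × 4 = 8192 values per clip.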

where x and y denote the input features and output features respectively, corresponding to feature maps of images and video, and have the same dimensions; i is the index of the current position of a feature point, and j is the index of any other feature point in the feature map; the function $f(x_i, x_j)$ describes the degree of correlation between $x_i$ and $x_j$, so the smaller the value of f, the smaller the influence of j on i; $g(x_j)$ is a linear function that provides the feature of the map at position j; C(x) is a normalization parameter; and $f(x_i, x_j)$ is a Gaussian function, as shown in formula (2):

$$f(x_i, x_j) = e^{x_i^{T} x_j} \qquad (2)$$

The normalization factor C(x) is given by formula (3):

$$C(x) = \sum_{\forall j} f(x_i, x_j) \qquad (3)$$

$$Z_i = W_z Y_i + X_i \qquad (4)$$

where $Y_i$ is obtained from formula (1), $W_z$ is the weight matrix, and "$+ X_i$" denotes the residual connection; with this residual connection the spatio-temporal features of the video are obtained without disturbing the original parameters and initialization method of the model;

$$H(p, q) = -\sum_{x} p(x) \log q(x) \qquad (5)$$

On this basis, the probability of each class is obtained with the Softmax function, as shown in formula (6):

where S denotes the classification probability scores over the M categories, the scores of the individual categories being $S_1, S_2, \ldots, S_M$; in the calculation, the exponential of one category's score is divided by the sum of the exponentials of all category scores; the category with the largest probability, and hence the smallest loss, is taken as the actual category, which gives the classification result.

Performance evaluation:

The NL-C3D network model was compared with the C3D network model on the same data set and in the same experimental environment, with accuracy and elapsed time as the evaluation indexes; the results are shown in Table 1:

Table 1. Performance comparison before and after improving the model

Network model	Accuracy	Elapsed time/s
C3D	0.72	268
NL-C3D	0.75	237

The table shows that the NL-C3D model improves both the accuracy and the time consumption of detection. Fusing the non-local neural network model into the 3D convolution part addresses the long-range dependence between video frames, strengthens the understanding of feature information and improves detection accuracy; and adding a Dropout layer to each layer reduces the amount of computation, prevents overfitting and improves recognition speed.
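
The size of the improvement reported in Table 1 can be worked out directly from the two rows. The short calculation below uses only the table values; the variable names are illustrative.

```python
# Accuracy and elapsed time (seconds) taken from Table 1.
results = {"C3D": (0.72, 268), "NL-C3D": (0.75, 237)}

base_acc, base_time = results["C3D"]
new_acc, new_time = results["NL-C3D"]

acc_gain = new_acc - base_acc                    # absolute accuracy gain
time_saved = base_time - new_time                # seconds saved
time_saved_pct = 100.0 * time_saved / base_time  # relative time reduction

print(f"accuracy gain: {acc_gain:.2f}")
print(f"time saved: {time_saved} s ({time_saved_pct:.1f}%)")
```

That is, accuracy rises by 0.03 (three percentage points) and elapsed time falls by 31 s, or about 11.6% of the C3D time.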

Claims

1. A tower crane operator abnormal behavior detection method based on an NL-C3D model is characterized by comprising the following steps:

1) Collecting a monitoring video data set related to the operation process of tower crane operators, and dividing the video data set into a training set, a verification set and a test set;

2) Dividing video data into image frames through an algorithm, then cutting the image size of the image frames to ensure the size of the images to be consistent, and keeping a corresponding number of image frame samples;

3) Fusing a non-local module in a C3D network to obtain an NL-C3D network model;

4) Feeding the image frame data set from step 2) into the NL-C3D network model in the order of training set, verification set and test set for training and verification, and then obtaining the final result with a softmax classifier.

2. The NL-C3D model-based tower crane operator abnormal behavior detection method according to claim 1, wherein the video data set in step 1) is collected as follows: operation videos are recorded by a camera at a resolution greater than 320 × 240 pixels and a frame rate greater than 25 frames per second; one frame is then extracted from the data set every four frames to obtain the frame images; for videos from which this sampling interval cannot produce the sixteen frames required as the temporal width of the network input, the sampling step is reduced manually until at least sixteen frames are obtained; and the data set is divided into a training set, a verification set and a test set according to the ratio of 2.

3. The NL-C3D model-based tower crane operator abnormal behavior detection method according to claim 1, wherein cropping the image frames in step 2) and keeping a corresponding number of image frame samples specifically comprises the following: during input processing, in order to improve the robustness and accuracy of the model, the image frames are first randomly cropped to 112 × 112 pixels; the starting position of the frames to be fed to the network is then determined within the extracted video frames, and a sliding window selects sixteen frames from that position as the network input, so that the selected clip has size 3 × 16 × 112 × 112; at the same time, data augmentation is performed by random flipping and by mean subtraction along the three RGB channels of the image frames; finally, the images are annotated with the image annotation software LabelImg, the abnormal behaviors being labeled "calling", "smoking", "playing with a mobile phone" and "dozing".

4. The NL-C3D model-based tower crane operator abnormal behavior detection method according to claim 1, wherein the step 3) of fusing a non-local module in a C3D network to obtain the NL-C3D network model comprises the following steps:

3.1) The original C3D network model is built mainly from 3D convolution and 3D pooling, and consists of 8 3D convolutional layers with 64, 128, 256 and 512 channels, 5 3D pooling layers, 2 fully connected layers and a softmax classifier; when the convolutional part is fused with the non-local neural network, in contrast to the original C3D network model, the non-local network is combined with the convolutional layers as a whole through residual connections, so that every convolutional layer is fused with a non-local neural network module;

3.2) The input X of the C3D convolution module has shape (T, H, W, C), where T is the number of frames, H is the height of a video frame, W is the width of a video frame and C is the number of channels; after the non-local neural network module is fused into the C3D network model, the input X is fed into the θ, φ and g convolution modules, each of which is a 1 × 1 convolution with a stride of 1, and the outputs of these convolution modules are then reshaped;

3.3) The reshaped outputs of θ and φ are combined by matrix multiplication to obtain a (C, C) matrix, which is then normalized by Softmax; the normalized result is further matrix-multiplied with the reshaped output of the g branch;

5. The NL-C3D model-based tower crane operator abnormal behavior detection method according to claim 1, wherein the step of importing the image frames into the NL-C3D network model for training and checking in the step 4) comprises the steps of:

where x and y denote the input features and output features respectively, corresponding to feature maps of images and video, and have the same dimensions; i is the index of the current position of a feature point, and j is the index of any other feature point in the feature map; the function $f(x_i, x_j)$ describes the degree of correlation between $x_i$ and $x_j$, so the smaller the value of f, the smaller the influence of j on i; $g(x_j)$ is a linear function that provides the feature of the map at position j; C(x) is a normalization parameter; and $f(x_i, x_j)$ is a Gaussian function, as shown in formula (2):

$$f(x_i, x_j) = e^{x_i^{T} x_j} \qquad (2)$$

The normalization factor C(x) is given by formula (3):

$$C(x) = \sum_{\forall j} f(x_i, x_j) \qquad (3)$$

$$Z_i = W_z Y_i + X_i \qquad (4)$$

where $Y_i$ is obtained from formula (1), $W_z$ is the weight matrix, and "$+ X_i$" denotes the residual connection; with this residual connection the spatio-temporal features of the video are obtained without disturbing the original parameters and initialization method of the model;

$$H(p, q) = -\sum_{x} p(x) \log q(x) \qquad (5)$$