CN113128360A - Driver driving behavior detection and identification method based on deep learning - Google Patents

Driver driving behavior detection and identification method based on deep learning

Info

Publication number
CN113128360A
Authority
CN
China
Prior art keywords
convolution
layer
network
feature
deep learning
Prior art date
Legal status
Pending
Application number
CN202110343377.7A
Other languages
Chinese (zh)
Inventor
蔡沈健
倪成润
黄鹤
张强
沈纲祥
Current Assignee
Suzhou Leda Nanotechnology Co ltd
Original Assignee
Suzhou Leda Nanotechnology Co ltd
Priority date
Filing date
Publication date
Application filed by Suzhou Leda Nanotechnology Co ltd filed Critical Suzhou Leda Nanotechnology Co ltd
Priority to CN202110343377.7A priority Critical patent/CN113128360A/en
Publication of CN113128360A publication Critical patent/CN113128360A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/50 - Context or environment of the image
    • G06V20/59 - Context or environment of the image inside of a vehicle, e.g. relating to seat occupancy, driver state or inner lighting conditions
    • G06V20/597 - Recognising the driver's state or behaviour, e.g. attention or drowsiness
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/044 - Recurrent networks, e.g. Hopfield networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/048 - Activation functions
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 - Movements or behaviour, e.g. gesture recognition
    • G06V40/23 - Recognition of whole body movements, e.g. for sport training

Abstract

The invention discloses a driver driving behavior detection and identification method based on deep learning, which comprises the following steps: step 1, acquiring a video frame sequence of a driver in the driving process, wherein the video frame sequence contains behavior images of the driver while driving; step 2, preprocessing the video frame sequence; and step 3, constructing a deep learning model formed by cascading ResNet-18, a multi-layer LSTM network and a fully connected layer, and detecting and identifying the preprocessed video frame sequence with the deep learning model. The invention adopts a ResNet-LSTM network structure, eliminating the influence of gradient explosion or gradient vanishing caused by increasing network depth, and introduces channel, spatial and temporal attention mechanisms into the ResNet-18 and LSTM networks respectively, making full use of the spatial and temporal information of the video and improving the performance of the model.

Description

Driver driving behavior detection and identification method based on deep learning
Technical Field
The invention relates to the field of deep learning and image processing, in particular to a driver driving behavior detection and identification method based on deep learning.
Background
In recent years, video behavior recognition technology is beginning to be applied to more and more fields including video monitoring, vehicle tracking, behavior recognition and the like, and driver driving behavior recognition also depends on the video behavior recognition technology. Video behavior recognition is a further development of deep learning in the field of image recognition. Meanwhile, with the continuous improvement of computer hardware technology, especially the rapid development of the GPU, the image recognition algorithm based on deep learning gradually becomes the mainstream algorithm.
Research on video behavior recognition mainly involves designing more efficient deep learning models and learning algorithms. Considering both spatial and temporal characteristics, a model needs strong feature learning capability; a suitable network structure must generalize well enough for practical applications, remain sensitive to bad driving behaviors in different environments, and keep the rates of missed and false alarms low. In general, the training and recognition time of deep learning models is long, and it is difficult to meet real-time requirements from the algorithm side alone. To satisfy application requirements, the designed deep learning model needs to be made lightweight. However, reducing the number of parameters also tends to reduce the recognition accuracy of the model. For implementation on portable terminal devices, problems such as high recognition accuracy, high computation speed, low device power consumption and limited memory need to be addressed together.
Therefore, in view of these practical problems, it is very important to design a deep learning model that detects the bad driving behavior of the driver with high recognition accuracy while meeting real-time requirements.
For identifying the driving behavior of a driver, related research mainly focuses on improving the ability of deep learning models to extract the spatial features of video data and the temporal features of videos. Different network structures have different feature learning capabilities, and properly combining different networks [1] can achieve different effects. At present, there are several mainstream deep learning models, including:
(1) convolutional neural network
The development of convolutional neural networks (CNN) in the image field is mature. For a video composed of a series of continuous frames, a CNN can also be used to learn and extract the explicit features contained in the video frames [2]. Classical CNN models include LeNet, AlexNet, GoogLeNet, VGG and ResNet (residual network) [3]. For example, ResNet adopts the idea of the residual module and changes the feature representation H(x) learned from the input x into residual learning F(x) = H(x) - x to solve the learning degradation problem caused by increasing the network depth. When the residual is 0, the network only performs an identity mapping, so at the very least the performance of the network does not degrade.
Document [4] combines an FCNN (fully convolutional neural network) with a three-level cascaded deep convolutional neural network to identify driver violations. First, the data is semantically segmented using the FCNN; the first two stages of the cascaded convolutional network then eliminate a large number of normal driving behaviors, and the last stage gives the classification result. Document [5] adopts multi-network feature fusion for classification: a parallel network is formed by combining three different convolutional networks, ResNet, VGG16 and Inception, which perform feature extraction and feature fusion on the same data at the same time [6], and the fused features are finally fed into a classifier to detect whether the driver is drowsy. Document [7] proposes SFRCNN, an improvement of RCNN (region-based CNN), to detect the driving behavior of a driver.
To capture motion in video, the two-stream method uses RGB images and optical flow as model inputs and performs feature fusion at the end. The method uses the change of pixels of the image sequence in the time domain and the correlation between adjacent frames to find the correspondence between the previous frame and the current frame, and thereby computes the motion information of objects between adjacent frames; the optical flow therefore also encodes the temporal relationship between frames. TSN (Temporal Segment Network) [8] is a framework that improves on the generic two-stream structure and can improve classification on long videos. Document [9] uses VGG16 as the backbone, adopts parallel RGB-image and optical-flow inputs, discusses three different feature fusion strategies, and realizes the recognition and detection of behaviors such as smoking and making phone calls.
CNN-RNN is a cascade of a convolutional and a recurrent neural network. RNNs can be used for sequence prediction, speech recognition and other fields, but they suffer from vanishing gradients, memory overflow and similar problems, so a variant such as LSTM (Long Short-Term Memory) or GRU is generally adopted as the RNN model for driver driving behavior detection. First the CNN extracts spatial features from the stacked input video frames, then the RNN models the time sequence, and the result at the last time step is used as the output. To improve the performance of a CNN-RNN network, key frames are extracted and an attention mechanism is often added to the model, so that the LSTM network learns the importance of different frames in the sequence through weighting. Document [10] first crops out an area of interest of the image (e.g., the eye area) using MTCNN (multi-task cascaded convolutional networks), then extracts features with an improved residual network and detects driver fatigue via LSTM temporal modeling.
The input of a conventional 2D convolution is four-dimensional, [batch, height, width, channel], and the temporal information in the input is usually lost after the operation. The input of a 3D convolution is five-dimensional, [batch, depth, height, width, channel], and convolution and pooling are performed over space and time, so temporal modeling can be done well. In addition, 3D convolution takes complete video frames as input, does not rely on any preprocessing, and is therefore easily scalable to large datasets. Document [11] uses Inception-V1 as the backbone and realizes driver distraction detection with 3D convolution, achieving 94.4% accuracy on the State Farm dataset.
References
[1] Wang Y, Ho I W H. Joint deep neural network modelling and statistical analysis on characterizing driving behaviors [C]. 2018 IEEE Intelligent Vehicles Symposium (IV). IEEE, 2018: 1-6.
[2] Valeriano L C, Napoletano P, Schettini R. Recognition of driver distractions using deep learning [C]. The 8th International Conference on Consumer Electronics. IEEE, 2018: 1-6.
[3] Christoph R P W, Pinz F A. Spatiotemporal residual networks for video action recognition [J]. Advances in Neural Information Processing Systems, 2016: 3468-3476.
[4] Li Jun, Yan Huamin, Zhang ?yu, et al. Driver violation behavior recognition based on neural network fusion [J]. Computer Applications and Software, 2018, 35(12).
[5] Vijayan V, Sherly E. Real time detection system of driver drowsiness based on representation learning using deep neural networks [J]. Journal of Intelligent & Fuzzy Systems, 2019, 36(3): 1977-1985.
[6] Liu F, Li X, Lv T, et al. A review of driver fatigue detection: Progress and prospect [C]. 2019 IEEE International Conference on Consumer Electronics. IEEE, 2019: 1-6.
[7] Ulhaq A, He J, Zhang Y. Deep actionlet proposals for driver's behavior monitoring [C]. 2017 International Conference on Image and Vision Computing. IEEE, 2017: 1-6.
[8] Wang L, Xiong Y, Wang Z, et al. Temporal segment networks: Towards good practices for deep action recognition [C]. European Conference on Computer Vision. Springer, Cham, 2016: 20-36.
[9] Hu Y, Lu M Q, Lu X. Spatial-temporal fusion convolutional neural network for simulated driving behavior recognition [C]. 2018 15th International Conference on Control, Automation, Robotics and Vision. IEEE, 2018: 1271-1277.
[10] Xiao Z, Hu Z, Geng L, et al. Fatigue driving recognition network: Fatigue driving recognition via convolutional neural network and long short-term memory units [J]. IET Intelligent Transport Systems, 2019, 13(9): 1410-1416.
[11] Moslemi N, Azmi R, Soryani M. Driver distraction recognition using 3D convolutional neural networks [C]. 2019 4th International Conference on Pattern Recognition and Image Analysis (IPRIA). IEEE, 2019: 145-151.
At present, deep learning methods are mainly adopted for driver behavior recognition, but the deep learning models generally designed are complex in structure, with large numbers of parameters and heavy computation, and are not suitable for implementation on portable devices. These prior-art techniques suffer from several disadvantages:
(1) Network models based only on CNNs can learn the spatial features of single-frame images but neglect temporal features, so the recognition effect is not ideal.
(2) The two-stream method has good accuracy in video recognition, but it is generally formed by combining two identical networks and has a complex structure; meanwhile, computing the optical flow of video frames occupies a large amount of resources, which is impractical for mobile devices with limited memory.
(3) A 3D network can extract both spatial and temporal features, but because it operates on five-dimensional input, its number of parameters is much larger than that of a 2D convolution.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a driver driving behavior detection and identification method based on deep learning that improves the extraction of spatial and temporal features from video data and achieves accurate classification; because the resources of mobile devices are limited, the structure of the deep learning model is optimized and its parameters are pruned while the detection and identification accuracy is maintained; the designed driver driving behavior recognition system can meet real-time requirements; and structural optimization effectively reduces the memory resources occupied when the mobile device calls the solidified (frozen) network model.
In order to solve the technical problem, the invention provides a driver driving behavior detection and identification method based on deep learning, which comprises the following steps:
step 1, acquiring a video frame sequence of a driver in a driving process, wherein the video frame sequence comprises a behavior image of the driver in the driving process;
step 2, preprocessing the video frame sequence;
and step 3, constructing a deep learning model formed by cascading ResNet-18, a multi-layer LSTM network and a fully connected layer, and detecting and identifying the preprocessed video frame sequence by using the deep learning model.
In one embodiment, the ResNet-18 network consists of five parts: the first part consists of a convolutional layer and a max-pooling layer; the remaining four parts are convolutional networks with the same structure, each containing four convolutional layers, with the number of output channels doubling from part to part; and the input and output of every two convolutional layers are added directly to form a residual module, whose numbers of input and output channels correspond to those of the feature map of that part.
In one embodiment, the ResNet-18 network is used for extracting the spatial features of the video frame sequence, and a convolution attention module is added in each residual module of the ResNet-18 network, and the convolution attention module respectively weights the feature map on the channel and the space.
In one embodiment, the input of the channel convolution attention module is an H × W × C feature F. First, global average pooling and maximum pooling are performed on each input channel respectively to obtain two 1 × 1 × C channel descriptions. The two 1 × 1 × C channel descriptions are then fed into a two-layer convolutional neural network, in which the number of neurons in the first layer is C/r (r is a scaling factor) with a ReLU activation function and the number of neurons in the second layer is C. The two resulting features are added and mapped through a Sigmoid nonlinear function to obtain a weight vector Mc of dimension C with values between 0 and 1; each element reflects the importance of the corresponding channel, and the original feature F is multiplied by the weight coefficients to obtain the scaled feature. Mc is calculated as:

Mc(F) = σ( G(W1, G(W0, A(F))) + G(W1, G(W0, M(F))) )

where A is the global average pooling operation, M is the maximum pooling operation, G is the convolution operation, W0 is the first-layer convolution weight (the first convolution is followed by the ReLU activation function), W1 is the second-layer convolution weight, and σ is the Sigmoid function.
In one embodiment, the input of the spatial convolution attention module is an H × W × C feature F. Maximum pooling and average pooling are performed along the channel dimension to obtain two H × W × 1 descriptions, and the two matrices are concatenated along the channel dimension, i.e. stacked in the last dimension. A 7 × 7 convolution followed by a Sigmoid function produces a spatial feature weight coefficient Ms with values distributed between 0 and 1; each element represents the importance of the corresponding region feature, and Ms is multiplied with the feature map. Ms is calculated as:

Ms(F) = σ( f7x7([F_avg; F_max]) )

where A is the average pooling operation, M is the maximum pooling operation, f7x7 is a convolution with kernel size 7 × 7, F_avg = A(F) and F_max = M(F) are the feature matrices obtained after average pooling and maximum pooling respectively, and σ is the Sigmoid function.
In one embodiment, the LSTM network is configured to extract timing information of the sequence of video frames, the LSTM network has three layers, and each layer of LSTM input is processed by a coefficient weighting method through a timing attention module.
In one embodiment, the input of the temporal attention module is a T × C feature F. For convenience of the convolution operation, a dimension is first added at the front to give 1 × T × C, and the matrix is then rearranged to [1, C, T]. Global average pooling and maximum pooling are performed on each temporal feature respectively to obtain two 1 × 1 × T temporal descriptions. The two 1 × 1 × T temporal descriptions are then fed into a two-layer convolutional neural network, in which the number of neurons in the first layer is T/r (r is a scaling factor) with a ReLU activation function and the number of neurons in the second layer is T. The two resulting features are added and mapped through a Sigmoid function to obtain a weight vector MT of dimension T with values between 0 and 1, which is then rearranged back to the original dimension [T, 1]; the original feature F is multiplied by the weight coefficients to obtain the scaled feature. MT is calculated as:

MT(F) = σ( G(W1, G(W0, A(F))) + G(W1, G(W0, M(F))) )

where A is the average pooling operation, M is the maximum pooling operation, G is the convolution operation, W0 and W1 are the weights of the first and second convolutional layers respectively, F_avg^T = A(F) and F_max^T = M(F) are the feature matrices after the two pooling operations, and σ is the Sigmoid function.
Based on the same inventive concept, the present application also provides a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of any of the methods when executing the program.
Based on the same inventive concept, the present application also provides a computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of any of the methods.
Based on the same inventive concept, the present application further provides a processor for executing a program, wherein the program executes to perform any one of the methods.
The invention has the beneficial effects that:
(1) The invention adopts a ResNet-LSTM network structure, eliminating the influence of gradient explosion or vanishing gradients caused by increasing the network depth; the idea of the residual module is also applied to improve the traditional LSTM network, making full use of temporal information and improving network performance.
(2) The invention uses depthwise separable convolution to prune the network parameters and combines this with model compression to optimize the network structure, preserving sufficient learning capability while reducing the network scale, improving the running speed of the model and making it easier to deploy on mobile devices.
(3) The CBAM attention module in the convolutional network is introduced into the LSTM, and the time sequence attention module TBAM is provided, so that the identification accuracy and the convergence speed of the network are effectively improved.
(4) The invention adopts a mixed replacement mode for the CNN network, combines the advantages of different algorithms and carries out regional improvement on the spatial feature extraction network.
Drawings
Fig. 1 is a flow chart of the driver driving behavior detection and identification method based on deep learning according to the invention.
Fig. 2 is a diagram of a ResNet-18 network architecture in accordance with the present invention.
FIG. 3 is a Block1 structure diagram in the present invention.
FIG. 4 is a Block2 structure diagram in the present invention.
FIG. 5 is a Block3 structure diagram in the present invention.
FIG. 6 is a Block4 structure diagram in the present invention.
FIG. 7 is a flow chart illustrating the channel attention operation in the present invention.
Fig. 8 is a schematic view of the spatial attention operation flow in the present invention.
Fig. 9 is a schematic flow diagram of the separable convolution operation in the present invention.
Fig. 10 is a schematic diagram of the structure of TBAM in the present invention.
FIG. 11 is a flow chart of the timing attention operation in the present invention.
FIG. 12 is a training flow diagram in the present invention.
Fig. 13 is a configuration diagram of a driving behavior recognition network in the present invention.
FIG. 14 is a schematic diagram of the combination of CBAM and residual module in the present invention.
Detailed Description
The present invention is further described below in conjunction with the following figures and specific examples so that those skilled in the art may better understand the present invention and practice it, but the examples are not intended to limit the present invention.
The invention provides a scheme for realizing the detection and identification of the driving behavior of a driver by cascading a ResNet-18 network and an LSTM network, and the technical scheme is given by a flow chart shown in figure 1.
The method adopts a Raspberry Pi as the vehicle-mounted device. Before being ported to the Raspberry Pi, the designed deep learning model is trained with the collected dataset and the trained network is saved as a checkpoint model. Because this type of file stores the network model and the network parameters as separate parts, which is inconvenient for hardware devices to call, the model is converted into a solidified (frozen) PB file so that the network weights and the network model are stored in the same file. This step only converts the model file and does not reduce the file size. To improve the running speed on the Raspberry Pi and reduce the memory occupied when the model is called, the PB model is further converted into a TensorFlow Lite model suitable for mobile devices such as Android terminals and microprocessors. Meanwhile, model quantization further reduces the size of the model file: converting the data type from float32 to float16 halves the model size.
Step 1, from the moment the vehicle-mounted device starts, the camera captures and stores, in real time, two to three seconds of continuous behavior images of the driver during driving by sampling frames at a fixed interval.
Step 2, the video frame sequence is subjected to image processing. The video frames captured by the camera are 640 × 480 pixels; they are cropped to meet the network input requirement and scaled to 224 × 224, and the images are grayed and normalized, giving the network input x = [batch, height, width, channel].
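A minimal preprocessing sketch of this step is given below, assuming OpenCV and NumPy are available; the centre-crop strategy, the function name preprocess_frame and the variable names are illustrative rather than taken from the patent.

import cv2
import numpy as np

def preprocess_frame(frame_bgr):
    """Centre-crop a 640x480 frame to a square, resize to 224x224,
    convert to grey and normalise to [0, 1]."""
    h, w = frame_bgr.shape[:2]                     # e.g. 480, 640
    side = min(h, w)
    y0, x0 = (h - side) // 2, (w - side) // 2
    crop = frame_bgr[y0:y0 + side, x0:x0 + side]
    crop = cv2.resize(crop, (224, 224))
    grey = cv2.cvtColor(crop, cv2.COLOR_BGR2GRAY).astype(np.float32) / 255.0
    # Keep three channels so the tensor matches [batch, height, width, channel].
    return np.repeat(grey[..., None], 3, axis=-1)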
Step 3, a deep learning model for driver driving behavior recognition is constructed by cascading ResNet-18, a multi-layer LSTM network and a fully connected layer. ResNet-18 consists of five parts. The first part consists of a convolutional layer with a 7 × 7 kernel, 64 channels and stride 2, followed by a max-pooling layer with a 3 × 3 kernel and stride 2. The remaining four parts are convolutional networks with roughly the same structure; each part has four convolutional layers with 3 × 3 kernels, and the number of output channels doubles from part to part: 64, 128, 256 and 512. The input and output of every two convolutional layers are added directly to form a residual module, whose numbers of input and output channels correspond to those of the feature map of that part. From the third part onward, the stride of the first convolutional layer is 2; to match the reduced feature-map size, the shortcut (residual) branch uses a 1 × 1 convolution with stride 2. The final number of output channels of the network is 512. Figs. 2 to 6 show the structure of ResNet-18 and its modules.
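The following is a minimal tf.keras sketch of the residual structure just described; the helper names residual_block and resnet18_backbone and the exact layer arguments are illustrative, not the patent's reference implementation.

import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters, downsample=False):
    stride = 2 if downsample else 1
    shortcut = x
    y = layers.Conv2D(filters, 3, strides=stride, padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, 3, strides=1, padding="same")(y)
    if downsample or x.shape[-1] != filters:
        # 1x1 convolution with stride 2 on the shortcut to match the output shape.
        shortcut = layers.Conv2D(filters, 1, strides=stride, padding="same")(x)
    return layers.Activation("relu")(layers.Add()([y, shortcut]))

def resnet18_backbone(inputs):
    # First part: 7x7 convolution (stride 2) followed by 3x3 max pooling (stride 2).
    x = layers.Conv2D(64, 7, strides=2, padding="same", activation="relu")(inputs)
    x = layers.MaxPool2D(3, strides=2, padding="same")(x)
    # Four parts with two residual modules (four convolutional layers) each.
    for i, filters in enumerate([64, 128, 256, 512]):
        x = residual_block(x, filters, downsample=(i > 0))
        x = residual_block(x, filters)
    return x    # feature map with 512 output channels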
The ResNet-18 network is used to extract the spatial features of the video frame sequence. To improve the feature extraction capability of the convolutional network, a Convolutional Block Attention Module (CBAM) is added to each residual module of the ResNet-18 network. CBAM weights the feature map along the channel dimension and spatially, respectively.
The input of the channel attention module is an H × W × C feature F. Global average pooling and maximum pooling are first performed on each channel to obtain two 1 × 1 × C channel descriptions. The two descriptions are then fed into a two-layer convolutional neural network, in which the number of neurons in the first layer is C/r (r is a scaling factor) with a ReLU activation function and the number of neurons in the second layer is C. The two resulting features are added and mapped through a Sigmoid function to obtain a channel weight vector Mc of dimension C with values between 0 and 1, and the original feature F is multiplied by the weight coefficients to obtain the scaled feature. Mc is calculated as:

Mc(F) = σ( G(W1, G(W0, A(F))) + G(W1, G(W0, M(F))) )

where A is the global average pooling operation, M is the maximum pooling operation, G is the convolution operation, W0 is the first-layer convolution weight (the first convolution is followed by the ReLU activation function), W1 is the second-layer convolution weight, and σ is the Sigmoid function.
Fig. 7 shows the corresponding operation flow of the channel attention.
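A minimal sketch of this channel attention weighting in tf.keras is shown below; the two 1 × 1 × C descriptors are passed through Dense layers (equivalent to 1 × 1 convolutions on a 1 × 1 × C tensor), and the reduction ratio r = 16 and the names are illustrative assumptions.

import tensorflow as tf
from tensorflow.keras import layers

def channel_attention(feature, r=16):
    c = feature.shape[-1]
    # Shared two-layer network applied to both pooled descriptors.
    dense0 = layers.Dense(c // r, activation="relu")     # W0 + ReLU
    dense1 = layers.Dense(c)                             # W1
    avg = layers.GlobalAveragePooling2D()(feature)       # A(F): 1x1xC description
    mx = layers.GlobalMaxPooling2D()(feature)            # M(F): 1x1xC description
    mc = tf.nn.sigmoid(dense1(dense0(avg)) + dense1(dense0(mx)))   # weights in (0, 1)
    return feature * tf.reshape(mc, (-1, 1, 1, c))       # rescale the original feature F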
the spatial attention module performs channel maximum pooling and average pooling on the feature map F to obtain two H multiplied by W multiplied by 1 channel descriptions, two matrixes are cascaded on the channel, namely the matrixes are superposed on the last dimension, a spatial feature weight coefficient Ms with the value distributed between 0 and 1 is obtained through convolution of 7 multiplied by 7 and a Sigmoid function, and the spatial feature weight coefficient Ms is multiplied by the feature map. The calculation formula of Ms is:
Figure RE-GDA0003108310270000111
wherein A is the average pooling operation, M is the maximum pooling operation, f7x7Is a convolution operation with a convolution kernel size of 7x7,
Figure RE-GDA0003108310270000112
respectively, the feature matrices obtained after the average pooling and the maximum pooling, and σ is a Sigmoid function. The detailed flow chart is shown in fig. 8.
In the implementation of ResNet-18, to increase the running speed of the network, optimize the network structure and reduce the number of parameters, separable convolution replaces the traditional convolution: depthwise convolution plus pointwise convolution replaces the original 3 × 3 convolution. In addition, following MobileNetV2, in the downsampling stage of a residual module a linear function replaces the ReLU activation of the last convolutional layer, which effectively reduces the probability that the ReLU function kills a large number of output neurons (i.e. drives their outputs to 0). The specific operation is shown in Fig. 9.
In Fig. 9, the 3 × 3 convolution on the right is the depthwise convolution and the 1 × 1 convolution below it is the pointwise convolution. Replacing the traditional convolution with depthwise separable convolution achieves the original effect while greatly reducing the number of network parameters, and using a linear activation function prevents part of the effective information in the features from being discarded.
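A sketch of the replacement just described (a depthwise 3 × 3 convolution followed by a 1 × 1 pointwise convolution with a linear activation), with illustrative names:

import tensorflow as tf
from tensorflow.keras import layers

def separable_conv(x, out_channels, stride=1):
    # Depthwise 3x3 convolution: one filter per input channel.
    x = layers.DepthwiseConv2D(3, strides=stride, padding="same", activation="relu")(x)
    # Pointwise 1x1 projection; the linear activation avoids discarding negative outputs.
    return layers.Conv2D(out_channels, 1, strides=1, padding="same", activation=None)(x)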
After the last convolutional layer and the 7 × 7 max-pooling layer, ResNet-18 outputs a tensor of shape [batch × time_steps, 1, 1, 512]. To match the input of the LSTM, the final output of the ResNet-18 model is reshaped to [batch, time_steps, 512]. The LSTM network is used to extract the temporal information of the video frame sequence; it has three layers, each with 512 hidden neurons, and the input of each LSTM layer is weighted by coefficients through a Temporal Block Attention Module (TBAM). The structure of the temporal attention module is shown in Fig. 10.
The input of the temporal attention module is a T × C feature F. For convenience of the convolution operation, a dimension is first added at the front to give 1 × T × C, and the matrix is then rearranged to [1, C, T]. Global average pooling and maximum pooling are performed on each temporal feature respectively to obtain two 1 × 1 × T temporal descriptions. The two descriptions are then fed into a two-layer network, in which the number of neurons in the first layer is T/r (r is a scaling factor) with a ReLU activation function and the number of neurons in the second layer is T. The two resulting features are added and passed through a Sigmoid function to obtain a weight vector MT of dimension T with values between 0 and 1, which is then rearranged back to the original dimension [T, 1]; the original feature F is multiplied by the weight coefficients to obtain the scaled feature. MT is calculated as:

MT(F) = σ( G(W1, G(W0, A(F))) + G(W1, G(W0, M(F))) )

where A is the average pooling operation, M is the maximum pooling operation, G is the convolution operation, W0 and W1 are the weights of the first and second convolutional layers respectively, F_avg^T = A(F) and F_max^T = M(F) are the feature matrices after the two pooling operations, and σ is the Sigmoid function. The block diagram is shown in Fig. 10.
Unlike the local attention of the spatial convolution, which only attends to a certain region of one image, the temporal attention module screens all video frames at the global (video) level: when the features of the video frames at a certain time step are salient, their contribution to the network is higher, so the network learns these features with emphasis. This improves the feature extraction capability of the network and accelerates its convergence.
The operation flow of the temporal attention mechanism is shown in Fig. 11.
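The following is a minimal sketch of the temporal weighting described above for a [batch, T, C] sequence of frame features; pooling over the channel axis yields the two 1 × 1 × T descriptions, and the reduction ratio r = 4 and the names are illustrative assumptions.

import tensorflow as tf
from tensorflow.keras import layers

def temporal_attention(seq, r=4):
    t = seq.shape[1]
    dense0 = layers.Dense(t // r, activation="relu")     # W0 + ReLU
    dense1 = layers.Dense(t)                             # W1
    # Pool over the channel axis so each time step is reduced to one value.
    avg = tf.reduce_mean(seq, axis=-1)                   # [batch, T]
    mx = tf.reduce_max(seq, axis=-1)                     # [batch, T]
    mt = tf.nn.sigmoid(dense1(dense0(avg)) + dense1(dense0(mx)))   # [batch, T], values in (0, 1)
    return seq * tf.expand_dims(mt, axis=-1)             # weight every time step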
the LSTM model generally takes the last time sequence as output. In the invention, a time pooling mode is adopted as final output, and all time sequence characteristics are summed and then averaged to be K [ sum/time _ steps,512 ]. If there are 15 frames, then it is [1,512 ].
The fully connected layer has five neurons, representing the five categories of the recognition task. The input sequence is X = {X1, X2, X3, ..., XT}, where each Xi = {x1, x2, x3, ..., xt} indicates that each video contributes t feature values, for a total of T video inputs. The output of the fully connected layer is Z = {Z1, Z2, Z3, ..., ZT}, where each Zi = {z1, z2, z3, z4, z5}. The computation is

Z = FC(X) = X · W,  W ∈ R^(t×5)

where X is the input feature of dimension T × t, FC is the fully connected operation, W is the weight matrix of dimension t × 5, and Z is the fully connected output of dimension T × 5.

Step 4, the corresponding category Y = {Y1, Y2, Y3, ..., YT} is obtained through a softmax layer:

Yi = max( softmax(z1), softmax(z2), softmax(z3), softmax(z4), softmax(z5) ), where softmax(zj) = exp(zj) / Σk exp(zk).
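A minimal sketch of this classification head, following the temporal average pooling described above (average over time steps, one fully connected layer with five outputs, softmax); the function and variable names are illustrative.

import tensorflow as tf
from tensorflow.keras import layers

def classification_head(seq_features, num_classes=5):
    pooled = tf.reduce_mean(seq_features, axis=1)        # temporal average pooling: [batch, 512]
    logits = layers.Dense(num_classes)(pooled)           # fully connected layer, weight matrix 512x5
    probs = tf.nn.softmax(logits, axis=-1)               # probabilities summing to 1
    return probs, tf.argmax(probs, axis=-1)              # predicted class index 0..4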
The training and transplanting process of the deep learning model for driver driving behavior recognition provided by the invention comprises the following steps:
1. First, the collected video dataset is cut frame by frame; 15 images containing the driver's driving behavior are selected from each video and stored in a folder named after the video in the format action_serialnumber_label.mp4, and all the video names are written into a txt text file.
2. The text document containing the video names is read line by line, the label category of each video is extracted, and then all the data are shuffled.
3. The training batch size is set to 13 and the initial learning rate lr to 0.001. The 15 frames under each video folder are read, cropped and scaled to 224 × 224 pixels, and grayed and normalized, giving a training input of dimension [13, 15, 224, 224, 3] and labels of dimension [13, 5].
4. The Adam optimization algorithm is adopted; after one thousand training iterations the learning rate is attenuated to 0.001·lr, and training continues until the network converges.
5. After the network training is finished, the model parameters are stored in checkpoint format and converted into PB format, combining the network structure graph file and the weights into one file; the model is then converted into a tflite model file with the TensorFlow Lite converter (see the sketch after this list). To improve operating efficiency, float16 is chosen as the quantization type for the network parameters during conversion.
6. The model is ported to the Raspberry Pi and debugging is completed.
The detailed flow chart is shown in fig. 12.
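The following sketch illustrates the export path of step 5 using the TensorFlow 2.x tf.lite converter API with float16 quantization; it assumes the trained model has been exported as a SavedModel rather than the checkpoint/PB route described in the text, and the file paths are illustrative.

import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model("driver_behavior_saved_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]     # quantize weights to float16
tflite_model = converter.convert()

with open("driver_behavior_fp16.tflite", "wb") as f:
    f.write(tflite_model)                                 # roughly halves the model size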
The invention will be further described with reference to figures 13 to 14 of the accompanying drawings:
1. Turn on the Raspberry Pi power supply, load the program and start the camera.
2. The frame rate of the Raspberry Pi camera is 30 FPS; one frame is taken every 4 frames to extract 15 images, which are displayed on the Raspberry Pi screen in real time.
3. The image captured by the CSI camera is 640 × 480 pixels; it is reduced to 224 × 224 using the resize function in OpenCV and grayed while keeping the number of channels at 3. To increase the computation speed of the network, the image pixels are uniformly divided by 255.0 for normalization.
4. With batch set to 1, the processed video frames form the input vector x[15, 224, 224, 3]. The input first passes through a 7 × 7 convolutional layer with stride 2 and 32 output channels, giving x[15, 112, 112, 32], and then through a 3 × 3 max-pooling layer with stride 2 to reduce redundant information, giving x[15, 56, 56, 32]. Four convolution modules follow, each consisting of four convolutional layers. The first module keeps the feature size unchanged and has 64 output channels, so its output is x1[15, 56, 56, 64]. In the first convolution of the second module the stride is 2 and the output channels are 128, giving x2[15, 28, 28, 128]; the third and fourth modules repeat the operation of the second module, giving x3[15, 14, 14, 256] and x4[15, 7, 7, 512] respectively. Finally, a 7 × 7 max-pooling layer changes the feature map to x[15, 1, 1, 512].
5. The output of the residual network is reshaped to x[1, 15, 512] to match the LSTM input format [batch, n_steps, input_size]. The LSTM network has three layers with the same structure, each with 512 hidden neurons, so the feature map output by each layer is x[1, 15, 512]. The average over all time steps is taken as the final output of the network, i.e. all temporal features are pooled by averaging, x = (Σt xt)/15.
6. Before the fully connected layer, the feature map is reshaped to x[1, 512]; the fully connected layer has a single layer with a 512 × 5 weight matrix, and its output is x[1, 5].
7. Finally, the softmax layer produces the output y[1, 5], whose second dimension is the number of action categories; each element is a probability between 0 and 1 and the elements sum to 1. The index of the maximum value gives the classification result: 0 represents normal driving, 1 yawning, 2 calling, 3 smoking and 4 line-of-sight deviation.
8. An alarm is issued according to the recognition result (a minimal sketch of this on-device pipeline follows this list).
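A minimal on-device sketch of the pipeline above using the tflite-runtime Interpreter; the model path, the exact input layout expected by the exported model and the classify helper are assumptions for illustration, while the class mapping follows step 7.

import numpy as np
import tflite_runtime.interpreter as tflite

CLASSES = ["normal driving", "yawning", "calling", "smoking", "line-of-sight deviation"]

interpreter = tflite.Interpreter(model_path="driver_behavior_fp16.tflite")
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

def classify(clip):
    # clip: preprocessed frames, e.g. shape [15, 224, 224, 3], float32 in [0, 1];
    # the layout must match whatever input shape the exported model expects.
    interpreter.set_tensor(inp["index"], clip.astype(np.float32))
    interpreter.invoke()
    probs = interpreter.get_tensor(out["index"])[0]      # five class probabilities
    label = int(np.argmax(probs))
    return CLASSES[label], float(probs[label])           # e.g. raise an alarm if label != 0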
A specific application scenario of the present invention follows the same eight steps described above with reference to Figs. 13 to 14.
The key concept of the invention is as follows:
(1) The invention introduces the attention module CBAM used in convolutional networks into the recurrent neural network LSTM and proposes a temporal attention module (TBAM) that focuses on key time steps at the global level, improving the network's ability to extract temporal information and accelerating its convergence;
(2) The invention adopts residual modules for the LSTM network, making full use of the semantic information between different LSTM layers and improving network performance;
(3) The invention prunes the CNN network parameters in a hybrid replacement manner, reducing the total number of network parameters by more than a factor of seven;
(4) random white Gaussian noise is added in the process of training the convolutional neural network, and Adam optimization, Dropout learning and L are used simultaneously2And the regularization is used for preventing overfitting, so that the robustness of the method is improved.
The above-mentioned embodiments are merely preferred embodiments used to fully illustrate the present invention, and the scope of the present invention is not limited thereto. Equivalent substitutions or changes made by those skilled in the art on the basis of the present invention are all within the protection scope of the present invention, which is defined by the claims.

Claims (10)

1. A driver driving behavior detection and identification method based on deep learning is characterized by comprising the following steps:
step 1, acquiring a video frame sequence of a driver in a driving process, wherein the video frame sequence comprises a behavior image of the driver in the driving process;
step 2, preprocessing the video frame sequence;
and step 3, constructing a deep learning model formed by cascading ResNet-18, a multi-layer LSTM network and a fully connected layer, and detecting and identifying the preprocessed video frame sequence by using the deep learning model.
2. The deep learning-based driver driving behavior detection and recognition method as claimed in claim 1, wherein the ResNet-18 network is composed of five parts: the first part is composed of a convolutional layer and a max-pooling layer; the remaining four parts are convolutional networks with the same structure, each containing four convolutional layers, with the number of output channels doubling from part to part; and the input and output of every two convolutional layers are added directly to form a residual module, whose numbers of input and output channels correspond to those of the feature map of that part.
3. The deep learning-based driver driving behavior detection and recognition method according to claim 2, wherein the ResNet-18 network is used for extracting spatial features of the video frame sequence, and a convolution attention module is added to each residual module of the ResNet-18 network, and the convolution attention module weights feature maps on a channel and a space respectively.
4. The deep learning-based driver driving behavior detection and recognition method according to claim 3, wherein the input of the channel convolution attention module is an H × W × C feature F; global average pooling and maximum pooling are performed on each input channel respectively to obtain two 1 × 1 × C channel descriptions; the two 1 × 1 × C channel descriptions are then fed into a two-layer convolutional neural network, in which the number of neurons in the first layer is C/r, r being a scaling factor, with a ReLU activation function, and the number of neurons in the second layer is C; the two resulting features are added and mapped through a Sigmoid nonlinear function to obtain a weight vector Mc of dimension C with values between 0 and 1, each element of which reflects the importance of the corresponding channel; and the original feature F is multiplied by the weight coefficients to obtain the scaled feature; Mc is calculated as:

Mc(F) = σ( G(W1, G(W0, A(F))) + G(W1, G(W0, M(F))) )

where A is the global average pooling operation, M is the maximum pooling operation, G is the convolution operation, W0 is the first-layer convolution weight, the first convolution operation being followed by the ReLU activation function, W1 is the second-layer convolution weight, and σ is the Sigmoid function.
5. The deep learning-based driver driving behavior detection and identification method as claimed in claim 3, wherein the input of the spatial convolution attention module is an H × W × C feature F; maximum pooling and average pooling are performed along the channel dimension respectively to obtain two H × W × 1 spatial descriptions; the two matrices are concatenated along the channel dimension, i.e. stacked in the last dimension, and a 7 × 7 convolution followed by a Sigmoid function produces a spatial feature weight coefficient Ms with values distributed between 0 and 1, each element of which represents the importance of the corresponding region feature and is multiplied with the feature map; Ms is calculated as:

Ms(F) = σ( f7x7([F_avg; F_max]) )

where A is the average pooling operation, M is the maximum pooling operation, f7x7 is a convolution with kernel size 7 × 7, F_avg = A(F) and F_max = M(F) are the feature matrices obtained after average pooling and maximum pooling respectively, and σ is the Sigmoid function.
6. The deep learning-based driver driving behavior detection and recognition method of claim 1, wherein the LSTM network is used to extract timing information of the sequence of video frames, the LSTM network has three layers, and each layer of LSTM input is processed by a timing attention module by a coefficient weighting method for timing.
7. The deep learning-based driver driving behavior detection and recognition method as claimed in claim 6, wherein the input of the temporal attention module is a T × C feature F; for convenience of the convolution operation a dimension is first added at the front to give 1 × T × C, and the matrix is then rearranged to [1, C, T]; global average pooling and maximum pooling are performed on each temporal feature respectively to obtain two 1 × 1 × T temporal descriptions; the two 1 × 1 × T temporal descriptions are fed into a two-layer convolutional neural network, in which the number of neurons in the first layer is T/r, r being a scaling factor, with a ReLU activation function, and the number of neurons in the second layer is T; the two resulting features are added and mapped through a Sigmoid function to obtain a weight vector MT of dimension T with values between 0 and 1, which is then rearranged back to the original dimension [T, 1]; and the original feature F is multiplied by the weight coefficients to obtain the scaled feature; MT is calculated as:

MT(F) = σ( G(W1, G(W0, A(F))) + G(W1, G(W0, M(F))) )

where A is the average pooling operation, M is the maximum pooling operation, G is the convolution operation, W0 and W1 are the weights of the first and second convolutional layers respectively, F_avg^T = A(F) and F_max^T = M(F) are the feature matrices after the two pooling operations, and σ is the Sigmoid function.
8. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method of any of claims 1 to 7 are implemented when the program is executed by the processor.
9. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.
10. A processor, characterized in that the processor is configured to run a program, wherein the program when running performs the method of any of claims 1 to 7.
CN202110343377.7A 2021-03-30 2021-03-30 Driver driving behavior detection and identification method based on deep learning Pending CN113128360A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110343377.7A CN113128360A (en) 2021-03-30 2021-03-30 Driver driving behavior detection and identification method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110343377.7A CN113128360A (en) 2021-03-30 2021-03-30 Driver driving behavior detection and identification method based on deep learning

Publications (1)

Publication Number Publication Date
CN113128360A true CN113128360A (en) 2021-07-16

Family

ID=76775030

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110343377.7A Pending CN113128360A (en) 2021-03-30 2021-03-30 Driver driving behavior detection and identification method based on deep learning

Country Status (1)

Country Link
CN (1) CN113128360A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113569805A (en) * 2021-08-13 2021-10-29 Beijing University of Civil Engineering and Architecture Action recognition method and device, electronic equipment and storage medium
CN113887419A (en) * 2021-09-30 2022-01-04 Sichuan University Human behavior identification method and system based on video temporal-spatial information extraction
CN114125454A (en) * 2022-01-26 2022-03-01 Zhongshan Power Supply Bureau of Guangdong Power Grid Co., Ltd. Video image coding system and method
CN114529889A (en) * 2022-01-28 2022-05-24 Yanshan University Method and device for identifying distracted driving behaviors and storage medium
CN116740649A (en) * 2023-08-07 2023-09-12 Shandong University of Science and Technology Deep learning-based real-time detection method for behavior of crewman falling into water beyond boundary

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106407889A (en) * 2016-08-26 2017-02-15 Shanghai Jiao Tong University Video human body interaction motion identification method based on optical flow graph deep learning model
WO2019010950A1 (en) * 2017-07-13 2019-01-17 Peking University Shenzhen Graduate School Depth discrimination network model method for pedestrian re-recognition in image or video
CN109886241A (en) * 2019-03-05 2019-06-14 Tianjin Polytechnic University Driver fatigue detection based on long short-term memory network
CN110059620A (en) * 2019-04-17 2019-07-26 Anhui Airuisi Intelligent Technology Co., Ltd. Skeleton activity recognition method based on spatio-temporal attention
CN111860504A (en) * 2020-07-20 2020-10-30 Qingdao University of Science and Technology Visual multi-target tracking method and device based on deep learning
CN112036454A (en) * 2020-08-17 2020-12-04 Shanghai University of Electric Power Image classification method based on multi-core dense connection network
CN111950649A (en) * 2020-08-20 2020-11-17 Guilin University of Electronic Technology Attention mechanism and capsule network-based low-illumination image classification method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Yang Boxiong, Zhang Dejun, Wu Yiqi: "Research on Deep Learning Theory and Practice Based on High-Performance Computing" (基于高性能计算的深度学习理论与实践研究), Beijing: China Machine Press, pages 129-131 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113569805A (en) * 2021-08-13 2021-10-29 Beijing University of Civil Engineering and Architecture Action recognition method and device, electronic equipment and storage medium
CN113887419A (en) * 2021-09-30 2022-01-04 Sichuan University Human behavior identification method and system based on video temporal-spatial information extraction
CN113887419B (en) * 2021-09-30 2023-05-12 Sichuan University Human behavior recognition method and system based on extracted video space-time information
CN114125454A (en) * 2022-01-26 2022-03-01 Zhongshan Power Supply Bureau of Guangdong Power Grid Co., Ltd. Video image coding system and method
CN114529889A (en) * 2022-01-28 2022-05-24 Yanshan University Method and device for identifying distracted driving behaviors and storage medium
CN116740649A (en) * 2023-08-07 2023-09-12 Shandong University of Science and Technology Deep learning-based real-time detection method for behavior of crewman falling into water beyond boundary
CN116740649B (en) * 2023-08-07 2023-11-03 Shandong University of Science and Technology Deep learning-based real-time detection method for behavior of crewman falling into water beyond boundary

Similar Documents

Publication Publication Date Title
Luo et al. Fire smoke detection algorithm based on motion characteristic and convolutional neural networks
CN113128360A (en) Driver driving behavior detection and identification method based on deep learning
CN112507898B (en) Multi-modal dynamic gesture recognition method based on lightweight 3D residual error network and TCN
Rahmon et al. Motion U-Net: Multi-cue encoder-decoder network for motion segmentation
CN113065645B (en) Twin attention network, image processing method and device
CN113269054B (en) Aerial video analysis method based on space-time 2D convolutional neural network
CN111274987B (en) Facial expression recognition method and facial expression recognition device
Gao et al. PSGCNet: A pyramidal scale and global context guided network for dense object counting in remote-sensing images
CN111723667A (en) Human body joint point coordinate-based intelligent lamp pole crowd behavior identification method and device
CN114220154A (en) Micro-expression feature extraction and identification method based on deep learning
CN113158905A (en) Pedestrian re-identification method based on attention mechanism
CN112418032A (en) Human behavior recognition method and device, electronic equipment and storage medium
Anoopa et al. Survey on anomaly detection in surveillance videos
CN113935435A (en) Multi-modal emotion recognition method based on space-time feature fusion
Ren et al. A lightweight object detection network in low-light conditions based on depthwise separable pyramid network and attention mechanism on embedded platforms
She et al. Facial image inpainting algorithm based on attention mechanism and dual discriminators
US20230154157A1 (en) Saliency-based input resampling for efficient object detection
WO2019136591A1 (en) Salient object detection method and system for weak supervision-based spatio-temporal cascade neural network
CN114782995A (en) Human interaction behavior detection method based on self-attention mechanism
CN114639166A (en) Examination room abnormal behavior recognition method based on motion recognition
CN109190543B (en) Smoke detection method based on block-based portable network
Ajith et al. Road Accident Detection from CCTV Footages using Deep Learning
Wei et al. Pedestrian anomaly detection method using autoencoder
Gong et al. Image and video understanding based on deep learning
CN113762007B (en) Abnormal behavior detection method based on appearance and action feature double prediction

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination