CN113283338A - Method, device and equipment for identifying driving behavior of driver and readable storage medium - Google Patents

Method, device and equipment for identifying driving behavior of driver and readable storage medium Download PDF

Info

Publication number
CN113283338A
CN113283338A
Authority
CN
China
Prior art keywords
driver
pooling
attention module
image data
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110569233.3A
Other languages
Chinese (zh)
Inventor
肖卫初
刘宏立
马子骥
陈伟宏
孙长亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan University
Hunan City University
Original Assignee
Hunan University
Hunan City University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hunan University, Hunan City University filed Critical Hunan University
Priority to CN202110569233.3A priority Critical patent/CN113283338A/en
Publication of CN113283338A publication Critical patent/CN113283338A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/59Context or environment of the image inside of a vehicle, e.g. relating to seat occupancy, driver state or inner lighting conditions
    • G06V20/597Recognising the driver's state or behaviour, e.g. attention or drowsiness
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a method, a device, equipment and a storage medium for identifying the driving behavior of a driver, wherein the method comprises the following steps: acquiring a driver image containing the driving behavior of the driver; processing the driver image with a random-cropping data enhancement technique to obtain first image data in a three-dimensional tensor format; processing the first image data in the three-dimensional tensor format with a convolutional neural network to generate second image data; inputting the second image data into a constructed CS_ResNet model, in which a channel attention module and a spatial attention module are serially connected and embedded in a residual network; fusing the local features output by the residual network model through a fully connected layer to form a global feature, and then calculating the score of each category with a classifier; and obtaining a driver behavior recognition result according to the score of each category. The method reduces model computational complexity and improves the recognition accuracy of the system.

Description

Method, device and equipment for identifying driving behavior of driver and readable storage medium
Technical Field
The invention relates to the field of artificial intelligence, in particular to a method, a device and equipment for identifying driving behaviors of a driver and a readable storage medium.
Background
With rising living standards, the automobile has become the most common means of transportation. While cars bring convenience to people's lives, they also bring a series of problems, such as road traffic accidents and environmental pollution; among these, road traffic accidents directly concern people's lives and safety and have therefore attracted wide attention.
Road traffic accidents are attributed to driver distraction, vehicle malfunction, bad weather, and the like, with more than 80% of accidents caused by driver distraction. Driver distraction refers to any behavior that diverts the driver's attention from driving, such as dozing, making phone calls, sending text messages, or smoking. Since driver behavior is a major factor affecting driving safety, driver distraction has become a research hotspot in safe driving.
The national center for disease prevention and control divides driver distraction into cognitive, visual, and manual distraction. Cognitive distraction means the driver's thoughts drift away from driving. Visual distraction means the driver's eyes leave the road while driving, for example when dozing. Manual distraction covers activities in which the driver's body moves away from the driving controls: when making a call with the left or right hand, the driver's hand leaves the steering wheel; when turning to talk to a passenger, the driver's head turns away from the front of the vehicle. To recognize these distracting behaviors, the driver's state information, such as hand movements, eye gaze, head pose, and foot dynamics, must be captured. Driver behavior recognition is studied from the aspects of data sets, models, and algorithms to improve accuracy. Most existing methods extract specific features from the original image in advance: for example, recognizing radio adjustment requires attention to the gaze direction of the eyes, while recognizing phone calls focuses on the position and shape of the hand holding the handset. However, such features are not always readily available.
Driver distraction recognition has been studied extensively over the last 20 years. Identifying the characteristic features of an action is the key to understanding driver behavior. From the viewpoint of model construction, driver behavior recognition methods can be classified into traditional methods, shallow machine learning methods, and deep learning methods.
In the study of traditional methods, researchers focus primarily on manually captured features, including head pose, eye gaze, facial expressions, foot dynamics, and hand motion. Such systems use physiological sensors to detect the driver's physiological signals, such as electroencephalograms (EEG), electrocardiograms, and electrooculograms. These features can be designed by domain experts and selectively extracted for specific tasks, because the driver's state information and behavioral characteristics contain important clues for behavior recognition with traditional methods. Electroencephalograms, for instance, correlate with driver behavior and can be used to identify it. Among traditional methods, the scale-invariant feature transform (SIFT) and the histogram of oriented gradients (HOG) are well-known two-dimensional feature descriptors for image classification; for behavior recognition, they can be extended to extract features from three-dimensional data, denoted SIFT-3D and HOG3D, respectively. Although behavior recognition can be achieved with traditional methods, performance is limited by the difficulty of manually extracting features under variable appearance and pose.
Shallow machine learning methods automatically extract data features to improve the accuracy of behavior recognition. For example, a random forest (RF) classifier can classify driver behavior: one method extracts features with a contour transform and adopts an RF classifier, which performs better than the linear perceptron, K-nearest neighbors, and the multilayer perceptron (MLP). Berri et al. propose a support vector machine model that detects the position of the face and hands to identify whether the driver is using a cell phone. Craye et al. use AdaBoost to classify driver distraction, with input images captured by a Kinect sensor; under changing lighting conditions, a method combining HOG features with an AdaBoost classifier classifies mobile phone use. Chiou et al. propose a hierarchical driver monitoring system (HDMS) that uses a sparse representation based on partial temporal face descriptors: the first layer of HDMS detects the driver's normal and abnormal behavior during driving, and the second layer determines whether the abnormal driving behavior is drowsiness or distraction. Techniques of stacking and combining learners with aggregation and combination rules in distracted-driver detection systems have also achieved good results.
In recent years, with the successful application of deep learning in computer vision, deep learning methods for driver behavior recognition have developed rapidly. For example, Xing et al. propose a deep CNN model for driver activity recognition that is capable of recognizing seven tasks; based on the image segmentation results of a Gaussian mixture model, it detects driver behavior with deep models such as AlexNet, GoogLeNet, and ResNet. Yang et al. propose a feed-forward neural network (FFNN) to identify seven driver behaviors; the FFNN uses RF and maximal information coefficient methods to assess the importance of each driver feature for behavior recognition. Eraqi et al. design an ensemble system in which convolutional neural networks are genetically weighted; the system is robust in recognizing distracted-driver postures, and the genetic-algorithm-based classifier achieves good classification accuracy. Chen et al. propose a driver behavior analysis system that uses one ConvNet to obtain spatial features and another ConvNet to obtain driver motion information, with the modal features classified by a fusion network.
The core idea of the attention mechanism is to let the system learn to attend, i.e., to ignore irrelevant information and focus on key information. Attention mechanisms were first proposed in the field of visual images and were inspired by the human attention mechanism. When people look at an image, they do not take in every pixel of the entire image at once; rather, they focus their attention on a particular portion of the image according to their needs. Furthermore, humans learn from images they have seen before where they should focus their attention in the future. According to the attended region, attention can be divided into the spatial domain, the channel domain, and the hybrid domain.
In recent years, deep learning methods with visual attention mechanisms have developed. For example, Mnih et al. propose a recurrent neural network model with an attention mechanism for image classification, in which a new weighting layer is added to identify key features of the image; through learning and training, the network learns the regions of interest in each new image. Wang et al. propose a residual attention network with multiple attention modules, in which an attention module comprising a mask branch and a trunk branch is built on ResNet and Inception. Hu et al. propose a squeeze-and-excitation network for image classification that adaptively recalibrates channel-wise feature responses.
In the course of implementing the invention, the inventors found that the above methods are insufficiently accurate for driver distraction classification tasks, for the following reasons:
1) some methods use global information but are weak at selectively emphasizing informative features;
2) some methods capture the spatial correlation between features through CNNs and Gaussian mixture models, obtaining a global receptive field over the driver's body; in driver behavior recognition, however, usually only the most salient image attributes are of interest, a consideration most approaches ignore.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a method, a device, equipment and a readable storage medium for identifying the driving behavior of a driver, so as to improve the accuracy of driving behavior classification.
In order to achieve the purpose, the invention provides the following technical scheme:
a driver driving behavior recognition method, comprising:
S101, acquiring a driver image containing the driving behavior of a driver;
S102, processing the driver image with a random-cropping data enhancement technique to reduce its resolution, thereby obtaining first image data in a three-dimensional tensor format;
S103, processing the first image data in the three-dimensional tensor format with a convolutional neural network to generate second image data; the convolution operation employs a tensor-based three-dimensional convolution to reduce the dimensionality of the first image data and create invariance to small distortions and shifts;
S104, inputting the second image data into the constructed CS_ResNet model; in the CS_ResNet model, a channel attention module and a spatial attention module are serially connected and embedded in a residual network, both modules using maximum pooling and average pooling to reduce model computational complexity and improve recognition accuracy;
S105, fusing the local features output by the residual network model through a fully connected layer to form a global feature, and then calculating the score of each category with a classifier;
and S106, obtaining a driver behavior recognition result according to the score of each category.
Preferably, the driver behaviors include 10 types: C0: safe driving; C1: texting with the right hand; C2: texting with the left hand; C3: making a phone call with the right hand; C4: making a phone call with the left hand; C5: adjusting the radio; C6: drinking; C7: reaching behind; C8: touching hair/applying makeup; C9: talking to a passenger.
Preferably, in step S102, the 1920 × 1080 × 3 original image is reduced to the 224 × 224 × 3 first image by random cropping.
Preferably, the convolutional neural network consists of a three-dimensional convolutional layer, a ReLU activation function, and a pooling layer; wherein:
in the three-dimensional convolutional layer of the convolutional neural network, each unit is connected to a local patch in the feature map of the previous layer by a set of weights called a filter bank; in the jth feature map of the ith layer, the convolution value of the unit at position (x, y) is calculated as:
v_{i,j}(x, y) = b_{i,j} + Σ_m Σ_{p=0}^{P_i−1} Σ_{q=0}^{Q_i−1} w_{i,j,m}(p, q) · v_{i−1,m}(x+p, y+q)
wherein b_{i,j} is the bias of the jth feature map in the ith layer, w_{i,j,m}(p, q) is the weight at position (p, q) of the filter connecting the mth feature map of the previous layer to the jth feature map of the ith layer, v_{i−1,m}(x+p, y+q) is the value of the previous feature map at location (x+p, y+q), and P_i and Q_i are the height and width of the kernel, respectively;
then, the result of the convolution operation is passed through a nonlinear transformation such as ReLU or Sigmoid; ReLU is the rectified linear unit, defined as follows:
f(x)=max(0,x)
wherein x is the input of the nonlinear function; different feature maps in a layer use different filter banks, while all units within one feature map share the same filter bank;
the pooling layer is used to fuse similar features so as to reliably detect patterns; maximum pooling and average pooling are two typical pooling methods: maximum pooling computes the maximum of a local block of units in the feature map, while average pooling computes their average; a block can be shifted by several rows or columns and used as input to adjacent pooling units, thereby reducing the dimensionality of the data and creating invariance to small distortions and shifts.
Preferably, in step S104, in the CS_ResNet model, two methods, average pooling and maximum pooling, are used to calculate the channel attention;
average pooling is used to compress the spatial dimension of the input and learn the extent of the target object;
maximum pooling is used to collect clues about distinctive object features;
for a given intermediate feature map X_c ∈ R^{C×H×W} as input to the channel attention module, the average pooling Z_a and the maximum pooling Z_m are calculated as follows:
Z_a = (1/(H·W)) Σ_{h=1}^{H} Σ_{w=1}^{W} X_c(h, w)
Z_m = max{X_c(1,1), ..., X_c(1,W); X_c(2,1), ..., X_c(2,W); ...; X_c(H,1), ..., X_c(H,W)}
wherein Z_a and Z_m denote the average-pooled output and the maximum-pooled output, respectively;
in order to fully capture the dependencies between channels, the outputs of the average pooling and maximum pooling operations are passed in turn into a convolutional layer and a nonlinear transformation layer, and the results of the two layers are then fused through a fusion module; the fusion module is composed of a multilayer perceptron and generates the CA map Y_c ∈ R^{C/r×1×1}, wherein the perceptron is designed as a fully connected layer with a dimensionality reduction ratio r; a Sigmoid excitation mechanism is adopted to give the model flexibility;
the channel attention module is described as follows:
Y_c = g_s(P_avg(X_c) + P_max(X_c))
wherein g_s denotes Sigmoid activation, + denotes fusion by the fully connected perceptron, and P_avg and P_max are average pooling and maximum pooling, respectively.
Preferably, step S104 further includes, in CS_ResNet:
introducing a spatial attention module into the constructed model to highlight attention to valuable regions; for a feature map X_s ∈ R^{C×H×W}, an effective spatial attention module exploits the spatial relationships of the features and corresponds to Y_s ∈ R^{1×H×W}; given the input X_s of the spatial attention module, it passes sequentially through average pooling, maximum pooling, a convolution operation, and a nonlinear transformation to obtain the output Y_s of the spatial attention module; a feature map of size 1 × H × W is thus obtained by using average pooling and maximum pooling;
specifically, to generate a 2D spatial attention map across channels, given the input X_s, the output of the spatial attention module is calculated as follows:
Y_s = g_s(Cat(P_max(X_s), P_avg(X_s)))
wherein g_s denotes the Sigmoid activation function, Cat denotes the concatenation operation, P_avg is average pooling, and P_max is maximum pooling;
for the CS_ResNet model, which mixes channel attention and spatial attention: if channel attention teaches the model what to attend to, spatial attention lets the model know where to attend;
assuming the input of CS_ResNet is X_r, the output is calculated as follows:
Y_r = g_r((((g_r(X_r * k) * k) × Y_c) × Y_s) + X_r)
wherein g_r denotes the ReLU activation function, k denotes a convolution kernel, and *, ×, and + are the convolution, multiplication, and addition operations, respectively; Y_c is the output of the channel attention module, and Y_s is the output of the spatial attention module.
Preferably, in step S105, the global feature is formed by fusing the local features through the fully connected layer, and the softmax classifier is then used to calculate the score of each category, thereby obtaining the driver behavior recognition result.
The embodiment of the invention also provides a device for identifying the driving behavior of the driver, which comprises:
the image acquisition unit is used for acquiring a driver image containing the driving behavior of the driver;
a random cropping unit, configured to process the driver image by using a data enhancement technique of random cropping to reduce a resolution of the driver image, thereby obtaining first image data in a three-dimensional tensor format;
a convolution operation unit, configured to process the first image data in the three-dimensional tensor format using a convolutional neural network to generate second image data; the convolution operation employs a tensor-based three-dimensional convolution to reduce the dimensionality of the first image data and create invariance to small distortions and shifts;
an input unit, configured to input the second image data into the constructed CS_ResNet model; in the CS_ResNet model, a channel attention module and a spatial attention module are serially connected and embedded in a residual network, both modules using maximum pooling and average pooling to reduce model computational complexity and improve recognition accuracy;
a classification unit, configured to fuse the local features output by the residual network model through a fully connected layer to form a global feature, and then calculate the score of each category using a classifier;
and a recognition unit, configured to obtain a driver behavior recognition result according to the score of each category.
The embodiment of the present invention further provides a driver driving behavior recognition device, which includes a memory and a processor, where the memory stores a computer program, and the computer program can be executed by the processor to implement the above driver driving behavior recognition method.
Embodiments of the present invention further provide a computer-readable storage medium, which stores a computer program, where the computer program is executable by a processor of a device on which the storage medium is located, so as to implement the above-mentioned method for identifying driving behavior of a driver.
In summary, in this embodiment, the channel attention module and the spatial attention module are serially connected and embedded into the residual network: the channel attention module fully captures the dependencies between channels and teaches the model which features to attend to, while the spatial attention module indicates where to attend. Combined, the two modules realize adaptive feature extraction, thereby reducing model computational complexity and improving recognition accuracy.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the invention and, together with the description, serve to explain the invention and not to limit the invention. In the drawings:
fig. 1 is a schematic flow chart of a method for identifying driving behavior of a driver according to a first embodiment of the present invention.
Fig. 2 is a working schematic diagram of a driving behavior recognition method for a driver according to a first embodiment of the present invention.
Fig. 3 is a schematic view of different driving behaviors.
Fig. 4 is an architecture diagram of the CS_ResNet model according to the first embodiment of the present invention.
Fig. 5 is a schematic structural diagram of a driving behavior recognition apparatus for a driver according to a second embodiment of the present invention.
Detailed Description
The invention is described in detail below with reference to the drawings and specific embodiments. These embodiments are merely examples of the present invention and do not cover all possible embodiments.
Referring to fig. 1 and fig. 2, a first embodiment of the present invention provides a method for identifying driving behavior of a driver, which includes:
S101, obtaining a driver image containing the driving behavior of the driver.
In this embodiment, for example, a driver behavior video may be recorded by a built-in vehicle-mounted camera and split frame by frame into images of size 1920 × 1080 to obtain a driver data set. Considering the needs of model training and validation, in one possible implementation of this embodiment the driver data set contains 17308 images in total, among them 12978 training images and 4331 test images.
The data set contains 10 classes of driver behavior: C0: safe driving; C1: texting with the right hand; C2: texting with the left hand; C3: making a phone call with the right hand; C4: making a phone call with the left hand; C5: adjusting the radio; C6: drinking; C7: reaching behind; C8: touching hair/applying makeup; C9: talking to a passenger, as shown in Fig. 3.
Of course, it should be noted that in other embodiments of the present invention, different driver behaviors may be defined according to actual needs; such schemes all fall within the protection scope of the present invention.
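As an illustration, such a data set can be assembled with standard tooling. The following is a minimal sketch assuming the extracted frames are stored one folder per class (C0-C9) under a hypothetical "driver_frames" directory; the folder layout and the 75/25 split ratio (which approximates the 12978/4331 division above) are assumptions, not part of the embodiment.

```python
# A minimal sketch of assembling the driver data set; the "driver_frames"
# directory with one subfolder per class C0-C9 is a hypothetical layout.
import torch
from torchvision import datasets

dataset = datasets.ImageFolder("driver_frames")   # 10 class folders, C0-C9
n_train = int(0.75 * len(dataset))                # approximates 12978 of 17308
train_set, test_set = torch.utils.data.random_split(
    dataset, [n_train, len(dataset) - n_train]
)
```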
S102, processing the driver image by adopting a data enhancement technology of random cutting to reduce the resolution of the driver image, thereby obtaining first image data in a three-dimensional tensor format.
For example, in this embodiment, the input original image data is reduced from (1920 × 1080 × 3) to first image data of (224 × 224 × 3) by processing the driver image data with the random-cropping image enhancement technique. Random cropping changes the size of a sample and improves the quality of the training data set, making the training data as close as possible to the test data and thereby improving model performance.
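For concreteness, this step can be sketched with standard image transforms; the following assumes a torchvision pipeline, where RandomResizedCrop is an assumed stand-in for the random cropping described above.

```python
# A minimal sketch of step S102, assuming a torchvision pipeline;
# RandomResizedCrop is an assumed stand-in for the random-cropping data
# enhancement that reduces a 1920x1080x3 frame to 224x224x3.
import torchvision.transforms as T

augment = T.Compose([
    T.RandomResizedCrop(224),  # random crop of the frame, rescaled to 224x224
    T.ToTensor(),              # PIL image -> 3x224x224 float tensor in [0, 1]
])
# first_image = augment(pil_frame)  # the first image data in tensor format
```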
S103, processing the first image data in the three-dimensional tensor format by using a convolutional neural network to generate second image data; wherein the convolution operation employs a tensor-based three-dimensional convolution operation to reduce dimensionality of the first image data and create invariance to small distortions and shifts.
In this embodiment, the first image data in the three-dimensional tensor format from step S102 is processed using a convolutional neural network (CNN) composed of a convolutional layer, a ReLU activation function, and a pooling layer.
The input to the CNN takes the form of three two-dimensional arrays corresponding to the RGB channels. In the convolutional layer of the CNN, each unit is connected to a local patch in the feature map of the previous layer by a set of weights called a filter bank. In the jth feature map of the ith layer, the convolution value of the unit at position (x, y) is:
v_{i,j}(x, y) = b_{i,j} + Σ_m Σ_{p=0}^{P_i−1} Σ_{q=0}^{Q_i−1} w_{i,j,m}(p, q) · v_{i−1,m}(x+p, y+q)
wherein b_{i,j} is the bias of the jth feature map in the ith layer, w_{i,j,m}(p, q) is the weight at position (p, q) of the filter connecting the mth feature map of the previous layer to the jth feature map of the ith layer, v_{i−1,m}(x+p, y+q) is the value of the previous feature map at location (x+p, y+q), and P_i and Q_i are the height and width of the kernel, respectively. The output of the convolution operation is then passed through a nonlinear transformation (e.g., ReLU or Sigmoid). ReLU is the rectified linear unit, defined as follows:
f(x)=max(0,x)
where x is the input to the nonlinear function. Different feature maps in a layer use different filter banks, while all units within one feature map share the same filter bank. The reason is that local groups of values in image data are typically highly correlated, so local patterns can be detected easily. In the CNN model, convolutional layers detect local conjunctions of features from the previous layer, and pooling layers fuse similar features to reliably detect patterns. Maximum pooling and average pooling are two typical pooling methods: maximum pooling computes the maximum of a local block of units in the feature map, while average pooling computes their average. A block can be shifted by several rows or columns and used as input to adjacent pooling units; the dimensionality of the data is thus reduced and invariance to small distortions and shifts is created. Stages of convolution, nonlinearity, and pooling are stacked, and gradients are backpropagated through the entire deep network, which allows the weights in all filter banks to be trained.
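For concreteness, one convolution / ReLU / pooling stage as described above can be sketched in PyTorch as follows; the channel count, kernel sizes, and strides are illustrative assumptions rather than values fixed by this embodiment.

```python
# A minimal sketch of one convolution / ReLU / pooling stage; layer sizes
# are illustrative assumptions.
import torch
import torch.nn as nn

stem = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),  # filter bank w_{i,j,m}
    nn.ReLU(inplace=True),                                 # f(x) = max(0, x)
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),      # max over local blocks
)

x = torch.randn(1, 3, 224, 224)  # batched 224x224x3 first image data
print(stem(x).shape)             # torch.Size([1, 64, 56, 56])
```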
S104, inputting the second image data into the constructed CS _ ResNet model; in the CS _ ResNet model, a channel attention module and a space attention module are serially connected and embedded in a residual error network, and the channel attention module and the space attention module utilize maximum pooling and average pooling to reduce model calculation complexity and improve system identification accuracy.
In this embodiment, the CS_ResNet model consists of convolution, pooling, activation functions, channel attention, and spatial attention, where the channel attention module (CA) and the spatial attention module (SA) are serially connected and embedded in the residual network.
In this embodiment, the output of step S103 serves as the input to CS_ResNet; the channel attention and spatial attention use maximum pooling and average pooling to reduce the computational complexity of the model, which effectively alleviates the degradation problem and improves recognition accuracy.
The method for constructing the CS_ResNet model specifically comprises the following steps:
first, two methods, average pooling and maximum pooling, are employed to efficiently calculate channel attention.
Average pooling compresses the spatial dimension of the input and learns the extent of the target object, while maximum pooling collects clues about distinctive object features. This embodiment combines average pooling and maximum pooling to infer channel attention. Given an intermediate feature map X_c ∈ R^{C×H×W} as the input of CA, the average pooling Z_a and the maximum pooling Z_m are calculated as follows:
Z_a = (1/(H·W)) Σ_{h=1}^{H} Σ_{w=1}^{W} X_c(h, w)
Z_m = max{X_c(1,1), ..., X_c(1,W); X_c(2,1), ..., X_c(2,W); ...; X_c(H,1), ..., X_c(H,W)}
wherein Z_a and Z_m denote the average-pooled output and the maximum-pooled output, respectively.
In order to fully capture the dependencies between channels, the outputs of the average pooling and maximum pooling operations are passed in turn into a convolutional layer and a nonlinear transformation layer, and the results of the two layers are then fused through a fusion module. The fusion module is composed of a multilayer perceptron and generates the CA map Y_c ∈ R^{C/r×1×1}. To lower the model complexity, the perceptron is designed as a fully connected layer with a dimensionality reduction ratio r, and a simple Sigmoid excitation mechanism is adopted to give the model flexibility. In summary, CA is described as follows:
Y_c = g_s(P_avg(X_c) + P_max(X_c))
wherein g_s denotes Sigmoid activation, + denotes fusion by the fully connected perceptron, and P_avg and P_max are average pooling and maximum pooling, respectively.
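As an illustration, the channel attention just described can be sketched in PyTorch as follows; the reduction ratio r = 16 and the use of 1×1 convolutions to realize the shared fully connected perceptron are assumptions.

```python
# A minimal CBAM-style channel attention sketch: average pooling and maximum
# pooling fused by a shared perceptron with reduction ratio r, then Sigmoid.
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels: int, r: int = 16):
        super().__init__()
        self.avg_pool = nn.AdaptiveAvgPool2d(1)  # Z_a: C x 1 x 1
        self.max_pool = nn.AdaptiveMaxPool2d(1)  # Z_m: C x 1 x 1
        self.mlp = nn.Sequential(                # shared perceptron, reduction r
            nn.Conv2d(channels, channels // r, kernel_size=1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // r, channels, kernel_size=1, bias=False),
        )
        self.sigmoid = nn.Sigmoid()              # g_s

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Y_c = g_s(MLP(P_avg(X_c)) + MLP(P_max(X_c)))
        return self.sigmoid(self.mlp(self.avg_pool(x)) + self.mlp(self.max_pool(x)))
```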
The SA module is then introduced into the constructed model to highlight attention to valuable regions. For a feature map X_s ∈ R^{C×H×W}, an effective SA exploits the spatial relationships of the features and corresponds to Y_s ∈ R^{1×H×W}. Given the input X_s of SA, it passes sequentially through average pooling, maximum pooling, a convolution operation, and a nonlinear transformation to obtain the SA output Y_s.
A feature map of size 1 × H × W can be obtained by using the two pooling operations. Specifically, a 2D SA map across channels is generated. Given the input X_s, the output of SA is calculated as follows:
Y_s = g_s(Cat(P_max(X_s), P_avg(X_s)))
wherein g_s denotes the Sigmoid activation function, Cat denotes the concatenation operation, P_avg is average pooling, and P_max is maximum pooling.
For the CS_ResNet model, in which CA and SA are mixed: if CA teaches the model what to attend to, SA lets the model know where to attend; SA is thus a complement to CA. Owing to their lightweight computation, the attention modules can be integrated into a DNN model such as a residual network. Fig. 4 depicts the CS_ResNet model framework, in which the serially connected CA and SA are embedded in the residual network. Assuming the input of CS_ResNet is X_r, the output is calculated as follows:
Y_r = g_r((((g_r(X_r * k) * k) × Y_c) × Y_s) + X_r)
wherein g_r denotes the ReLU activation function, k denotes a convolution kernel, and *, ×, and + are the convolution, multiplication, and addition operations, respectively; Y_c is the output of CA, and Y_s is the output of SA. In this sequential arrangement, the CS_ResNet model is effective for driver behavior recognition.
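Correspondingly, the spatial attention module and the residual block of Fig. 4 can be sketched as follows, reusing the ChannelAttention class from the previous sketch; the two 3×3 convolutions standing in for the kernel k and the 7×7 spatial-attention kernel are assumptions.

```python
# A sketch of the spatial attention module and the CS_ResNet residual block
# Y_r = g_r((((g_r(X_r*k)*k) x Y_c) x Y_s) + X_r); kernel sizes are assumed.
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        # 2-channel input: channel-wise max-pooled and average-pooled maps
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)
        self.sigmoid = nn.Sigmoid()                # g_s

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z_max = x.max(dim=1, keepdim=True).values  # P_max over channels: 1 x H x W
        z_avg = x.mean(dim=1, keepdim=True)        # P_avg over channels: 1 x H x W
        # Y_s = g_s(conv(Cat(P_max(X_s), P_avg(X_s))))
        return self.sigmoid(self.conv(torch.cat([z_max, z_avg], dim=1)))

class CSResNetBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.ca = ChannelAttention(channels)       # from the previous sketch
        self.sa = SpatialAttention()
        self.relu = nn.ReLU(inplace=True)          # g_r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.conv2(self.relu(self.conv1(x))) # (g_r(X_r * k)) * k
        out = out * self.ca(out)                   # x Y_c : channel attention
        out = out * self.sa(out)                   # x Y_s : spatial attention
        return self.relu(out + x)                  # g_r(... + X_r)
```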
S105, fusing the local features output by the residual network model through a fully connected layer to form a global feature, and then calculating the score of each category with a classifier.
S106, obtaining a driver behavior recognition result according to the score of each category.
In this embodiment, the global feature is formed by fusing the local features through the fully connected layer, and the softmax classifier is then used to calculate the score of each category, thereby obtaining the driver behavior recognition result.
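Steps S105-S106 can be sketched as follows; global average pooling feeding a fully connected layer is one common realization of fusing local features into a global feature, and the 512-channel feature width is an assumption.

```python
# A minimal sketch of steps S105-S106: fuse local features into a global
# feature, score the 10 behavior classes C0-C9, and take the arg-max as the
# recognition result. The 512-channel feature width is an assumption.
import torch
import torch.nn as nn

head = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),  # fuse local feature maps into a global descriptor
    nn.Flatten(),
    nn.Linear(512, 10),       # fully connected layer: scores for 10 classes
)

features = torch.randn(1, 512, 7, 7)  # CS_ResNet output (assumed shape)
scores = head(features)
probs = torch.softmax(scores, dim=1)  # softmax classifier
pred = probs.argmax(dim=1)            # driver behavior recognition result
print(pred)
```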
In summary, in this embodiment, the channel attention module and the spatial attention module are serially connected and embedded into the residual network: the channel attention module fully captures the dependencies between channels and teaches the model which features to attend to, while the spatial attention module indicates where to attend. Combined, the two modules realize adaptive feature extraction, thereby reducing model computational complexity and improving recognition accuracy.
Referring to fig. 5, a second embodiment of the present invention further provides a driving behavior recognition apparatus for a driver, including:
an image acquisition unit 210 for acquiring a driver image including a driving behavior of the driver;
a random cropping unit 220, configured to process the driver image using a random-cropping data enhancement technique to reduce its resolution, thereby obtaining first image data in a three-dimensional tensor format;
a convolution operation unit 230, configured to process the first image data in the three-dimensional tensor format using a convolutional neural network to generate second image data; the convolution operation employs a tensor-based three-dimensional convolution to reduce the dimensionality of the first image data and create invariance to small distortions and shifts;
an input unit 240, configured to input the second image data into the constructed CS_ResNet model; in the CS_ResNet model, a channel attention module and a spatial attention module are serially connected and embedded in a residual network, both modules using maximum pooling and average pooling to reduce model computational complexity and improve recognition accuracy;
a classification unit 250, configured to fuse the local features output by the residual network model through a fully connected layer to form a global feature, and then calculate the score of each category using a classifier;
and a recognition unit 260, configured to obtain the driver behavior recognition result according to the score of each category.
The third embodiment of the present invention also provides a driver driving behavior recognition apparatus, which includes a memory and a processor, wherein the memory stores a computer program, and the computer program can be executed by the processor to realize the driver driving behavior recognition method.
The fourth embodiment of the present invention also provides a computer-readable storage medium storing a computer program executable by a processor of an apparatus on which the storage medium is located to implement the above-described driver's driving behavior recognition method.
Illustratively, the computer programs described in the above embodiments may be partitioned into one or more modules, which are stored in the memory and executed by the processor to implement the present invention. The one or more modules may be a series of computer program instruction segments capable of performing particular functions.
The processor may be a Central Processing Unit (CPU), another general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, discrete hardware components, etc. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The memory may be used to store the computer programs and/or modules, and the processor implements the various functions of this embodiment by running or executing the computer programs and/or modules stored in the memory and calling the data stored in the memory. The memory may mainly include a program storage area and a data storage area: the program storage area may store an operating system and application programs required by at least one function (such as a sound playing function or a text conversion function), and the data storage area may store data created according to the use of the device (such as audio data or text message data). In addition, the memory may include high speed random access memory, and may also include non-volatile memory, such as a hard disk, a memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), at least one magnetic disk storage device, a Flash memory device, or other non-volatile solid state storage device.
If the modules are implemented in the form of software functional units and sold or used as stand-alone products, they can be stored in a computer readable storage medium. Based on such understanding, all or part of the flow of the method according to the embodiments of the present invention may also be implemented by a computer program, which may be stored in a computer-readable storage medium; when the computer program is executed by a processor, the steps of the method embodiments are implemented. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, or some intermediate form. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), electrical carrier wave signals, telecommunication signals, a software distribution medium, and the like. It should be noted that the content of the computer readable medium may be appropriately increased or decreased as required by legislation and patent practice in a jurisdiction; for example, in some jurisdictions, computer readable media do not include electrical carrier signals and telecommunication signals.
It should be noted that the above-described device embodiments are merely illustrative: the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. In addition, in the drawings of the apparatus embodiments provided by the present invention, the connection relationship between modules indicates a communication connection between them, which may be specifically implemented as one or more communication buses or signal lines. Those of ordinary skill in the art can understand and implement this without inventive effort.
The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (8)

1. A driver driving behavior recognition method, characterized by comprising:
S101, acquiring a driver image containing the driving behavior of a driver;
S102, processing the driver image with a random-cropping data enhancement technique to reduce its resolution, thereby obtaining first image data in a three-dimensional tensor format;
S103, processing the first image data in the three-dimensional tensor format with a convolutional neural network to generate second image data; the convolution operation employs a tensor-based three-dimensional convolution to reduce the dimensionality of the first image data and create invariance to small distortions and shifts;
S104, inputting the second image data into the constructed CS_ResNet model; in the CS_ResNet model, a channel attention module and a spatial attention module are serially connected and embedded in a residual network, both modules using maximum pooling and average pooling to reduce model computational complexity and improve recognition accuracy;
S105, fusing the local features output by the residual network model through a fully connected layer to form a global feature, and then calculating the score of each category with a classifier;
and S106, obtaining a driver behavior recognition result according to the score of each category.
2. The driver driving behavior recognition method according to claim 1, wherein the driver behaviors include 10 different types: C0: safe driving; C1: texting with the right hand; C2: texting with the left hand; C3: making a phone call with the right hand; C4: making a phone call with the left hand; C5: adjusting the radio; C6: drinking; C7: reaching behind; C8: touching hair/applying makeup; C9: talking to a passenger.
3. The driver driving behavior recognition method according to claim 1, wherein in step S102, the 1920 × 1080 × 3 original image is reduced to the 224 × 224 × 3 first image by random cropping.
4. The driver driving behavior recognition method according to claim 1, wherein the convolutional neural network is composed of a three-dimensional convolutional layer, a ReLU activation function, and a pooling layer; wherein:
in the three-dimensional convolutional layer of the convolutional neural network, each unit is connected to a local patch in the feature map of the previous layer by a set of weights called a filter bank; in the jth feature map of the ith layer, the convolution value of the unit at position (x, y) is calculated as:
v_{i,j}(x, y) = b_{i,j} + Σ_m Σ_{p=0}^{P_i−1} Σ_{q=0}^{Q_i−1} w_{i,j,m}(p, q) · v_{i−1,m}(x+p, y+q)
wherein b_{i,j} is the bias of the jth feature map in the ith layer, w_{i,j,m}(p, q) is the weight at position (p, q) of the filter connecting the mth feature map of the previous layer to the jth feature map of the ith layer, v_{i−1,m}(x+p, y+q) is the value of the previous feature map at location (x+p, y+q), and P_i and Q_i are the height and width of the kernel, respectively;
then, the result of the convolution operation is passed through a nonlinear transformation such as ReLU or Sigmoid; ReLU is the rectified linear unit, defined as follows:
f(x)=max(0,x)
wherein x is the input of the nonlinear function; different feature maps in a layer use different filter banks, while all units within one feature map share the same filter bank;
the pooling layer is used to fuse similar features so as to reliably detect patterns; maximum pooling and average pooling are two typical pooling methods: maximum pooling computes the maximum of a local block of units in the feature map, while average pooling computes their average; a block can be shifted by several rows or columns and used as input to adjacent pooling units.
5. The driver driving behavior recognition method according to claim 4, wherein in step S104, in the CS_ResNet model, two methods, average pooling and maximum pooling, are employed to calculate the channel attention;
average pooling is used to compress the spatial dimension of the input and learn the extent of the target object;
maximum pooling is used to collect clues about distinctive object features;
for a given intermediate feature map X_c ∈ R^{C×H×W} as input to the channel attention module, the average pooling Z_a and the maximum pooling Z_m are calculated as follows:
Z_a = (1/(H·W)) Σ_{h=1}^{H} Σ_{w=1}^{W} X_c(h, w)
Z_m = max{X_c(1,1), ..., X_c(1,W); X_c(2,1), ..., X_c(2,W); ...; X_c(H,1), ..., X_c(H,W)}
wherein Z_a and Z_m denote the average-pooled output and the maximum-pooled output, respectively;
in order to fully capture the dependencies between channels, the outputs of the average pooling and maximum pooling operations are passed in turn into a convolutional layer and a nonlinear transformation layer, and the results of the two layers are then fused through a fusion module; the fusion module is composed of a multilayer perceptron and generates the CA map Y_c ∈ R^{C/r×1×1}, wherein the perceptron is designed as a fully connected layer with a dimensionality reduction ratio r; a Sigmoid excitation mechanism is adopted to give the model flexibility;
the channel attention module is described as follows:
Y_c = g_s(P_avg(X_c) + P_max(X_c))
wherein g_s denotes Sigmoid activation, + denotes fusion by the fully connected perceptron, and P_avg and P_max are average pooling and maximum pooling, respectively.
6. The driver driving behavior recognition method according to claim 5, wherein step S104, in CS_ResNet, further comprises:
introducing a spatial attention module into the constructed model to highlight attention to valuable regions; for a feature map X_s ∈ R^{C×H×W}, an effective spatial attention module exploits the spatial relationships of the features and corresponds to Y_s ∈ R^{1×H×W}; given the input X_s of the spatial attention module, it passes sequentially through average pooling, maximum pooling, a convolution operation, and a nonlinear transformation to obtain the output Y_s of the spatial attention module; a feature map of size 1 × H × W is thus obtained by using average pooling and maximum pooling;
specifically, to generate a 2D spatial attention map across channels, given the input X_s, the output of the spatial attention module is calculated as follows:
Y_s = g_s(Cat(P_max(X_s), P_avg(X_s)))
wherein g_s denotes the Sigmoid activation function, Cat denotes the concatenation operation, P_avg is average pooling, and P_max is maximum pooling;
assuming the input of CS_ResNet is X_r, the output is calculated as follows:
Y_r = g_r((((g_r(X_r * k) * k) × Y_c) × Y_s) + X_r)
wherein g_r denotes the ReLU activation function, k denotes a convolution kernel, and *, ×, and + are the convolution, multiplication, and addition operations, respectively; Y_c is the output of the channel attention module, and Y_s is the output of the spatial attention module.
7. The driver driving behavior recognition method according to claim 1, wherein in step S105, the global feature is formed by fusing the local features through the fully connected layer, and the softmax classifier is then used to calculate the score of each category, thereby obtaining the driving behavior recognition result.
8. A driver driving behavior recognition apparatus characterized by comprising:
the image acquisition unit is used for acquiring a driver image containing the driving behavior of the driver;
a random cropping unit, configured to process the driver image using a random-cropping data enhancement technique to reduce its resolution, thereby obtaining first image data in a three-dimensional tensor format;
a convolution operation unit, configured to process the first image data in the three-dimensional tensor format using a convolutional neural network to generate second image data; the convolution operation employs a tensor-based three-dimensional convolution to reduce the dimensionality of the first image data and create invariance to small distortions and shifts;
an input unit, configured to input the second image data into the constructed CS_ResNet model; in the CS_ResNet model, a channel attention module and a spatial attention module are serially connected and embedded in a residual network, both modules using maximum pooling and average pooling to reduce model computational complexity and improve recognition accuracy;
and a classification unit, configured to fuse the local features output by the residual network model through a fully connected layer to form a global feature, and then calculate the score of each category using a classifier.
CN202110569233.3A 2021-05-25 2021-05-25 Method, device and equipment for identifying driving behavior of driver and readable storage medium Pending CN113283338A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110569233.3A CN113283338A (en) 2021-05-25 2021-05-25 Method, device and equipment for identifying driving behavior of driver and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110569233.3A CN113283338A (en) 2021-05-25 2021-05-25 Method, device and equipment for identifying driving behavior of driver and readable storage medium

Publications (1)

Publication Number Publication Date
CN113283338A true CN113283338A (en) 2021-08-20

Family

ID=77281276

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110569233.3A Pending CN113283338A (en) 2021-05-25 2021-05-25 Method, device and equipment for identifying driving behavior of driver and readable storage medium

Country Status (1)

Country Link
CN (1) CN113283338A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114241454A (en) * 2021-12-20 2022-03-25 东南大学 Method for recognizing distracted driving by using remapping attention
CN114241453A (en) * 2021-12-20 2022-03-25 东南大学 Driver distraction monitoring method utilizing key point attention
CN114343640A (en) * 2022-01-07 2022-04-15 北京师范大学 Attention assessment method and electronic equipment
CN117842923A (en) * 2024-02-06 2024-04-09 浙江驿公里智能科技有限公司 Control system and method of intelligent full-automatic oiling robot

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109871777A (en) * 2019-01-23 2019-06-11 广州智慧城市发展研究院 A kind of Activity recognition system based on attention mechanism
CN110059582A (en) * 2019-03-28 2019-07-26 东南大学 Driving behavior recognition methods based on multiple dimensioned attention convolutional neural networks
CN111428699A (en) * 2020-06-10 2020-07-17 南京理工大学 Driving fatigue detection method and system combining pseudo-3D convolutional neural network and attention mechanism
CN112016499A (en) * 2020-09-04 2020-12-01 山东大学 Traffic scene risk assessment method and system based on multi-branch convolutional neural network
CN112149504A (en) * 2020-08-21 2020-12-29 浙江理工大学 Motion video identification method combining residual error network and attention of mixed convolution

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109871777A (en) * 2019-01-23 2019-06-11 广州智慧城市发展研究院 A kind of Activity recognition system based on attention mechanism
CN110059582A (en) * 2019-03-28 2019-07-26 东南大学 Driving behavior recognition methods based on multiple dimensioned attention convolutional neural networks
CN111428699A (en) * 2020-06-10 2020-07-17 南京理工大学 Driving fatigue detection method and system combining pseudo-3D convolutional neural network and attention mechanism
CN112149504A (en) * 2020-08-21 2020-12-29 浙江理工大学 Motion video identification method combining residual error network and attention of mixed convolution
CN112016499A (en) * 2020-09-04 2020-12-01 山东大学 Traffic scene risk assessment method and system based on multi-branch convolutional neural network

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
LEI ZHAO ET AL.: "Driver behavior detection via adaptive spatial attention mechanism", 《ADVANCED ENGINEERING INFORMATICS 》 *
S. JI ET AL.: "3D Convolutional Neural Networks for Human Action Recognition", 《IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE》 *
WOO S. ET AL.: "CBAM: Convolutional Block Attention Module", 《ECCV 2018. LECTURE NOTES IN COMPUTER SCIENCE》 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114241454A (en) * 2021-12-20 2022-03-25 东南大学 Method for recognizing distracted driving by using remapping attention
CN114241453A (en) * 2021-12-20 2022-03-25 东南大学 Driver distraction monitoring method utilizing key point attention
CN114241453B (en) * 2021-12-20 2024-03-12 东南大学 Driver distraction driving monitoring method utilizing key point attention
CN114241454B (en) * 2021-12-20 2024-04-23 东南大学 Method for identifying distraction driving by using remapping attention
CN114343640A (en) * 2022-01-07 2022-04-15 北京师范大学 Attention assessment method and electronic equipment
CN114343640B (en) * 2022-01-07 2023-10-13 北京师范大学 Attention assessment method and electronic equipment
CN117842923A (en) * 2024-02-06 2024-04-09 浙江驿公里智能科技有限公司 Control system and method of intelligent full-automatic oiling robot

Similar Documents

Publication Publication Date Title
CN110059582B (en) Driver behavior identification method based on multi-scale attention convolution neural network
CN113283338A (en) Method, device and equipment for identifying driving behavior of driver and readable storage medium
US11783601B2 (en) Driver fatigue detection method and system based on combining a pseudo-3D convolutional neural network and an attention mechanism
Kopuklu et al. Driver anomaly detection: A dataset and contrastive learning approach
CN108388888B (en) Vehicle identification method and device and storage medium
US20220277558A1 (en) Cascaded Neural Network-Based Attention Detection Method, Computer Device, And Computer-Readable Storage Medium
Moslemi et al. Driver distraction recognition using 3d convolutional neural networks
CN115082698B (en) Distraction driving behavior detection method based on multi-scale attention module
Xiao et al. Attention-based deep neural network for driver behavior recognition
CN106709475A (en) Obstacle recognition method and device, computer equipment and readable storage medium
CN205230272U (en) Driver drive state monitoring system
CN111967319B (en) Living body detection method, device, equipment and storage medium based on infrared and visible light
CN111401196A (en) Method, computer device and computer readable storage medium for self-adaptive face clustering in limited space
CN111860316A (en) Driving behavior recognition method and device and storage medium
CN111950362B (en) Golden monkey face image recognition method, device, equipment and storage medium
CN115331205A (en) Driver fatigue detection system with cloud edge cooperation
He et al. A lightweight architecture for driver status monitoring via convolutional neural networks
Wagner et al. Vision based detection of driver cell phone usage and food consumption
Suresh et al. Driver drowsiness detection using deep learning
CN114360073A (en) Image identification method and related device
CN114492634A (en) Fine-grained equipment image classification and identification method and system
Pandey et al. Dumodds: Dual modeling approach for drowsiness detection based on spatial and spatio-temporal features
CN115205923A (en) Micro-expression recognition method based on macro-expression state migration and mixed attention constraint
CN114937300A (en) Method and system for identifying shielded face
CN113537176A (en) Method, device and equipment for determining fatigue state of driver

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination