CN113780385A - Driving risk monitoring method based on attention mechanism - Google Patents


Info

Publication number
CN113780385A
Authority
CN
China
Prior art keywords
attention
channel
layer
feature
driving
Prior art date
Legal status
Pending
Application number
CN202111001093.6A
Other languages
Chinese (zh)
Inventor
魏翼鹰
陈威
李志成
邹琳
张晖
杨杰
袁鹏举
张勇
文宝毅
Current Assignee
Wuhan University of Technology WUT
Original Assignee
Wuhan University of Technology WUT
Priority date
Filing date
Publication date
Application filed by Wuhan University of Technology WUT filed Critical Wuhan University of Technology WUT
Priority to CN202111001093.6A
Publication of CN113780385A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/24: Classification techniques
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a driving risk monitoring method based on an attention mechanism, which comprises the following steps: acquiring a plurality of driving behavior sample images containing annotation information to form a driving picture data set, wherein the annotation information is the corresponding actual driving behavior classification; inputting the driving picture data set into a feature extractor of a neural network for feature extraction, then sequentially passing the extracted features through a channel attention layer and a spatial attention layer of the neural network to form a feature matrix, inputting the feature matrix into a top-level classifier of the neural network, and outputting a predicted driving behavior classification; and training the neural network to convergence according to the actual driving behavior classification and the predicted driving behavior classification, and determining the fully trained neural network. The method greatly improves recognition accuracy while reducing the misjudgment rate, identifies distracted driving behavior in real time, and generalizes to different scenes with only a small number of samples; more importantly, real-time performance is fully guaranteed.

Description

Driving risk monitoring method based on attention mechanism
Technical Field
The invention relates to the technical field of driving risk identification, in particular to a driving risk monitoring method based on an attention mechanism.
Background
According to the World Health Organization's 2018 global status report on road safety, road traffic accidents are the eighth leading cause of death worldwide: about 1.35 million people die in traffic accidents every year, and as many as 50 million are injured. The number of road accidents caused by driver distraction has increased year by year, with about 65% of near-collisions and 80% of vehicle collisions attributed to driver distraction. Driving fatigue and distracted driving behavior are important factors behind driving risk; accurately identifying risky driving behavior in real time makes it possible to warn the driver promptly and effectively reduce traffic accidents caused by distracted driving.
At present, methods that detect the driver's physiological parameters, such as electroencephalogram (EEG), electrocardiogram (ECG) and electromyogram (EMG) signals, offer high accuracy but require the driver to wear professional equipment, which interferes with driving. Therefore, a camera installed in the cab is generally used to collect picture information of the driver and thereby monitor the driving state. Current approaches mainly recognize the driver's facial features through various image recognition algorithms and focus on whether the driver's mental state is fatigued in order to judge the degree of distraction. Such methods are limited to judging fatigue, attend only to local features of the driver, and ignore other subjective distractions the driver may engage in, such as making or answering phone calls and drinking water. Meanwhile, the accuracy of image analysis based on facial features still needs improvement: accurate judgment of eye features requires wearing an eye tracker, and the fatigue indices involve considerable subjective judgment.
Some deep learning algorithms have been developed for drivers' subjective distraction behaviors, mainly classifying action pictures of these behaviors with a neural network. However, the neural network architectures currently adopted have too many parameters: they require large numbers of training samples, are difficult and costly to train, and well-labeled samples are hard to obtain. Their generalization ability also remains tied to the training samples; once external factors such as lighting conditions or the in-vehicle environment change, accuracy drops sharply. Real-time performance is difficult to guarantee, sometimes falling below that of traditional algorithms, so such networks cannot be used commercially. Therefore, how to improve the real-time performance and efficiency of driving state monitoring and early warning is an urgent problem to be solved.
Disclosure of Invention
In view of the above, it is necessary to provide a driving risk monitoring method based on attention mechanism, so as to overcome the problem of poor real-time performance of driving state monitoring and early warning in the prior art.
The invention provides a driving risk monitoring method based on an attention mechanism, which comprises the following steps:
acquiring a plurality of driving behavior sample images containing marking information to form a driving picture data set, wherein the marking information is corresponding actual driving behavior classification;
inputting the driving picture data set into a feature extractor of a neural network for feature extraction, then sequentially passing the extracted features through a channel attention layer and a spatial attention layer of the neural network to form a feature matrix, inputting the feature matrix into a top-level classifier of the neural network, and outputting a predicted driving behavior classification;
and training the neural network to convergence according to the actual driving behavior classification and the predicted driving behavior classification, and determining the fully trained neural network.
Further, the acquiring of the plurality of driving behavior sample images containing the annotation information to form the driving picture data set includes:
obtaining a plurality of driving behavior sample images containing the labeling information;
carrying out data enhancement processing on the multiple driving behavior sample images to form corresponding data enhanced images, wherein the data enhancement processing mode comprises at least one of random overturning, random rotating and random cutting;
and preprocessing the plurality of data enhanced images to form the driving picture data set.
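The three enhancement operations named in this claim can be sketched in NumPy as follows; the function names, the 90-degree rotation granularity and the crop size are illustrative choices, not the patent's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_flip(img):
    """Randomly mirror the image horizontally."""
    return img[:, ::-1, :] if rng.random() < 0.5 else img

def random_rotate(img):
    """Rotate by a random multiple of 90 degrees (a crude stand-in
    for the small-angle rotation a real pipeline would use)."""
    return np.rot90(img, k=rng.integers(0, 4), axes=(0, 1))

def random_crop(img, out_h, out_w):
    """Crop a random out_h x out_w window, assuming img is larger."""
    h, w, _ = img.shape
    top = rng.integers(0, h - out_h + 1)
    left = rng.integers(0, w - out_w + 1)
    return img[top:top + out_h, left:left + out_w, :]

sample = rng.random((224, 224, 3))          # one H x W x C driving image
augmented = random_crop(random_rotate(random_flip(sample)), 200, 200)
print(augmented.shape)                      # (200, 200, 3)
```

A production pipeline would apply these per batch during training rather than once per image.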
Further, the feature extractor adopts an EfficientNet network with the fully connected layer at the top of the model removed, and uses depthwise separable convolution modules, an SE attention mechanism and the swish activation function.
Further, the sequential input into the channel attention layer and the spatial attention layer of the neural network includes:
inputting the mixed features extracted by the feature extractor into the channel attention layer, and determining a channel attention weight;
multiplying the channel attention weight by the mixed feature to determine a channel weighting feature;
inputting the channel weighted features into the spatial attention layer, and determining a spatial attention weight;
and multiplying the channel weighted features by the spatial attention weight to determine the final feature matrix.
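The four steps above reduce to two broadcast multiplications. A minimal NumPy sketch, with placeholder attention weights standing in for the computations defined in the dependent claims:

```python
import numpy as np

rng = np.random.default_rng(1)
H, W, C = 7, 7, 32
mixed = rng.random((H, W, C))        # features from the feature extractor

# Placeholder attention weights in (0, 1); the dependent claims define
# how they are actually computed from pooled features.
channel_w = rng.random((1, 1, C))    # one weight per channel
spatial_w = rng.random((H, W, 1))    # one weight per spatial location

channel_weighted = mixed * channel_w           # reweight channels
feature_matrix = channel_weighted * spatial_w  # reweight spatial locations
print(feature_matrix.shape)          # (7, 7, 32)
```

The channel weight broadcasts over the spatial grid and the spatial weight over the channels, so the output keeps the input shape.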
Further, the inputting the mixed features extracted by the feature extractor into the channel attention layer, and the determining the channel attention weight value includes:
inputting the mixed features into a channel global maximum pooling layer and a channel global average pooling layer in the channel attention layer respectively, and determining corresponding channel maximum features and channel average features;
inputting the channel maximum features and the channel average features into a fully connected layer in the channel attention layer;
and adding the two output features of the fully connected layer in the channel attention layer element by element, and then determining the channel attention weight through an activation function.
Further, the inputting the channel weighting feature into the spatial attention layer, and the determining a spatial attention weight value includes:
respectively inputting the channel weighting characteristics into a spatial global maximum pooling layer and a spatial global average pooling layer in the spatial attention layer, and determining corresponding spatial maximum characteristics and spatial average characteristics;
splicing the spatial maximum feature and the spatial average feature based on channel dimensions, and determining a spliced feature map;
and performing a convolution operation on the spliced feature map to reduce it to a single channel dimension, and then generating the spatial attention weight through an activation function.
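A NumPy sketch of these three spatial-attention steps, under stated assumptions: a 1x1 convolution (a single weight pair) stands in for the larger kernel a real implementation would use, and all names are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spatial_attention(x, conv_w):
    """x: (H, W, C) channel-weighted features.
    conv_w: (2,) weights of a 1x1 convolution over the 2-channel map
    (an assumption made for brevity; CBAM-style modules use 7x7)."""
    max_feat = x.max(axis=-1, keepdims=True)    # max pooling over channels
    avg_feat = x.mean(axis=-1, keepdims=True)   # average pooling over channels
    stacked = np.concatenate([max_feat, avg_feat], axis=-1)  # (H, W, 2)
    logits = stacked @ conv_w                   # reduce to a single channel
    return sigmoid(logits)[..., None]           # (H, W, 1) weights in (0, 1)

rng = np.random.default_rng(2)
x = rng.random((7, 7, 32))
w = spatial_attention(x, conv_w=np.array([0.5, 0.5]))
print(w.shape)   # (7, 7, 1)
```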
Further, the top-level classifier sequentially comprises a global average pooling layer and at least one fully connected layer.
Further, the training the neural network to converge, and determining a well-trained neural network includes:
keeping the model parameters of the feature extractor unchanged, iteratively training the model parameters of the channel attention layer, the spatial attention layer and the top-level classifier until the network converges, and determining a corresponding first training network;
and then unfreezing the feature extractor in the first training network, iteratively training the model parameters of the feature extractor, the channel attention layer, the spatial attention layer and the top-level classifier simultaneously until the network converges, and determining the fully trained neural network.
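The freeze-then-fine-tune schedule can be illustrated with a toy parameter dictionary; the single-step "gradient" below is a fake placeholder, purely to show which parameter groups move in each phase:

```python
import numpy as np

# Toy parameter groups mirroring the two training phases: the feature
# extractor is frozen first, then everything is fine-tuned together.
params = {
    "feature_extractor": np.ones(4),
    "channel_attention": np.ones(2),
    "spatial_attention": np.ones(2),
    "classifier":        np.ones(3),
}

def train_step(params, trainable, lr=0.1):
    """Stand-in for one gradient step: only 'trainable' groups move."""
    for name in trainable:
        params[name] -= lr * np.sign(params[name])   # fake gradient of 1

# Phase 1: freeze the (pre-trained) feature extractor.
for _ in range(3):
    train_step(params, ["channel_attention", "spatial_attention", "classifier"])
assert np.allclose(params["feature_extractor"], 1.0)  # untouched so far

# Phase 2: unfreeze and fine-tune all groups jointly.
for _ in range(3):
    train_step(params, list(params))
print(params["feature_extractor"][0])   # now updated (about 0.7)
```

In a deep learning framework the same effect is obtained by toggling the trainable flag of the backbone's layers between the two phases.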
Further, in the model training process, the hyperparameters of the model are optimized by the Hyperband hyperparameter tuning algorithm.
Further, the driving risk monitoring method further includes:
and sending the video of the driver shot by the camera into the neural network with complete training frame by frame, carrying out frame by frame analysis, and determining the corresponding predicted driving behavior classification of each frame.
Compared with the prior art, the invention has the following beneficial effects. Firstly, a plurality of driving behavior sample images are effectively acquired to form a corresponding training sample set, namely the driving picture data set. Then a feature extractor performs preliminary feature extraction, and features are further extracted from the two dimensions of channel and space, so that local feature information in the picture, that is, the key information about the driver, is effectively attended to before being sent to the top-level classifier for image classification, finally judging which distraction behavior the driver in the input picture exhibits. In conclusion, by combining the feature extractor with attention layers in two dimensions in an improved neural network, the invention greatly improves recognition accuracy while reducing the misjudgment rate, identifies distraction behavior in real time, and generalizes to different scenes with only a small number of samples; more importantly, real-time performance is fully guaranteed.
Drawings
FIG. 1 is a schematic view of an embodiment of an application system of a driving risk monitoring method based on an attention mechanism according to the present invention;
FIG. 2 is a schematic flow chart illustrating an embodiment of a driving risk monitoring method based on an attention mechanism according to the present invention;
FIG. 3 is a schematic structural diagram of a neural network according to an embodiment of the present invention;
FIG. 4 is a flowchart illustrating an embodiment of step S1 in FIG. 2 according to the present invention;
FIG. 5 is a diagram illustrating a comparison of accuracy and parameter values of different network models according to an embodiment of the present invention;
FIG. 6 is a flowchart illustrating an embodiment of determining the feature matrix in step S2 of FIG. 2 according to the present invention;
FIG. 7 is a schematic structural diagram of an embodiment of an attention module according to the present invention;
FIG. 8 is a flowchart illustrating an embodiment of step S21 in FIG. 6 according to the present invention;
FIG. 9 is a schematic structural diagram of an embodiment of the channel attention layer provided in the present invention;
FIG. 10 is a flowchart illustrating an embodiment of step S23 in FIG. 6 according to the present invention;
FIG. 11 is a schematic structural diagram of an embodiment of the spatial attention layer provided in the present invention;
FIG. 12 is a flowchart illustrating an embodiment of step S3 in FIG. 2 according to the present invention;
FIG. 13 is a schematic diagram of one embodiment of driving behaviors provided by the present invention;
fig. 14 is a schematic structural diagram of an embodiment of a driving risk monitoring device based on an attention mechanism according to the present invention.
Detailed Description
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate preferred embodiments of the invention and together with the description, serve to explain the principles of the invention and not to limit the scope of the invention.
In the description of the present invention, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implying any number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. Further, "plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
Reference throughout this specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the present invention. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the described embodiments can be combined with other embodiments.
The invention provides a driving risk monitoring method based on an attention mechanism, which utilizes an improved lightweight neural network and offers a new idea for further improving the real-time performance of network identification. Details are provided below:
an embodiment of the present invention provides an application system of a driving risk monitoring method based on an attention mechanism, and fig. 1 is a scene schematic diagram of an embodiment of an application system of a driving risk monitoring method based on an attention mechanism provided by the present invention, where the system may include a server 100, and a driving risk monitoring device based on an attention mechanism, such as the server in fig. 1, is integrated in the server 100.
The server 100 in the embodiment of the present invention is mainly used for:
acquiring a plurality of driving behavior sample images containing marking information to form a driving picture data set, wherein the marking information is corresponding actual driving behavior classification;
inputting the driving picture data set into a feature extractor of a neural network for feature extraction, then sequentially passing the extracted features through a channel attention layer and a spatial attention layer of the neural network to form a feature matrix, inputting the feature matrix into a top-level classifier of the neural network, and outputting a predicted driving behavior classification;
and training the neural network to convergence according to the actual driving behavior classification and the predicted driving behavior classification, and determining the fully trained neural network.
In this embodiment of the present invention, the server 100 may be an independent server, or may be a server network or a server cluster composed of servers, for example, the server 100 described in this embodiment of the present invention includes, but is not limited to, a computer, a network host, a single network server, a plurality of network server sets, or a cloud server composed of a plurality of servers. Among them, the Cloud server is constituted by a large number of computers or web servers based on Cloud Computing (Cloud Computing).
It is to be understood that the terminal 200 used in the embodiments of the present invention may be a device that includes both receiving and transmitting hardware, i.e., a device having receiving and transmitting hardware capable of performing two-way communication over a two-way communication link. Such a device may include: a cellular or other communication device having a single line display or a multi-line display or a cellular or other communication device without a multi-line display. The specific terminal 200 may be a desktop, a laptop, a web server, a Personal Digital Assistant (PDA), a mobile phone, a tablet computer, a wireless terminal device, a communication device, an embedded device, and the like, and the type of the terminal 200 is not limited in this embodiment.
Those skilled in the art can understand that the application environment shown in fig. 1 is only one application scenario of the present invention, and does not constitute a limitation on the application scenario of the present invention, and that other application environments may further include more or fewer terminals than those shown in fig. 1, for example, only 2 terminals are shown in fig. 1, and it can be understood that the application system of the driving risk monitoring method based on the attention mechanism may further include one or more other terminals, which is not limited herein.
In addition, as shown in fig. 1, the application system of the driving risk monitoring method based on the attention mechanism may further include a memory 200 for storing data, such as various driving behavior sample images, actual driving behavior classification, predicted driving behavior classification, and a well-trained neural network.
It should be noted that the scene schematic diagram of the application system of the driving risk monitoring method based on the attention mechanism shown in fig. 1 is only an example, the application system and the scene of the driving risk monitoring method based on the attention mechanism described in the embodiment of the present invention are for more clearly illustrating the technical solution of the embodiment of the present invention, and do not form a limitation on the technical solution provided in the embodiment of the present invention, and it can be known by those skilled in the art that the technical solution provided in the embodiment of the present invention is also applicable to similar technical problems with the evolution of the application system of the driving risk monitoring method based on the attention mechanism and the appearance of a new business scene.
An embodiment of the present invention provides a driving risk monitoring method based on an attention mechanism, and referring to fig. 2, fig. 2 is a schematic flow chart of an embodiment of the driving risk monitoring method based on the attention mechanism provided in the present invention, and includes steps S1 to S3, where:
in step S1, acquiring a plurality of driving behavior sample images including label information to form a driving picture data set, wherein the label information is a corresponding actual driving behavior classification;
in step S2, inputting the driving picture data set into a feature extractor of a neural network for feature extraction, then sequentially passing the extracted features through a channel attention layer and a spatial attention layer of the neural network to form a feature matrix, inputting the feature matrix into a top-level classifier of the neural network, and outputting a predicted driving behavior classification;
in step S3, the neural network is trained to convergence according to the actual driving behavior classification and the predicted driving behavior classification, and a fully trained neural network is determined.
In the embodiment of the invention, firstly, a plurality of driving behavior sample images are effectively acquired to form a corresponding training sample set, namely the driving picture data set; then a feature extractor performs preliminary feature extraction, and features are further extracted from the two dimensions of channel and space so as to effectively attend to local feature information in the picture, namely the key information about the driver, which is then sent to the top-level classifier for image classification, finally judging which distraction behavior the driver in the input picture exhibits.
It should be noted that, referring to fig. 3, fig. 3 is a schematic structural diagram of an embodiment of the neural network provided by the present invention. It can be seen that the structure of the overall network model according to the embodiment of the present invention sequentially comprises an input layer, an image enhancement layer, a preprocessing layer, a feature extraction layer (EfficientNet network, BN layer, and attention layer), and a classification layer (global average pooling layer, Dropout layer, Dense layer). The attention layer includes a channel attention layer and a spatial attention layer.
In the feature extraction layer, the EfficientNet and the attention layer are connected by the BN layer as a preferred embodiment. In the embodiment of the invention, the BN layer is added, so that overfitting of the neural network is prevented, and the accuracy of network training is ensured.
As a preferred embodiment, referring to fig. 4, fig. 4 is a schematic flowchart of an embodiment of step S1 in fig. 2 provided by the present invention, and step S1 specifically includes steps S11 to S13, where:
in step S11, a plurality of the driving behavior sample images containing the annotation information are acquired;
in step S12, performing data enhancement processing on the plurality of driving behavior sample images to form corresponding data enhanced images, where the data enhancement processing includes at least one of random inversion, random rotation, and random cropping;
in step S13, the plurality of data-enhanced images are preprocessed to form the driving picture data set.
In the embodiment of the invention, data enhancement and data preprocessing are sequentially carried out, so that the overfitting degree of the model is reduced, and the generalization capability of the model is enhanced.
In a specific embodiment of the present invention, two distracted-driving picture data sets are selected, namely the source data set State Farm Distracted Driver Dataset and the AUC Distracted Driver Dataset; both data sets divide the driving behavior of the driver into 10 classes. The pictures are sent to the image enhancement layer, that is, the picture data set undergoes random rotation, random flipping, random cropping, random image-contrast adjustment and random translation. The image enhancement layer reduces the overfitting degree of the model and enhances its generalization capability;
the classification label provided by the invention specifically comprises the following components:
number 0: normal driving, Drive safe;
number 1: sending short message, Text right, by right hand;
number 2: right-hand call, Talk right;
number 3: sending short message, Text left, by left hand;
number 4: making a call with the left hand, Talk left;
number 5: controlling a vehicle-mounted radio, Adjust radio;
number 6: drinking, Drink;
number 7: get back to take the article, Reach before;
number 8: make-up or style Hair, Hair & make up;
number 9: talking on their side to the passenger, Talk passpasser.
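For reference, the ten classes above as a Python mapping, with a hypothetical helper (not part of the patent) that treats every class except 0 as a distraction:

```python
# Index -> short English label, following the list above.
CLASS_LABELS = {
    0: "Drive safe",
    1: "Text right",
    2: "Talk right",
    3: "Text left",
    4: "Talk left",
    5: "Adjust radio",
    6: "Drink",
    7: "Reach behind",
    8: "Hair & makeup",
    9: "Talk passenger",
}

def is_distracted(class_id: int) -> bool:
    """Hypothetical rule: everything except class 0 is a distraction."""
    return class_id != 0

print(is_distracted(0), is_distracted(6))   # False True
```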
As a preferred embodiment, the feature extractor adopts an EfficientNet network with the fully connected layer at the top of the model removed, and uses depthwise separable convolution modules, an SE attention mechanism and the swish activation function.
In the embodiment of the invention, a new model scaling method is adopted to scale up the network in multiple dimensions while attending to local features with a small number of parameters, thereby improving precision and accelerating model convergence.
In a specific embodiment of the present invention, referring to fig. 5, fig. 5 is a schematic diagram comparing the accuracy and parameter counts of different network models provided by the present invention. A new model scaling method is adopted that uses a simple and efficient compound coefficient to scale up the network in the three dimensions of depth, width and resolution, rather than scaling the network's dimensions arbitrarily as conventional methods do; an optimal set of parameters (compound coefficients) can be obtained with neural architecture search (NAS), of which Google's EfficientNet is the typical representative, and an optimal network architecture is obtained by this multi-dimensional mixed model scaling method. As shown in FIG. 5, across the EfficientNet B0-B7 series, the pre-trained models are more favorable in both parameter count and precision than other networks. The ImageNet pre-trained EfficientNet model is used as the feature extraction layer of the model, with the fully connected layer at the top, which serves classification in a specific task, removed. The internal structure of EfficientNet makes heavy use of depthwise separable convolution modules and the SE attention mechanism, and adopts the swish activation function. Compared with a conventional network, the depthwise separable convolution modules greatly reduce the parameter count and computation, while the SE module attends to local features with few parameters, improving precision and accelerating model convergence. The swish activation function also achieves higher precision than the traditional ReLU.
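The compound scaling rule can be made concrete with a short computation. The base coefficients alpha=1.2, beta=1.1, gamma=1.15 are the values reported for EfficientNet scaling, stated here as an assumption since the patent does not list them:

```python
# Compound scaling: depth, width and resolution grow together as
# alpha**phi, beta**phi, gamma**phi for a compound coefficient phi,
# subject to alpha * beta**2 * gamma**2 being roughly constant.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15   # assumed EfficientNet base values

def scale(phi):
    return {
        "depth":      ALPHA ** phi,
        "width":      BETA ** phi,
        "resolution": GAMMA ** phi,
    }

for phi in (0, 1, 2):
    s = scale(phi)
    print(f"phi={phi}: depth x{s['depth']:.2f}, "
          f"width x{s['width']:.2f}, resolution x{s['resolution']:.2f}")
```

A single phi thus scales all three dimensions in balance instead of tuning each independently.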
As a preferred embodiment, referring to fig. 6, fig. 6 is a schematic flow chart of an embodiment of determining a feature matrix in step S2 in fig. 2 provided by the present invention, where step S2 specifically includes steps S21 to S24, where:
in step S21, inputting the mixed features extracted by the feature extractor into the channel attention layer, and determining a channel attention weight;
in step S22, multiplying the channel attention weight by the mixed feature to determine a channel weighted feature;
in step S23, inputting the channel weighting feature into the spatial attention layer, and determining a spatial attention weight;
in step S24, the channel weighted feature and the spatial attention weight are multiplied to determine the final feature matrix.
In the embodiment of the invention, the key information in the mixed feature map is extracted from two dimensions, so that the local information in the picture can be effectively concerned, and a better identification effect can be obtained.
It should be noted that, in the field of image recognition, and driver distraction in particular, a conventional convolutional network can only extract features of the in-cabin scene captured by the camera as a whole, and is often susceptible to interference from information in the picture that is irrelevant to the driver. Sending the convolutionally extracted feature information to the attention layer makes it possible to effectively focus on local information in the picture, namely the fine movements of the driver, such as the hand moving when making a call or the mouth moving when speaking. Peripheral background information, such as heavy exposure through the glass at the driver's position or ornaments on the center console, can be ignored;
the features extracted by the EfficientNet are abstract features from a shallow layer to a deep layer, and only the abstract features need to be sent to an attention layer to focus on the abstract features. The Convolutional Attention Module (CBAM) is a kind of Attention Module combining space (spatial) and channel (channel). Better results can be achieved compared to the attention mechanism of senet focusing only on channels.
In a specific embodiment of the present invention, referring to fig. 7, fig. 7 is a schematic structural diagram of an embodiment of the attention module provided by the present invention. The input feature map is first sent to the channel attention module of the CBAM to obtain a channel attention weight, which is multiplied by the original input feature to obtain a channel-weighted result; the spatial attention module then computes a spatial attention weight from this result, and a final weighting yields the output.
As a preferred embodiment, referring to fig. 8, fig. 8 is a schematic flowchart of an embodiment of step S21 in fig. 6 provided by the present invention, and step S21 specifically includes steps S211 to S213, where:
in step S211, inputting the mixed features into a channel global maximum pooling layer and a channel global average pooling layer in the channel attention layer, respectively, and determining corresponding channel maximum features and channel average features;
in step S212, inputting the channel maximum feature and the channel average feature into a fully connected layer in the channel attention layer;
in step S213, the output features of the fully-connected layers in the channel attention layer are correspondingly added element by element, and then the channel attention weight is determined through an activation function.
In the embodiment of the invention, based on the channel attention mechanism, the channel attention weight matrix is multiplied by the originally input feature map to generate the final channel-weighted feature matrix.
In a specific embodiment of the present invention, referring to fig. 9, fig. 9 is a schematic structural diagram of an embodiment of the channel attention layer provided by the present invention. Global max pooling and global average pooling are first applied to the input feature map over its width and height; the pooled features pass through a shared fully connected layer, the outputs of the fully connected layer are added element by element, and a sigmoid activation function generates the final channel attention weight matrix, expressed by the following formula:
M_c(F) = σ(MLP(AvgPool(F)) + MLP(MaxPool(F))) = σ(W_1(ReLU(W_0(F_avg^c))) + W_1(ReLU(W_0(F_max^c))))
wherein σ is the sigmoid activation operation, r represents the reduction ratio of the hidden units of the shared fully connected layer, and W_0 is followed by ReLU activation.
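As an illustrative, non-authoritative sketch, the channel attention computation described by the formula above can be written in NumPy; the array shapes, the reduction ratio implied by the shapes of `w0` and `w1`, and the function names are assumptions for demonstration, not values taken from the patent:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feat, w0, w1):
    """feat: (H, W, C) mixed feature map; w0: (C, C//r) and w1: (C//r, C)
    are the shared fully connected weights, r being the reduction ratio."""
    f_max = feat.max(axis=(0, 1))    # channel global max pooling -> (C,)
    f_avg = feat.mean(axis=(0, 1))   # channel global average pooling -> (C,)
    # shared two-layer MLP; w0 is followed by ReLU activation
    mlp = lambda v: np.maximum(v @ w0, 0.0) @ w1
    # element-wise sum of the two branches, then sigmoid -> one weight per channel
    return sigmoid(mlp(f_max) + mlp(f_avg))

# step S22: multiply the weights back onto the input feature map
# channel_weighted = feat * channel_attention(feat, w0, w1)
```

The returned vector has one weight in (0, 1) per channel, which broadcasts over the spatial dimensions when multiplied with the input map.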
As a preferred embodiment, adding the attention module after the convolutional layers of the model yields a better classification effect and faster convergence. However, to avoid an excessive number of model parameters and to preserve the model's speed, the attention module is added only at the top layer of the model.
As a preferred embodiment, referring to fig. 10, fig. 10 is a schematic flowchart of an embodiment of step S23 in fig. 6 provided by the present invention, and step S23 specifically includes steps S231 to S233, where:
in step S231, the channel weighting features are respectively input into a spatial global maximum pooling layer and a spatial global average pooling layer in the spatial attention layer, and corresponding spatial maximum features and spatial average features are determined;
in step S232, performing a splicing operation on the spatial maximum feature and the spatial average feature based on a channel dimension, and determining a spliced feature map;
in step S233, the spliced feature map is subjected to convolution operation, and after a feature map with a single channel dimension is formed through dimensionality reduction, the spatial attention weight is generated through an activation function.
In the embodiment of the invention, the channel attention weight matrix is multiplied by the originally input feature map to obtain the channel attention feature map, which is then sent to the spatial attention module; local information is further extracted spatially, improving recognition efficiency.
In an embodiment of the invention, referring to fig. 11, fig. 11 is a schematic structural diagram of an embodiment of the spatial attention layer provided by the invention. Global max pooling and global average pooling are first applied to the feature map along the channel dimension, and the two resulting feature maps are spliced along the channel dimension to obtain a spliced feature map. A convolution operation then reduces this to a feature map with a single channel dimension; the convolution kernel is set to 7 × 7, with 'same' padding so that the output keeps the size of the original feature map. Finally, a sigmoid activation function generates the spatial attention weight matrix, which is multiplied by the channel attention feature map fed into the spatial attention module to obtain the finally generated feature matrix, expressed by the following formula:
M_s(F) = σ(f^(7×7)([AvgPool(F); MaxPool(F)])) = σ(f^(7×7)([F_avg^s; F_max^s]))
where σ is the sigmoid operation and f^(7×7) denotes a convolution with a 7 × 7 kernel; a 7 × 7 kernel works better here than a 3 × 3 kernel.
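The spatial attention computation of the formula above can be sketched in NumPy as follows; the naive convolution loop and the random kernel stand in for the learned 7 × 7 convolution, and all shapes and names are illustrative assumptions rather than values from the patent:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spatial_attention(feat, kernel):
    """feat: (H, W, C) channel-weighted feature map;
    kernel: (7, 7, 2) weights of the single-output-channel convolution."""
    f_max = feat.max(axis=2)                    # channel-wise max pooling -> (H, W)
    f_avg = feat.mean(axis=2)                   # channel-wise average pooling -> (H, W)
    stacked = np.stack([f_max, f_avg], axis=2)  # splice on the channel dimension
    k = kernel.shape[0] // 2
    padded = np.pad(stacked, ((k, k), (k, k), (0, 0)))  # 'same' padding keeps H x W
    H, W = f_max.shape
    out = np.empty((H, W))
    for i in range(H):                          # naive 7x7 convolution, one output channel
        for j in range(W):
            window = padded[i:i + kernel.shape[0], j:j + kernel.shape[1]]
            out[i, j] = np.sum(window * kernel)
    return sigmoid(out)                         # (H, W) spatial weight map

# step S24: final_feature = feat * spatial_attention(feat, kernel)[..., None]
```

The output is one weight per spatial location, broadcast over channels when producing the final feature matrix.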
In a preferred embodiment, the top-level classifier comprises, in sequence, a global average pooling layer and at least one fully connected layer. In the embodiment of the invention, after the global average pooling layer, classification is carried out through at least one fully connected layer. It should be noted that 1 × 1 convolution kernels may be used instead, with one kernel per class (for example, 10 classes require ten 1 × 1 kernels), i.e. global average pooling plus 1 × 1 convolutional classification; alternatively, fully connected layers may be used directly, one or more of them, provided that the number of neurons in the last fully connected layer equals the number of classes.
As a more specific embodiment, the top-level classifier comprises, in sequence, a global average pooling layer, a first Dropout layer, a first Dense layer, a second Dropout layer, and a second Dense layer. In the embodiment of the invention, the top-level classifier is constructed to predict and classify the pictures and finally determine which type of distraction the input picture shows. It should be noted that the feature map produced by the attention module attends to the key information about the driver in the picture; it is sent through the global average pooling, Dropout and Dense layers for image classification, and the type of distraction in the input picture is finally determined.
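At inference time the Dropout layers act as identities, so the top-level classifier reduces to global average pooling followed by a Dense softmax layer. A minimal NumPy sketch, with illustrative shapes and a single Dense layer standing in for the Dense/Dropout stack of the embodiment:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def top_classifier(feat, w_dense, b_dense):
    """feat: (H, W, C) output of the attention module; w_dense: (C, n_classes).
    Dropout is active only during training, so it is omitted from this
    inference-time sketch."""
    pooled = feat.mean(axis=(0, 1))         # global average pooling -> (C,)
    logits = pooled @ w_dense + b_dense     # final Dense layer: one unit per class
    return softmax(logits)                  # probability of each distraction class

# predicted_class = int(np.argmax(top_classifier(feat, w_dense, b_dense)))
```

The last layer has exactly as many units as distraction classes, matching the constraint stated above.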
As a preferred embodiment, referring to fig. 12, fig. 12 is a schematic flowchart of an embodiment of step S3 in fig. 2 provided by the present invention, and step S3 specifically includes steps S31 to S32, where:
in step S31, keeping the model parameters of the feature extractor unchanged, iteratively training the model parameters of the channel attention layer, the spatial attention layer, and the top-level classifier until the network converges, and determining a corresponding first training network;
in step S32, training the model parameters of the feature extractor in the first training network, and training the model parameters of the feature extractor, the channel attention layer, the spatial attention layer, and the top-level classifier iteratively until the network converges, so as to determine the well-trained neural network.
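The freeze-then-unfreeze schedule of steps S31 and S32 can be illustrated with a toy gradient-descent model; the parameter names, learning rates and loss here are stand-ins chosen for demonstration (the patent applies the same idea to the EfficientNet extractor and the attention/classifier layers):

```python
import numpy as np

def train_phase(params, grads_fn, trainable, lr, steps):
    """One training phase: only the parameters named in `trainable` are updated,
    so freezing a layer simply means leaving it out of `trainable`."""
    for _ in range(steps):
        grads = grads_fn(params)
        for name in trainable:
            params[name] -= lr * grads[name]
    return params

# toy "network": prediction = extractor_w * x + head_w, squared-error loss
x = np.array([1.0, 2.0, 3.0])
y = np.array([3.0, 5.0, 7.0])

def grads_fn(p):
    err = p["extractor_w"] * x + p["head_w"] - y
    return {"extractor_w": float(np.mean(2.0 * err * x)),
            "head_w": float(np.mean(2.0 * err))}

params = {"extractor_w": 1.0, "head_w": 0.0}  # extractor_w plays the pretrained weight
# step S31: freeze the feature extractor, train only the top layers
params = train_phase(params, grads_fn, trainable=["head_w"], lr=0.05, steps=200)
# step S32: unfreeze everything and fine-tune with a much smaller learning rate
params = train_phase(params, grads_fn,
                     trainable=["extractor_w", "head_w"], lr=1e-4, steps=200)
```

In a Keras-style implementation the same effect is obtained by setting the base model's `trainable` flag to False for the first phase and back to True, with a reduced learning rate, for the second.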
In the embodiment of the invention, the strategy of first freezing and then unfreezing the base model is adopted, which improves the model training speed while ensuring model accuracy.
In a specific embodiment of the present invention, the model training is divided into the following two steps:
the first step freezes the weights of the base feature-extraction model, EfficientNet, and trains the attention layers and the top-level classifier. The optimizer should not be RMSprop, because RMSprop easily destroys the pretrained weights brought in by transfer learning and amplifies the loss; an Adam optimizer may be used instead. Training with smaller batches is more beneficial to validation-set accuracy, because the regularization effect of smaller batches is more pronounced;
the second step unfreezes the weights of the base EfficientNet model and trains the parameters of the whole network. The learning rate is set small to fine-tune the entire network, typically 0.0001 or even less, e.g., 0.00001.
As a preferred embodiment, during model training the hyper-parameters of the model are optimized with the Hyperband hyper-parameter tuning algorithm. In the embodiment of the invention, Hyperband is used to improve the performance of the whole network so as to meet the requirements of real-time deployment.
It should be noted that the conventional hyper-parameter tuning methods are Grid Search (GS) and Random Search (RS), both of which are generally classed as blind search. The Bayesian Optimization (BO) algorithm can draw on previous experience with hyper-parameters and select the next combination more quickly and efficiently; its disadvantage is that, for high-dimensional, non-convex objectives with unknown smoothness and noise, BO is difficult to fit and optimize, and it usually relies on strong assumptions that are hard to satisfy. Hyperband makes a trade-off among time, computing resources and the number of hyper-parameter combinations: it evaluates as many combinations as possible while also allocating as much budget as possible to each group, so as to find the best hyper-parameters. Therefore, during model training, the Hyperband algorithm is adopted to optimize the model's hyper-parameters; the search space includes the number of neurons in the top-level Dense layers, the learning rate, the training batch size, the attention-layer kernel size, and so on.
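For illustration, the bracket schedule that Hyperband derives from a maximum budget R and a halving rate eta can be sketched as follows; R = 81 and eta = 3 are the customary defaults from the Hyperband literature, not values specified in the patent:

```python
import math

def hyperband_schedule(R, eta):
    """Enumerate Hyperband brackets. Bracket s starts n configurations with an
    initial budget r each; successive halving then keeps roughly the best 1/eta
    of the configurations and multiplies the surviving budget by eta."""
    s_max = 0
    while eta ** (s_max + 1) <= R:      # s_max = floor(log_eta(R)), integer-exact
        s_max += 1
    B = (s_max + 1) * R                 # total budget per bracket
    brackets = []
    for s in range(s_max, -1, -1):
        n = math.ceil((B / R) * (eta ** s) / (s + 1))   # configurations to start
        r = R // (eta ** s)                             # initial budget per config
        rounds = [(n // (eta ** i), r * (eta ** i)) for i in range(s + 1)]
        brackets.append(rounds)
    return brackets
```

With R = 81 and eta = 3 the most aggressive bracket starts 81 hyper-parameter combinations at budget 1 and ends with a single survivor trained at the full budget of 81.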
As a preferred embodiment, the driving risk monitoring method further includes:
and sending the video of the driver captured by the camera into the fully trained neural network frame by frame, performing frame-by-frame analysis, and determining the corresponding predicted driving behavior classification for each frame.
In the embodiment of the invention, the video frames to be tested are input, the network analyzes them frame by frame, and an early warning of the corresponding classification is given automatically.
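A minimal sketch of this frame-by-frame analysis, with the camera stream and the trained model replaced by stand-ins (in a real deployment the frames would come from e.g. OpenCV's VideoCapture and `predict` would be the trained network):

```python
import numpy as np

def classify_frames(frames, predict):
    """Feed each frame of the driver video to the trained network and return
    the predicted driving-behavior class for every frame."""
    labels = []
    for frame in frames:
        probs = predict(frame)                # class probabilities for this frame
        labels.append(int(np.argmax(probs)))  # predicted classification
    return labels

# with a real camera stream, frames could be read e.g. via OpenCV:
#   cap = cv2.VideoCapture("driver.mp4")
#   ok, frame = cap.read()  # repeat until ok is False
```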
In an embodiment of the present invention, referring to fig. 13, fig. 13 is a visualization of an embodiment of driving behaviors provided by the present invention; the attention mechanism lets the neural network focus on the fine movements in the picture. After a picture is fed into the network, the Grad-CAM method is used to visualize the feature map emphasized by the attention mechanism; brighter regions mark the important features of the picture that the network attends to.
An embodiment of the present invention further provides a driving risk monitoring device based on an attention mechanism, and referring to fig. 14, fig. 14 is a schematic structural diagram of an embodiment of the driving risk monitoring device based on the attention mechanism provided in the present invention, and the driving risk monitoring device based on the attention mechanism includes:
an obtaining unit 1401, configured to obtain multiple driving behavior sample images including labeled information to form a driving picture data set, where the labeled information is a corresponding actual driving behavior classification;
the processing unit 1402 is configured to input the driving picture data set to a feature extractor of a neural network for feature extraction, sequentially input the driving picture data set to a channel attention layer and a space attention layer of the neural network, form a feature matrix, input the feature matrix to a top-level classifier of the neural network, and output a predicted driving behavior classification;
a training unit 1403, configured to train the neural network to converge according to the actual driving behavior classification and the predicted driving behavior classification, and determine a neural network with complete training.
For a more specific implementation manner of each unit of the driving risk monitoring device based on the attention mechanism, reference may be made to the description of the driving risk monitoring method based on the attention mechanism, and similar beneficial effects may be obtained, and details are not described herein again.
Embodiments of the present invention further provide a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the driving risk monitoring method based on the attention mechanism as described above.
Generally, computer instructions for carrying out the methods of the present invention may be carried using any combination of one or more computer-readable storage media. Non-transitory computer-readable storage media include any computer-readable medium except a transitory, propagating signal itself.
A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java, Smalltalk and C++, and conventional procedural programming languages such as the "C" programming language; in particular, the Python language, well suited to neural network computing, may be used together with TensorFlow- or PyTorch-based frameworks. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the latter case, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The embodiment of the invention also provides a computing device, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein when the processor executes the program, the driving risk monitoring method based on the attention mechanism is realized.
According to the computer-readable storage medium and the computing device provided by the above embodiments of the present invention, the content specifically described for implementing the driving risk monitoring method based on the attention mechanism according to the present invention can be referred to, and the beneficial effects similar to those of the driving risk monitoring method based on the attention mechanism described above are obtained, and are not described herein again.
The invention discloses a driving risk monitoring method based on an attention mechanism. First, a plurality of driving behavior sample images are obtained to form a corresponding training sample set, namely the driving picture data set. A feature extractor then performs preliminary feature extraction, after which features are extracted along the two dimensions of channel and space, so that local feature information in the picture, i.e. the key information about the driver, is effectively attended to; this information is sent to the top-level classifier for image classification, and the type of distraction behavior of the driver in the input picture is finally determined.
According to the above technical scheme, the improved neural network, which combines the feature extractor with attention layers in two dimensions, can greatly reduce the misjudgment rate, recognize distracted driving behaviors in real time, generalize to different scenes with only a small number of samples, and, most importantly, fully guarantee real-time performance. By balancing the depth, width and input resolution of the network, higher accuracy is obtained while the number of network model parameters is greatly reduced; the smaller parameter count and computation load give the model the capability to process video-frame information in real time and allow practical deployment on hardware. The application of the attention mechanism also gives the model strong robustness and improves its generalization ability, greatly reducing the misjudgment rate compared with a conventional convolutional neural network.
The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention.

Claims (10)

1. A driving risk monitoring method based on an attention mechanism is characterized by comprising the following steps:
acquiring a plurality of driving behavior sample images containing marking information to form a driving picture data set, wherein the marking information is corresponding actual driving behavior classification;
inputting the driving picture data set into a feature extractor of a neural network for feature extraction, sequentially inputting the driving picture data set into a channel attention layer and a space attention layer of the neural network, forming a feature matrix, inputting the feature matrix into a top-layer classifier of the neural network, and outputting a predicted driving behavior classification;
and training the neural network to be convergent according to the actual driving behavior classification and the predicted driving behavior classification, and determining the neural network with complete training.
2. The attention mechanism-based driving risk monitoring method according to claim 1, wherein the obtaining of the plurality of driving behavior sample images containing the annotation information to form the driving picture data set comprises:
obtaining a plurality of driving behavior sample images containing the labeling information;
carrying out data enhancement processing on the multiple driving behavior sample images to form corresponding data enhanced images, wherein the data enhancement processing mode comprises at least one of random overturning, random rotating and random cutting;
and preprocessing the plurality of data enhanced images to form the driving picture data set.
3. The attention mechanism-based driving risk monitoring method according to claim 1, wherein the feature extractor adopts an EfficientNet network with a fully connected layer of a top layer of a model removed, and adopts a depth separable convolution module, an SE attention mechanism and a swish activation function.
4. The attention mechanism-based driving risk monitoring method of claim 1, wherein the sequentially inputting the channel attention layer and the spatial attention layer to the neural network comprises:
inputting the mixed features extracted by the feature extractor into the channel attention layer, and determining a channel attention weight;
multiplying the channel attention weight by the mixed feature to determine a channel weighting feature;
inputting the channel weighting characteristics to the space attention layer, and determining a space attention weight;
and multiplying the channel weighted feature and the space attention weight value to determine the final feature matrix.
5. The attention mechanism-based driving risk monitoring method according to claim 4, wherein the inputting the mixed features extracted by the feature extractor into the channel attention layer and the determining a channel attention weight value comprises:
inputting the mixed features into a channel global maximum pooling layer and a channel global average pooling layer in the channel attention layer respectively, and determining corresponding channel maximum features and channel average features;
inputting the channel maximum features and the channel average features into a fully connected layer in the channel attention layer;
and correspondingly adding the output characteristics of the full connection layers in the channel attention layers element by element, and determining the channel attention weight value through an activation function.
6. The attention mechanism-based driving risk monitoring method of claim 4, wherein the inputting the channel weighting features into the spatial attention layer and the determining a spatial attention weight value comprises:
respectively inputting the channel weighting characteristics into a spatial global maximum pooling layer and a spatial global average pooling layer in the spatial attention layer, and determining corresponding spatial maximum characteristics and spatial average characteristics;
splicing the spatial maximum feature and the spatial average feature based on channel dimensions, and determining a spliced feature map;
and performing convolution operation on the spliced feature graph, reducing the dimension to form a feature graph with a single channel dimension, and then generating the space attention weight value through an activation function.
7. The attention mechanism-based driving risk monitoring method of claim 1, wherein the top level classifier comprises a global average pooling layer and at least one fully connected layer in sequence.
8. The attention mechanism-based driving risk monitoring method of claim 1, wherein the training the neural network to converge and determining a well-trained neural network comprises:
keeping the model parameters of the feature extractor unchanged, iteratively training the model parameters of the channel attention layer, the space attention layer and the top-level classifier until the network converges, and determining a corresponding first training network;
training the model parameters of the feature extractor in the first training network, and iteratively training the model parameters of the feature extractor, the channel attention layer, the spatial attention layer and the top classifier at the same time until the network converges, and determining the neural network with complete training.
9. The driving risk monitoring method based on the attention mechanism as claimed in claim 8, wherein during the model training process, hyper-parameters of the model are optimized by using a hyper band hyper-parameter adjustment optimization algorithm.
10. The attention mechanism-based driving risk monitoring method of claim 1, further comprising:
and sending the video of the driver shot by the camera into the neural network with complete training frame by frame, carrying out frame by frame analysis, and determining the corresponding predicted driving behavior classification of each frame.
CN202111001093.6A 2021-08-30 2021-08-30 Driving risk monitoring method based on attention mechanism Pending CN113780385A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111001093.6A CN113780385A (en) 2021-08-30 2021-08-30 Driving risk monitoring method based on attention mechanism

Publications (1)

Publication Number Publication Date
CN113780385A 2021-12-10

Family

ID=78839848

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111001093.6A Pending CN113780385A (en) 2021-08-30 2021-08-30 Driving risk monitoring method based on attention mechanism

Country Status (1)

Country Link
CN (1) CN113780385A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114241454A (en) * 2021-12-20 2022-03-25 东南大学 Method for recognizing distracted driving by using remapping attention
CN114241456A (en) * 2021-12-20 2022-03-25 东南大学 Safe driving monitoring method using feature adaptive weighting
CN114593919A (en) * 2022-03-28 2022-06-07 青岛理工大学 Novel rolling bearing fault diagnosis method and system
CN115034500A (en) * 2022-06-28 2022-09-09 武汉理工大学 Vehicle speed prediction method and system based on double-attention machine system network and vehicle
CN115082698A (en) * 2022-06-28 2022-09-20 华南理工大学 Distracted driving behavior detection method based on multi-scale attention module
CN117197101A (en) * 2023-09-18 2023-12-08 湖南大学 Defect detection method, computing device and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110059582A (en) * 2019-03-28 2019-07-26 东南大学 Driving behavior recognition methods based on multiple dimensioned attention convolutional neural networks
CN110532878A (en) * 2019-07-26 2019-12-03 中山大学 A kind of driving behavior recognition methods based on lightweight convolutional neural networks
CN111986107A (en) * 2020-08-05 2020-11-24 中国人民解放军战略支援部队信息工程大学 DSM local deletion repair method based on deep learning
CN112200797A (en) * 2020-10-28 2021-01-08 佛山市南海区广工大数控装备协同创新研究院 Effective training method based on PCB noise labeling data
CN112651978A (en) * 2020-12-16 2021-04-13 广州医软智能科技有限公司 Sublingual microcirculation image segmentation method and device, electronic equipment and storage medium



Similar Documents

Publication Publication Date Title
CN113780385A (en) Driving risk monitoring method based on attention mechanism
US11769056B2 (en) Synthetic data for neural network training using vectors
US11017250B2 (en) Vehicle manipulation using convolutional image processing
US10922567B2 (en) Cognitive state based vehicle manipulation using near-infrared image processing
US20200219295A1 (en) Emoji manipulation using machine learning
US11067405B2 (en) Cognitive state vehicle navigation based on image processing
US10204625B2 (en) Audio analysis learning using video data
US10592757B2 (en) Vehicular cognitive data collection using multiple devices
US20200175262A1 (en) Robot navigation for personal assistance
US20210339759A1 (en) Cognitive state vehicle navigation based on image processing and modes
Do et al. Deep neural network-based fusion model for emotion recognition using visual data
WO2020226696A1 (en) System and method of generating a video dataset with varying fatigue levels by transfer learning
US11704574B2 (en) Multimodal machine learning for vehicle manipulation
US12076149B2 (en) Vehicle manipulation with convolutional image processing
US20200302235A1 (en) Convolutional computing using multilayered analysis engine
US11727534B2 (en) Normalizing OCT image data
US11587357B2 (en) Vehicular cognitive data collection with multiple devices
Poon et al. Driver distracted behavior detection technology with YOLO-based deep learning networks
Bekka et al. Distraction detection to predict vehicle crashes: a deep learning approach
CN111723752A (en) Method and device for detecting on-duty driving of driver based on emotion recognition
Sharan et al. Multi-level drowsiness detection using multi-contrast convolutional neural networks and single shot detector
Chen Evaluation technology of classroom students’ learning state based on deep learning
KR20190078710A (en) Image classfication system and mehtod
Abad et al. An innovative approach on driver's drowsiness detection through facial expressions using decision tree algorithms
Devi et al. COOT-Optimized Real-Time Drowsiness Detection using GRU and Enhanced Deep Belief Networks for Advanced Driver Safety.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination