CN110298257B - Driver behavior recognition method based on human body multi-part characteristics - Google Patents

Driver behavior recognition method based on human body multi-part characteristics

Info

Publication number
CN110298257B
Authority
CN
China
Prior art keywords
image
stage
human body
driver
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910483000.4A
Other languages
Chinese (zh)
Other versions
CN110298257A (en)
Inventor
路小波
陆明琦
张德明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University
Priority to CN201910483000.4A
Publication of CN110298257A
Application granted
Publication of CN110298257B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/50 Context or environment of the image
    • G06V 20/59 Context or environment of the image inside of a vehicle, e.g. relating to seat occupancy, driver state or inner lighting conditions
    • G06V 20/597 Recognising the driver's state or behaviour, e.g. attention or drowsiness
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 Road transport of goods or passengers
    • Y02T 10/10 Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a driver behavior recognition method based on human body multi-part features, which comprises the following steps: establishing an image data set for driver behavior recognition; constructing a neural network model; training a driver behavior recognition model based on the human body multi-part features; and testing the driver behavior recognition model based on the human body multi-part features. The invention combines human body keypoint regions with the global region and uses discriminative local regions to recognize driver behavior; the fusion of human body keypoint features further improves recognition accuracy, giving the method important application value in the field of traffic safety.

Description

Driver behavior recognition method based on human body multi-part characteristics
Technical Field
The invention belongs to the field of image processing, relates to a pattern recognition method, and particularly relates to a static driver behavior recognition method based on human body multi-part features.
Background
In recent years, with the rapid development of China's economy, residents' living standards have risen steadily and the automobile has become an indispensable means of transportation in daily life. With the popularization of automobiles, however, road traffic accidents have also become frequent.
The latest statistics published by the public security and transportation authorities show that the annual death toll from road accidents in China still ranks second in the world, with traffic accidents involving private cars and trucks accounting for about 80% of the national total. Traffic safety currently faces challenges such as prominent violations by traffic participants, inadequate management and law enforcement, and insufficient road safety supervision. Unsafe driver behavior is an important cause of traffic accidents; these facts indicate that curbing the high incidence of road traffic accidents requires regulating driver behavior and thereby reducing the risk of traffic accidents.
Unsafe driver behaviors generally fall into the following types. On the one hand, the pace of modern life keeps accelerating and people's lives are closely tied to the mobile phone; drivers frequently pick up a phone while driving to read and send messages, which takes their eyes off the road ahead, their hands off the steering wheel, and their attention away from driving. Once an emergency occurs, it is often difficult for such a driver to react quickly, which can lead to serious traffic accidents. In addition, during long drives some drivers engage in other behaviors with safety hazards, such as smoking, talking with front-seat passengers, or taking their hands off the steering wheel, which greatly raises the accident rate. The road traffic safety regulations explicitly stipulate that behaviors hampering safe driving, such as making and receiving hand-held phone calls while driving a motor vehicle, are prohibited.
Driver behaviors with such safety hazards are difficult for the relevant departments to monitor and manage, and traffic managers cannot provide real-time manual supervision. Stopping unsafe behavior therefore depends largely on drivers' own safety awareness, and there is as yet no effective measure for standardizing driver behavior.
Disclosure of Invention
In order to solve the above problems, the invention provides a driver behavior recognition method based on human body multi-part features. The human body multi-part feature extraction method used can obtain the spatial information of driver behavior in an image, so driver behavior can be recognized in real time. Because different driver behaviors correspond to different actions of local parts of the human body, the method exploits human body multi-part features and combines the human body keypoint localization method Convolutional Pose Machines with the VGG model to recognize driver behavior.
In order to achieve the above purpose, the present invention provides the following technical solutions:
a driver behavior recognition method based on human body multi-part features comprises the following steps:
step 1: creating an image dataset for driver behavior recognition
acquiring sample image data and establishing an image data set, wherein the sample images contain various driver behaviors; dividing the image data set into a training set and a test set, where the drivers in the test sample images are independent of the drivers in the training samples;
step 2: constructing a neural network model
the neural network model consists of a Convolutional Pose Machines model and a convolutional neural network, wherein the fully connected layers used for extracting local-region and global-region features in the convolutional neural network are independent of each other, while the convolutional layers are shared;
step 3: training of a driver behavior recognition model based on human body multi-part features
building the network model, training the network model, and optimizing the network parameters with stochastic gradient descent;
step 4: testing a driver behavior recognition model based on human body multi-part characteristics
given a driver behavior image, normalizing the size of the test image, taking the normalized image as the input of the model, and obtaining the behavior recognition result of the test image through forward propagation.
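For orientation, the four steps can be outlined in Python as follows; this is a sketch with placeholder callables, not the claimed implementation, and every name in it is an assumption:

```python
# Illustrative outline of the four-step flow. The patent specifies CPM
# for the keypoint step and VGG-16 for the feature step; the lambdas
# below are dummy stand-ins for those trained models.
def recognize(image, locate_keypoints, extract_features, classify):
    keypoints = locate_keypoints(image)             # step 2: 5 keypoints (CPM)
    features = extract_features(image, keypoints)   # local + global features
    return classify(features)                       # softmax -> behavior label

label = recognize("frame.jpg",
                  lambda img: [(184, 184)] * 5,
                  lambda img, kps: [0.0] * 4096,
                  lambda feats: 0)
```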
Further, step 2 specifically includes the following steps:
step 201: labeling 5 keypoints in the training samples and test samples, the keypoints comprising: head, right hand, right elbow joint, left hand, and left elbow joint; the training sample and test sample images each include various driver behaviors;
step 202: define $Y_p \in \mathcal{Z}$ as the pixel position of the $p$-th keypoint, where $\mathcal{Z}$ denotes the set of all positions $(u, v)$ in the picture; the goal of human keypoint localization is to predict the positions $Y = (Y_1, \ldots, Y_P)$ of all keypoints in the image; the pose machine is composed of a sequence of multi-class classifiers $g_t(\cdot)$;

at each stage $t \in \{1, \ldots, T\}$ and for each candidate keypoint position $Y_p = z$, $z \in \mathcal{Z}$, the classifier $g_t(\cdot)$ outputs a confidence value based on the features $x_z$ extracted at image location $z$ and on contextual information from the neighborhood of each $Y_p$ output by the classifiers of the previous stage; in particular, for the first stage, i.e. $t = 1$, the confidence values output by classifier $g_1(\cdot)$ are shown in formula (1):

$$g_1(x_z) \rightarrow \{\, b_1^p(Y_p = z) \,\}_{p \in \{0, \ldots, P\}} \quad (1)$$

where $b_1^p(Y_p = z)$ is the score output by classifier $g_1$ for the $p$-th keypoint at image position $z$ in stage 1;

the confidence values of keypoint $p$ at every position $z = (u, v)^{\mathrm{T}}$ in the image are collected into a map $\mathbf{b}_t^p \in \mathbb{R}^{\omega \times h}$, where $\omega$ and $h$ are the width and height of the image respectively; one then obtains formula (2):

$$\mathbf{b}_t^p[u, v] = b_t^p(Y_p = z) \quad (2)$$

the set of confidence maps of all human keypoints is defined as $\mathbf{b}_t \in \mathbb{R}^{\omega \times h \times (P+1)}$, where $P+1$ stands for the $P$ human keypoints plus the background;

in each subsequent stage, for keypoint $p$, the classifier gives a confidence value based on the feature information $x'_z$ of the input image and on the contextual feature information derived from the previous-stage confidence maps of formula (2), namely formula (3):

$$g_t\big(x'_z,\; \psi_t(z, \mathbf{b}_{t-1})\big) \rightarrow \{\, b_t^p(Y_p = z) \,\}_{p \in \{0, \ldots, P\}} \quad (3)$$

where $\psi_{t>1}(\cdot)$ maps the confidence set $\mathbf{b}_{t-1}$ to context features, $x'$ denotes the image features used after the first stage, and in the original pose machine architecture $x = x'$;
step 203: adopting a CPM half-body network model, wherein the model comprises the following four stages:
the first stage consists of a basic convolutional neural network comprising a white convolution module of 7 convolutional layers and 3 pooling layers; the convolution module is fully convolutional, and its convolutional layers do not change the image size; after three pooling operations on the input image, a response map is finally output for each keypoint, P+1 in total;
the second stage, via a serial structure, takes as input the fusion of the convolution result of the original picture, the response maps output by the first stage, and the center constraint generated by a Gaussian template; the final output is again P+1 response maps;
the third stage and the fourth stage replace the convolution result of the original picture with the staged convolution result of the second stage, and other parts remain unchanged;
step 204: carrying out loss calculation and error propagation on the output of each stage, computing a response map for each keypoint at each scale, and accumulating the response maps as the final total response map;
step 205: after the positions of the keypoints are obtained, drawing corresponding rectangular regions according to those positions;
step 206: performing forward computation on the global image through VGG-16 to obtain the corresponding feature vector, and mapping the keypoint regions onto the feature map using an RoI Pooling layer;
step 207: cascading the 5 keypoint feature vectors and the global feature vector as the final feature vector, classifying it with softmax, and outputting the corresponding driver behavior.
Further, the subsequent stages of step 202 use image features x' different from those of the first stage, and the classifier $g_t(\cdot)$ uses a random forest algorithm.
Further, in step 203, when the whole body keypoints are to be detected, the third stage structure is repeated.
Further, step 3 specifically comprises Convolutional Pose Machines model training and VGG-16 model training;
in the training of the Convolutional Pose Machines model, let the ground-truth response map of a keypoint $p$ be $b_*^p(Y_p = z)$ and the response map output by the model be $b_t^p(Y_p = z)$; the Loss function of each stage is then:

$$f_t = \sum_{p=1}^{P+1} \sum_{z \in \mathcal{Z}} \left\| b_t^p(z) - b_*^p(z) \right\|_2^2$$

and the total Loss over the four stages is:

$$F = \sum_{t=1}^{4} f_t$$

the training of VGG-16 reduces the loss of the softmax layer based on the behavior labels of the samples; if $P(\alpha \mid I, r)$ is the probability given by softmax that the driver behavior belongs to class $\alpha$, then for one batch of training samples the loss function is:

$$L = -\frac{1}{M} \sum_{i=1}^{M} \log P(l_i \mid I_i, r)$$

where $l_i$ is the correct behavior label of image $I_i$ and $M$ is the batch size.
Further, in step 3 the VGG-16 training initializes its parameters with a model trained on an ImageNet-1K data set.
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. The invention combines human body keypoint regions with the global region and uses discriminative local regions for driver behavior recognition; the fusion of human body keypoint features further improves recognition accuracy, giving the method important application value in the field of traffic safety.
2. The invention uses the Convolutional Pose Machines model to extract information from multiple human body keypoint regions, which significantly improves the recognition accuracy of the model.
Drawings
FIG. 1 is a flow chart of the present invention.
FIG. 2 shows examples of different driver behaviors in the present invention.
FIG. 3 is a schematic diagram of the driver behavior recognition model framework based on human body multi-part features in the present invention.
FIG. 4 is a diagram of the pose machine architecture in the present invention.
FIG. 5 is a schematic view of the CPM model structure in the present invention.
FIG. 6 is a schematic diagram of the convolution module structure in the present invention.
FIG. 7 is a schematic diagram of relay supervision in the present invention.
Detailed Description
The technical scheme provided by the present invention will be described in detail with reference to the following specific examples, and it should be understood that the following specific examples are only for illustrating the present invention and are not intended to limit the scope of the present invention.
In the invention, a serialized pose machine architecture is adopted to detect the positions of five keypoints (head, left elbow joint, right elbow joint, left wrist and right wrist), and local regions are drawn around them. Convolution is then performed on the global image through the VGG-16 model to obtain the corresponding feature vectors, and the keypoint regions are mapped onto the feature map by means of the RoI Pooling layer. Finally, the 5 local feature vectors and the global feature vector are cascaded, classified by softmax, and the corresponding driver behavior is output.
Specifically, the method for identifying the driver behavior based on the human body multi-part characteristics provided by the invention is shown in a figure 1, and comprises the following steps:
step 1: an image dataset of driver behavior recognition is established.
The sample data come from two parts. One part is from the driver behavior data set provided by the Kaggle platform, with picture size 640 × 480 and 25000 images in total, such as the non-Chinese driver images in FIG. 2. The other part is a self-built driver behavior database, recorded with an in-vehicle camera (model Logitech C920) under different angles and lighting conditions. The captured size is 1320 × 946, which is cropped to 640 × 480 to unify the data, such as the Chinese driver images in FIG. 2, about 5000 images in total; the sample numbers of the 10 behaviors are essentially equal. The behaviors are: normal driving, left-hand phone call, right-hand phone call, left-hand texting, right-hand texting, left-hand smoking, right-hand smoking, drinking water, talking with the front-seat passenger, and both hands off the steering wheel.
The collected picture data set is divided into a training set and a test set, with 29000 training pictures and 1000 test pictures. The original pictures are downsampled to 368 × 368, and the behavior labels of the samples are represented by 0 to 9. For accuracy, the test samples cover all 10 driver behaviors, 100 each, and the drivers in the test sample pictures are independent of the drivers in the training samples.
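The preprocessing described above can be sketched as follows in Python; the folder layout and English class names are assumptions, while the 368 × 368 size and the 0-9 label encoding come from the text:

```python
from pathlib import Path
from PIL import Image

# Labels 0..9 as described above; the English class names are assumed.
BEHAVIORS = ["normal_driving", "call_left", "call_right", "text_left",
             "text_right", "smoke_left", "smoke_right", "drink",
             "talk_passenger", "hands_off_wheel"]

def load_sample(path: Path):
    # Downsample every picture to 368 x 368 and read the label from a
    # hypothetical class-named parent folder.
    img = Image.open(path).convert("RGB").resize((368, 368))
    label = BEHAVIORS.index(path.parent.name)
    return img, label
```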
Step 2: and constructing a neural network model.
The model designed in this step consists of a Convolutional Pose Machines model and a convolutional neural network; its structural diagram is shown in FIG. 3. The convolutional neural network module adopts the VGG-16 model; the fully connected layers for local-region and global-region feature extraction are independent of each other, while the convolutional layers are shared. To increase the processing speed of the network model, an RoI Pooling network layer is introduced. The specific construction process is as follows:
step 201: because of the lack of a public driver keypoint labeling dataset, about 10000 samples were manually labeled, about 1000 per driver behavior. The test samples were 600 additional sheets, 100 sheets for each run. As shown in fig. 3, a total of 5 key points are labeled, namely, head, right hand, right elbow joint, left hand and left elbow joint.
Step 202: define $Y_p \in \mathcal{Z}$ as the pixel position of the $p$-th keypoint, where $\mathcal{Z}$ denotes the set of all positions $(u, v)$ in the picture. The goal of human keypoint localization is to predict the positions $Y = (Y_1, \ldots, Y_P)$ of all keypoints in the image. The pose machine is composed of a sequence of multi-class classifiers $g_t(\cdot)$, as shown in FIG. 4.

At each stage $t \in \{1, \ldots, T\}$ and for each candidate keypoint position $Y_p = z$, $z \in \mathcal{Z}$, the classifier $g_t(\cdot)$ outputs a confidence value based on the features $x_z$ extracted at image location $z$ and on contextual information from the neighborhood of each $Y_p$ output by the classifiers of the previous stage. In particular, for the first stage, i.e. $t = 1$, the confidence values output by classifier $g_1(\cdot)$ are shown in formula 1:

$$g_1(x_z) \rightarrow \{\, b_1^p(Y_p = z) \,\}_{p \in \{0, \ldots, P\}} \quad (1)$$

where $b_1^p(Y_p = z)$ is the score output by classifier $g_1$ for the $p$-th keypoint at image position $z$ in stage 1. The confidence values of keypoint $p$ at every position $z = (u, v)^{\mathrm{T}}$ in the image are collected into a map $\mathbf{b}_t^p \in \mathbb{R}^{\omega \times h}$, where $\omega$ and $h$ are the width and height of the image respectively, so that

$$\mathbf{b}_t^p[u, v] = b_t^p(Y_p = z) \quad (2)$$

For ease of representation, the set of confidence maps of all human keypoints is defined as $\mathbf{b}_t \in \mathbb{R}^{\omega \times h \times (P+1)}$, where $P+1$ stands for the $P$ human keypoints plus the background.

In each subsequent stage, for keypoint $p$, the classifier gives a confidence value based on the feature information $x'_z$ of the input image and on the contextual feature information derived from the previous-stage confidence maps of formula 2, as in formula 3:

$$g_t\big(x'_z,\; \psi_t(z, \mathbf{b}_{t-1})\big) \rightarrow \{\, b_t^p(Y_p = z) \,\}_{p \in \{0, \ldots, P\}} \quad (3)$$

where $\psi_{t>1}(\cdot)$ maps the confidence set $\mathbf{b}_{t-1}$ to context features. As $t$ increases, the confidence values output by the classifier become increasingly accurate for each keypoint. In addition, the image features $x'$ used in the subsequent stages may differ from those of the first stage; in the original pose machine architecture, $x = x'$, and the classifier $g_t(\cdot)$ is a random forest algorithm.
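For illustration, the staged prediction of formulas (1) to (3) can be sketched in Python with stand-in classifiers; the random functions below are placeholders for trained models, not the patent's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
P, w, h = 5, 46, 46                  # 5 keypoints; w x h confidence maps

def g1(x):
    # Stage-1 classifier (formula (1)): image features -> P+1 maps
    # (P keypoints + background). Placeholder for a trained model.
    return rng.random((P + 1, h, w))

def psi(b_prev):
    # psi_{t>1}: context features computed from the previous stage's
    # confidence maps (here simply the maps themselves).
    return b_prev

def g_t(x_prime, context):
    # Later-stage classifier (formula (3)): refines beliefs using both
    # image features and context. Placeholder computation.
    return 0.5 * context + 0.5 * rng.random((P + 1, h, w))

x = rng.random((128, h, w))          # placeholder image features
b = g1(x)                            # stage 1
for t in range(2, 5):                # stages 2..4
    b = g_t(x, psi(b))
# Final keypoint estimates: the argmax location of each confidence map.
keypoints = [np.unravel_index(b[p].argmax(), (h, w)) for p in range(P)]
```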
Step 203: since the keypoints required for driver behavior recognition are concentrated on the driver's upper body, the invention adopts a half-body network model; the CPM half-body model is divided into four stages, as shown in FIG. 5.
The first stage of the CPM model consists of a basic convolutional neural network, i.e., the white convolution module shown in FIG. 6, composed of 7 convolutional layers and 3 pooling layers. The convolution module is fully convolutional, and its convolutional layers do not change the image size. The input image size is 368 × 368; after the three pooling operations, a response map is finally output for each keypoint, P+1 maps in total, each of size 46 × 46.
The second stage of the model, via a serial structure, takes as input the fusion of the convolution result of the original picture, the response maps output by the first stage, and the center constraint generated by a Gaussian template; the final output is again P+1 response maps, each of size 46 × 46. The center constraint generated by the Gaussian template is the center map module in FIG. 5, whose function is to focus the response mainly on the center of the image.
The third and fourth stages replace the convolution result of the original picture with the staged convolution result of the second stage, while the other parts remain unchanged. To design a more complex network structure, for example when whole-body keypoints need to be detected, the third-stage structure simply needs to be repeated.
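The patent builds the model in Caffe; purely as an illustration, the four-stage half-body structure described above can be sketched in PyTorch as follows. Channel widths and kernel sizes are assumptions, and this sketch simplifies stages 3 and 4 by reusing the shared stage-1 trunk features rather than the stage-2 convolution result:

```python
import torch
import torch.nn as nn

P = 5  # keypoints; each stage outputs P + 1 response maps

class Stage1(nn.Module):
    """Basic CNN of 7 conv and 3 pooling layers: 368x368 -> 46x46 maps."""
    def __init__(self):
        super().__init__()
        self.trunk = nn.Sequential(  # shared image features, stride 8
            nn.Conv2d(3, 64, 9, padding=4), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 9, padding=4), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(128, 128, 9, padding=4), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(128, 256, 5, padding=2), nn.ReLU(),
        )
        self.head = nn.Sequential(   # predicts the P+1 response maps
            nn.Conv2d(256, 256, 9, padding=4), nn.ReLU(),
            nn.Conv2d(256, 256, 1), nn.ReLU(),
            nn.Conv2d(256, P + 1, 1),
        )
    def forward(self, img):
        feats = self.trunk(img)
        return feats, self.head(feats)

class LaterStage(nn.Module):
    """Fuses image features, previous maps and a Gaussian center map."""
    def __init__(self, feat_ch=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(feat_ch + (P + 1) + 1, 128, 11, padding=5), nn.ReLU(),
            nn.Conv2d(128, 128, 11, padding=5), nn.ReLU(),
            nn.Conv2d(128, P + 1, 1),
        )
    def forward(self, feats, prev_maps, center):
        return self.net(torch.cat([feats, prev_maps, center], dim=1))

stage1 = Stage1()
later = nn.ModuleList([LaterStage() for _ in range(3)])  # stages 2-4
img = torch.rand(1, 3, 368, 368)
center = torch.rand(1, 1, 46, 46)    # Gaussian center constraint
feats, maps = stage1(img)
outputs = [maps]                     # keep every stage's output
for stage in later:
    maps = stage(feats, maps, center)
    outputs.append(maps)             # needed for relay supervision below
```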
Step 204: to alleviate the vanishing gradient problem, the invention introduces a relay supervision mechanism, i.e., loss calculation and error propagation are performed on the output of each stage, as shown in FIG. 7. In addition, the invention performs multi-scale data augmentation: at each scale, a response map is computed for each keypoint, and these maps are accumulated into the final total response map.
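A minimal sketch of the relay supervision loss, assuming a summed squared-error between each stage's response maps and the ground-truth maps; the per-stage outputs and targets below are random stand-ins for the `outputs` list produced by the sketch above:

```python
import torch
import torch.nn.functional as F

def relay_supervision_loss(stage_outputs, target):
    # Compute the response-map loss at every stage's output and sum,
    # so gradients reach the early stages directly and vanishing
    # gradients are mitigated.
    return sum(F.mse_loss(o, target, reduction="sum") for o in stage_outputs)

target = torch.rand(1, 6, 46, 46)    # P + 1 = 6 ground-truth maps
loss = relay_supervision_loss([torch.rand(1, 6, 46, 46) for _ in range(4)],
                              target)
```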
Step 205: after the positions of the keypoints are obtained, the corresponding rectangular regions are drawn according to those positions. The keypoint regions extracted in step 2 are denoted $r_{\mathrm{head}}$, $r_{\mathrm{left\text{-}hand}}$, $r_{\mathrm{right\text{-}hand}}$, $r_{\mathrm{left\text{-}elbow}}$, $r_{\mathrm{right\text{-}elbow}}$.
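The patent does not specify the dimensions of these rectangular regions; a hypothetical fixed-size square around each detected keypoint could be drawn as follows (the 64-pixel size is an assumed value):

```python
def keypoint_box(u, v, size=64, img_w=368, img_h=368):
    # Hypothetical fixed-size square region centred on keypoint (u, v),
    # clipped to the image borders; 'size' is an assumed value.
    half = size // 2
    x1, y1 = max(0, u - half), max(0, v - half)
    x2, y2 = min(img_w, u + half), min(img_h, v + half)
    return (x1, y1, x2, y2)

r_head = keypoint_box(180, 90)   # e.g. region around the head keypoint
```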
Step 206: forward computation is performed on the global image through VGG-16 to obtain the corresponding feature vector, and the keypoint regions are mapped onto the feature map using the RoI Pooling layer.
Step 207: the 5 keypoint feature vectors and the global feature vector are cascaded as the final feature vector, classified using softmax, and the corresponding driver behavior is output.
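Steps 206-207 can be sketched with torchvision's RoI pooling as follows; the 7 × 7 pooled size, the 4096-dimensional fully connected layers, the stride-16 feature map, and the example boxes are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torchvision
from torchvision.ops import roi_pool

# VGG-16 conv layers without the final pool: stride-16 feature map.
# The patent initializes VGG-16 from ImageNet-1K pretraining (step 3).
backbone = torchvision.models.vgg16(weights=None).features[:-1]
img = torch.rand(1, 3, 368, 368)
fmap = backbone(img)                          # (1, 512, 23, 23)

global_box = torch.tensor([[0.0, 0.0, 368.0, 368.0]])
kp_boxes = torch.tensor([[148., 58., 212., 122.]]).repeat(5, 1)  # 5 regions
boxes = torch.cat([global_box, kp_boxes])     # (x1, y1, x2, y2) rows

pooled = roi_pool(fmap, [boxes], output_size=(7, 7), spatial_scale=1 / 16)

# Independent fully connected layers for global and local features,
# sharing the convolutional backbone, as described above.
fc_global = nn.Sequential(nn.Flatten(), nn.Linear(512 * 7 * 7, 4096), nn.ReLU())
fc_local = nn.Sequential(nn.Flatten(), nn.Linear(512 * 7 * 7, 4096), nn.ReLU())

g = fc_global(pooled[:1]).flatten()           # global feature vector
l = fc_local(pooled[1:]).flatten()            # 5 keypoint feature vectors
final = torch.cat([g, l])                     # cascade: 6 x 4096 dims
logits = nn.Linear(final.numel(), 10)(final)  # 10 driver behaviors
probs = torch.softmax(logits, dim=0)
```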
Step 3: training a driver behavior recognition model based on the human body multi-part characteristics.
The network model was built with the Caffe open source tool, and the entire training process ran on an Intel Core i7 server with an NVIDIA TITAN X GPU under the Ubuntu 18.04 operating system. Network parameters are optimized with stochastic gradient descent.
Training is divided into Convolutional Pose Machines model training and VGG-16 model training. In the training of the Convolutional Pose Machines model, let the ground-truth response map of a keypoint $p$ be $b_*^p(Y_p = z)$ and the response map output by the model be $b_t^p(Y_p = z)$; the Loss function of each stage is then:

$$f_t = \sum_{p=1}^{P+1} \sum_{z \in \mathcal{Z}} \left\| b_t^p(z) - b_*^p(z) \right\|_2^2$$

The total Loss over the four stages is:

$$F = \sum_{t=1}^{4} f_t$$

VGG-16, in turn, is trained to reduce the loss of the softmax layer based on the behavior labels of the samples. If $P(\alpha \mid I, r)$ is the probability given by softmax that the driver behavior belongs to class $\alpha$, then for one batch of training samples the loss function is:

$$L = -\frac{1}{M} \sum_{i=1}^{M} \log P(l_i \mid I_i, r)$$

where $l_i$ is the correct behavior label of image $I_i$ and $M$ is the batch size. Parameters are initialized with a model trained on the ImageNet-1K data set to accelerate convergence. The training learning rate was 0.0001, the batch size was 32, and the number of iterations was about 7000.
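Under the stated hyperparameters (stochastic gradient descent, learning rate 0.0001, batch size 32, about 7000 iterations), a minimal training-loop sketch looks like this; the model and data below are random stand-ins for the network and the 29000-image training set described above:

```python
import torch

# Stand-in model and random data; in the patent these are the
# CPM + VGG-16 network and the real training batches.
model = torch.nn.Sequential(torch.nn.Flatten(),
                            torch.nn.Linear(3 * 368 * 368, 10))
opt = torch.optim.SGD(model.parameters(), lr=1e-4)    # SGD, lr = 0.0001
loss_fn = torch.nn.CrossEntropyLoss()                 # softmax loss

for step in range(7000):                              # ~7000 iterations
    imgs = torch.rand(32, 3, 368, 368)                # batch size 32
    labels = torch.randint(0, 10, (32,))
    opt.zero_grad()
    loss_fn(model(imgs), labels).backward()
    opt.step()
```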
Step 4: testing the driver behavior recognition model based on the human body multi-part features. Given a driver behavior image, the test image is normalized to 368 × 368 as the input of the model, and the behavior recognition result of the test image is obtained through forward propagation.
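The test procedure amounts to a single forward pass; a sketch, with `model` standing in for the trained network:

```python
import torch
from PIL import Image
import torchvision.transforms.functional as TF

def predict(model, image_path):
    # Normalize the test image to 368 x 368, forward-propagate once,
    # and return the most probable of the 10 behavior labels (0-9).
    img = Image.open(image_path).convert("RGB")
    x = TF.to_tensor(TF.resize(img, [368, 368])).unsqueeze(0)
    with torch.no_grad():
        probs = torch.softmax(model(x), dim=1)
    return int(probs.argmax(dim=1))
```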
The technical means disclosed in the scheme of the invention are not limited to those disclosed in the above embodiments, but also include technical schemes formed by any combination of the above technical features. It should be noted that modifications and improvements may occur to those skilled in the art without departing from the principles of the invention, and such modifications and improvements are also considered to fall within the scope of the invention.

Claims (4)

1. A driver behavior recognition method based on human body multi-part features, characterized by comprising the following steps:
step 1: creating an image dataset for driver behavior recognition
acquiring sample image data and establishing an image data set, wherein the sample images contain various driver behaviors; dividing the image data set into a training set and a test set, where the drivers in the test sample images are independent of the drivers in the training samples;
step 2: constructing a neural network model
the neural network model consists of a Convolutional Pose Machines model and a convolutional neural network; the convolutional neural network module adopts a VGG-16 model, and the fully connected layers for extracting local-region and global-region features are independent of each other while the convolutional layers are shared; this step specifically comprises the following steps:
step 201: labeling 5 keypoints in the training samples and test samples, the keypoints comprising: head, right hand, right elbow joint, left hand, and left elbow joint; the training sample and test sample images each include various driver behaviors;
step 202: define $Y_p \in \mathcal{Z}$ as the pixel position of the $p$-th keypoint, where $\mathcal{Z}$ denotes the set of all positions $(u, v)$ in the picture; the goal of human keypoint localization is to predict the positions $Y = (Y_1, \ldots, Y_P)$ of all keypoints in the image; the pose machine is composed of a sequence of multi-class classifiers $g_t(\cdot)$;

at each stage $t \in \{1, \ldots, T\}$ and for each candidate keypoint position $Y_p = z$, $z \in \mathcal{Z}$, the classifier $g_t(\cdot)$ outputs a confidence value based on the features $x_z$ extracted at image location $z$ and on contextual information from the neighborhood of each $Y_p$ output by the classifiers of the previous stage; in particular, for the first stage, i.e. $t = 1$, the confidence values output by classifier $g_1(\cdot)$ are shown in formula (1):

$$g_1(x_z) \rightarrow \{\, b_1^p(Y_p = z) \,\}_{p \in \{0, \ldots, P\}} \quad (1)$$

where $b_1^p(Y_p = z)$ is the score output by classifier $g_1$ for the $p$-th keypoint at image position $z$ in stage 1;

the confidence values of keypoint $p$ at every position $z = (u, v)^{\mathrm{T}}$ in the image are collected into a map $\mathbf{b}_t^p \in \mathbb{R}^{\omega \times h}$, where $\omega$ and $h$ are the width and height of the image respectively; one then obtains formula (2):

$$\mathbf{b}_t^p[u, v] = b_t^p(Y_p = z) \quad (2)$$

the set of confidence maps of all human keypoints is defined as $\mathbf{b}_t \in \mathbb{R}^{\omega \times h \times (P+1)}$, where $P+1$ stands for the $P$ human keypoints plus the background;

in each subsequent stage, for keypoint $p$, the classifier gives a confidence value based on the feature information $x'_z$ of the input image and on the contextual feature information derived from the previous-stage confidence maps of formula (2), namely formula (3):

$$g_t\big(x'_z,\; \psi_t(z, \mathbf{b}_{t-1})\big) \rightarrow \{\, b_t^p(Y_p = z) \,\}_{p \in \{0, \ldots, P\}} \quad (3)$$

where $\psi_{t>1}(\cdot)$ maps the confidence set $\mathbf{b}_{t-1}$ to context features, $x'$ denotes the image features used after the first stage, and in the original pose machine architecture $x = x'$;
step 203: adopting a CPM half-body network model, wherein the model comprises the following four stages:
the first stage consists of a basic convolutional neural network comprising a white convolution module of 7 convolutional layers and 3 pooling layers; the convolution module is fully convolutional, and its convolutional layers do not change the image size; after three pooling operations on the input image, a response map is finally output for each keypoint, P+1 in total;
the second stage, via a serial structure, takes as input the fusion of the convolution result of the original picture, the response maps output by the first stage, and the center constraint generated by a Gaussian template; the final output is P+1 response maps;
the third stage and the fourth stage replace the convolution result of the original picture with the staged convolution result of the second stage, and other parts remain unchanged;
step 204: carrying out loss calculation and error propagation on the output of each stage, computing a response map for each keypoint at each scale, and accumulating the response maps as the final total response map;
step 205: after the positions of the keypoints are obtained, drawing corresponding rectangular regions according to those positions;
step 206: performing forward computation on the global image through VGG-16 to obtain the corresponding feature vector, and mapping the keypoint regions onto the feature map using the RoI Pooling layer;
step 207: cascading the 5 keypoint feature vectors and the global feature vector as the final feature vector, classifying it with softmax, and outputting the corresponding driver behavior;
step 3: training of a driver behavior recognition model based on human body multi-part features
building the network model, training the network model, and optimizing the network parameters with stochastic gradient descent;
step 4: testing a driver behavior recognition model based on human body multi-part characteristics
given a driver behavior image, normalizing the size of the test image, taking the normalized image as the input of the model, and obtaining the behavior recognition result of the test image through forward propagation.
2. The driver behavior recognition method based on human body multi-part features according to claim 1, characterized in that the image features x' used in the subsequent stages of step 202 differ from those used in the first stage, and the classifier $g_t(\cdot)$ uses a random forest algorithm.
3. The driver behavior recognition method based on human body multi-part features according to claim 1, characterized in that step 3 specifically comprises Convolutional Pose Machines model training and VGG-16 model training;
in the training of the Convolutional Pose Machines model, let the ground-truth response map of a keypoint $p$ be $b_*^p(Y_p = z)$ and the response map output by the model be $b_t^p(Y_p = z)$; the Loss function of each stage is then:

$$f_t = \sum_{p=1}^{P+1} \sum_{z \in \mathcal{Z}} \left\| b_t^p(z) - b_*^p(z) \right\|_2^2$$

the total Loss over the four stages is:

$$F = \sum_{t=1}^{4} f_t$$

the training of VGG-16 reduces the loss of the softmax layer based on the behavior labels of the samples; if $P(\alpha \mid I, r)$ is the probability given by softmax that the driver behavior belongs to class $\alpha$, then for one batch of training samples the loss function is:

$$L = -\frac{1}{M} \sum_{i=1}^{M} \log P(l_i \mid I_i, r)$$

where $l_i$ is the correct behavior label of image $I_i$ and $M$ is the batch size.
4. The driver behavior recognition method based on human body multi-part features according to claim 1, characterized in that in step 3 the VGG-16 training initializes its parameters with a model trained on an ImageNet-1K data set.
CN201910483000.4A 2019-06-04 2019-06-04 Driver behavior recognition method based on human body multi-part characteristics Active CN110298257B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910483000.4A CN110298257B (en) 2019-06-04 2019-06-04 Driver behavior recognition method based on human body multi-part characteristics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910483000.4A CN110298257B (en) 2019-06-04 2019-06-04 Driver behavior recognition method based on human body multi-part characteristics

Publications (2)

Publication Number Publication Date
CN110298257A CN110298257A (en) 2019-10-01
CN110298257B true CN110298257B (en) 2023-08-01

Family

ID=68027575

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910483000.4A Active CN110298257B (en) 2019-06-04 2019-06-04 Driver behavior recognition method based on human body multi-part characteristics

Country Status (1)

Country Link
CN (1) CN110298257B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111259719B (en) * 2019-10-28 2023-08-25 浙江零跑科技股份有限公司 Cab scene analysis method based on multi-view infrared vision system
CN111160162B (en) * 2019-12-18 2023-04-18 江苏比特达信息技术有限公司 Cascaded driver human body posture estimation method
CN111717210B (en) * 2020-06-01 2022-11-11 重庆大学 Detection method for separation of driver from steering wheel in relative static state of hands
CN111832526A (en) * 2020-07-23 2020-10-27 浙江蓝卓工业互联网信息技术有限公司 Behavior detection method and device
CN113569817B (en) * 2021-09-23 2021-12-21 山东建筑大学 Driver attention dispersion detection method based on image area positioning mechanism
CN117523664A (en) * 2023-11-13 2024-02-06 书行科技(北京)有限公司 Training method of human motion prediction model, related method and related product


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108309311A (en) * 2018-03-27 2018-07-24 北京华纵科技有限公司 A kind of real-time doze of train driver sleeps detection device and detection algorithm
CN109117719A (en) * 2018-07-02 2019-01-01 东南大学 Driving gesture recognition method based on local deformable partial model fusion feature
CN109543627A (en) * 2018-11-27 2019-03-29 西安电子科技大学 A kind of method, apparatus and computer equipment judging driving behavior classification

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Yuanzhouhan Cao et al., "Leveraging Convolutional Pose Machines for Fast and Accurate Head Pose Estimation", 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems, 2018-10-05, pp. 1089-1094 *

Also Published As

Publication number Publication date
CN110298257A (en) 2019-10-01

Similar Documents

Publication Publication Date Title
CN110298257B (en) Driver behavior recognition method based on human body multi-part characteristics
CN110427867B (en) Facial expression recognition method and system based on residual attention mechanism
CN110033002B (en) License plate detection method based on multitask cascade convolution neural network
US20210110180A1 (en) Method and apparatus for traffic sign detection, electronic device and computer storage medium
CN111837156A (en) Vehicle re-identification techniques utilizing neural networks for image analysis, viewpoint-aware pattern recognition, and generation of multi-view vehicle representations
CN111033512A (en) Motion control device for communication with autonomous vehicle based on simple two-dimensional plane camera device
CN109101914A (en) Pedestrian detection method and device based on multiple scales
CN110379020A (en) Laser point cloud colorization method and device based on a generative adversarial network
KR20210101313A (en) Face recognition method, neural network training method, apparatus and electronic device
CN111931683B (en) Image recognition method, device and computer readable storage medium
EP4105600A2 (en) Method for automatically producing map data, related apparatus and computer program product
CN111767831B (en) Method, apparatus, device and storage medium for processing image
CN111462140B (en) Real-time image instance segmentation method based on block stitching
CN110348463A (en) Method and apparatus for identifying a vehicle
CN115588126A (en) GAM, CARAFE and SnIoU fused vehicle target detection method
CN113673527B (en) License plate recognition method and system
CN109886338A (en) Intelligent vehicle test image annotation method, device, system and storage medium
CN111274946B (en) Face recognition method, system and equipment
CN115620268A (en) Multi-modal emotion recognition method and device, electronic equipment and storage medium
CN113298044B (en) Obstacle detection method, system, device and storage medium based on positioning compensation
CN113343903B (en) License plate recognition method and system in natural scene
Visaria et al. TSRSY-Traffic Sign Recognition System using Deep Learning
CN105678321B (en) Human posture estimation method based on a fusion model
CN115352454A (en) Interactive auxiliary safe driving system
CN114296545A (en) Unmanned aerial vehicle gesture control method based on vision

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant