CN111582086A - Fatigue driving identification method and system based on multiple characteristics

Publication number: CN111582086A
Application number: CN202010338222.XA
Authority: CN (China)
Original language: Chinese (zh)
Legal status: Pending
Inventors: 胡峰松, 彭清舟, 徐蓉, 程哲坤
Applicant/Assignee: Hunan University; CERNET Corp
Prior art keywords: fatigue, eye, state, image, value

Classifications

    • G06V 20/597: Recognising the driver's state or behaviour, e.g. attention or drowsiness
    • G06F 18/214: Generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F 18/2411: Classification techniques based on the proximity to a decision surface, e.g. support vector machines
    • G06V 10/44: Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; connectivity analysis, e.g. of connected components
    • G06V 40/168: Human faces: feature extraction; face representation
    • G06V 40/172: Human faces: classification, e.g. identification


Abstract

The invention discloses a fatigue driving identification method and system based on multiple characteristics. The method and system preprocess the images, which not only filters out noise but also avoids the poor image quality and low detection precision caused by external environmental factors; the AdaBoost algorithm detects the human face stably, quickly and efficiently, reducing the complexity of face detection; the scale-space-based facial target tracking algorithm adopts an adaptive high-confidence update strategy, so that when an error occurs in the target tracking stage the confidence of the target detection is low and the model is not updated, which effectively reduces the risk of tracker drift and improves tracking precision; and eye state recognition with an SVM classifier improves the accuracy of eye state recognition. The method therefore has high recognition accuracy and strong adaptability to the environment.

Description

Fatigue driving identification method and system based on multiple characteristics
Technical Field
The invention belongs to the technical field of driving safety, and particularly relates to a fatigue driving identification method and system based on multiple characteristics.
Background
Driver fatigue detection technology has become increasingly mature, and fatigue detection methods can be divided into three main types:
The first type is vehicle-based detection, which judges the fatigue state mainly by collecting vehicle driving parameters and analyzing abnormal fluctuations in those parameters. Such methods include detecting the steering-wheel rotation angle, the steering-wheel grip force, the vehicle speed, lane deviation, the brake-pedal force and the accelerator-pedal force. Most current vehicles are equipped with various sensors that collect real-time parameters such as driving speed, steering-wheel angle, fuel consumption and engine speed, and the driver's fatigue state can be inferred indirectly by analyzing these data individually or jointly. However, the analysis results are easily affected by personal driving habits and by external factors such as weather, vehicle characteristics and road conditions, so the approach has weak robustness and low recognition accuracy. Moreover, an abnormality can only be detected when the driver is already close to having a traffic accident, so no early warning can be given. The results of this approach are therefore better used as auxiliary rather than primary detection indicators.
The second type is driver-based detection, which can be further divided into methods based on the driver's physiological parameters and methods based on the driver's behavioral characteristics. Research shows that when a driver is fatigued, physiological responses slow down, the body's reaction to external stimuli is delayed, and physiological indicators deviate from their normal values. Physiological parameters collected by sensors can therefore be used to judge whether the driver is fatigued; detection is mainly based on electroencephalogram (EEG), electrocardiogram (ECG) and electromyogram (EMG) signals. In practical fatigue detection, however, physiological parameters vary greatly between individuals and are easily influenced by the driver's sex, age and body type, which makes it difficult to apply a unified standard and limits practical application. When a driver is drowsy, his or her facial features differ from those in the awake state, so analyzing the driver's facial feature data with computer vision technology is an effective way to detect fatigue driving in real time. The characteristic parameters extracted by such methods mainly include eye-movement features (blink frequency, PERCLOS, degree of eye opening and closing, gaze direction, etc.), mouth state (yawning frequency, etc.) and head position. Because changes in the head and facial features are relatively significant, they are easy to detect; however, the feature extraction, and hence the detection result, is susceptible to occlusion, illumination and other factors, which lowers recognition accuracy.
The third type is detection based on information fusion, which integrates multiple fatigue features. Compared with fatigue detection based on a single type of feature information, its detection precision and reliability are improved; however, extracting multiple features and building the model with the prior art remains very challenging, and the resulting fatigue detection models adapt poorly to complex environments.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a fatigue driving identification method and system based on multiple characteristics, and aims to solve the problems of low identification accuracy and poor adaptability of the existing detection method.
The invention solves the technical problems through the following technical scheme: a fatigue driving identification method based on multiple characteristics comprises the following steps:
step 1: acquiring a video single-frame image in real time, and preprocessing the video single-frame image;
step 2: performing face detection on the preprocessed video image by adopting an AdaBoost algorithm based on Haar-like characteristics, and tracking the detected face in real time by adopting a target tracking algorithm based on a scale space;
step 3: locating the feature points of the human face, locating the eye region and the mouth region according to the located feature points, identifying the eye state with an SVM (support vector machine) classifier, and identifying the mouth state by calculating the mouth aspect ratio;
step 4: calculating eye fatigue parameters and mouth fatigue parameters from the eye state and the mouth state respectively, and calculating head fatigue parameters from the position information of the located feature points;
step 5: identifying the fatigue state of the driver, and giving an early warning, according to the eye fatigue parameters, mouth fatigue parameters and head fatigue parameters.
The method of the invention preprocesses the image, which not only filters out noise but also avoids the poor image quality and low detection precision caused by external environmental factors; the AdaBoost algorithm detects the face stably, quickly and efficiently, reducing the complexity of face detection; the scale-space-based facial target tracking algorithm adopts an adaptive high-confidence update strategy, so that when an error occurs in the target tracking stage the confidence of the target detection is low and the model is not updated, which effectively reduces the risk of tracker drift and improves tracking precision; and eye state recognition with an SVM classifier improves the accuracy of eye state recognition. The method therefore has high recognition accuracy and strong adaptability to the environment.
Further, in step 1, the video single-frame image preprocessing process includes:
step 1.1: carrying out smooth denoising processing on a video single-frame image;
step 1.2: and carrying out illumination compensation processing on the video image subjected to the smooth denoising processing.
By preprocessing the video image, noise interference in the image can be filtered, the image is prevented from being influenced by external environment factors, the quality of the image is improved, and the accuracy of subsequent detection and analysis is improved.
Further, in the step 1.1, a smooth denoising process is performed on the video image by using adaptive median filtering.
Adaptive median filtering balances denoising and the retention of image detail even when the noise density is high; it effectively filters the noise present in the original image, preserves the useful information in the image while improving its quality, raises the signal-to-noise ratio, and makes the image better suited to the application scenario.
Further, in step 1.2, an illumination equalization algorithm based on a dynamic threshold is used to perform illumination compensation on video images of differing illumination brightness.
This avoids the problem that the face cannot be accurately detected, and its features cannot be extracted, when the image is unevenly illuminated, and prevents the image from being affected by factors such as the illumination intensity and the color and position of the light source.
Further, in step 2, the specific operation steps of face detection with the AdaBoost algorithm are as follows:
step 2.11: calculating the Haar-like features of the image using the integral image;
step 2.12: for the Haar-like features, selecting optimal weak classifiers through training iterations and combining the weak classifiers into strong classifiers by weighted voting;
step 2.13: connecting the strong classifiers obtained by training in series to form a cascade classifier;
step 2.14: performing face detection on the image with the cascade classifier.
Further, in step 2, a scale-space-based target tracking algorithm is used to track the detected face in real time; the specific operation steps are as follows:
Step 2.21: taking the face region and scale obtained by face detection as the initial position P_1 and scale S_1 of the target, and training a position correlation filter and a scale correlation filter on the face region to obtain a position model and a scale model;
Step 2.22: according to the target position P_{t-1} and scale S_{t-1} of the previous frame I_{t-1}, collecting from the current frame I_t a feature sample whose size is twice that of the previous frame's target, and using this feature sample together with the position model of the previous frame to calculate the maximum response value of the position correlation filter, which gives the new target position P_t;
Step 2.23: with the new target position P_t as the center point, using the one-dimensional scale correlation filter to obtain S candidate samples of different scales according to the scaling rule, and extracting d-dimensional features from each candidate sample to obtain the feature sample of the current frame; then, using this feature sample and the scale model, calculating the response values of the 1 × S-dimensional scale correlation filter, where the scale corresponding to the maximum response value is the final target scale S_t;
Step 2.24: if the maximum response value and the average peak-to-correlation energy of the correlation filters in the current frame both satisfy the update-strategy condition, extracting the features f_t^trans and f_t^scale from the current frame I_t according to the position P_t and scale S_t and updating the position model and the scale model; otherwise performing face detection again on the current frame I_t;
the update-strategy condition is that the maximum response value and the average peak-to-correlation energy are greater than the ratios β_1 and β_2 respectively, where β_1 is 0.7 and β_2 is 0.45.
Preferably, the response value of the position or scale correlation filter is calculated as

y_t = F^{-1}{ Σ_{l=1}^{d} Ā_{t-1}^l · Z_t^l / ( B_{t-1} + λ ) }

where F^{-1}(·) denotes the inverse discrete Fourier transform (DFT) and y_t is the resulting response value; d-dimensional features are extracted from each pixel of the feature sample, the feature map of the l-th dimension being denoted f^l with l = 1, 2, ..., d; λ is the coefficient of the regularization term; A_{t-1}^l and B_{t-1} are the numerator and denominator of the filter updated in the previous frame; and Z_t^l is the two-dimensional DFT of each dimension of the feature map of the current frame image.
Further, in step 3, a cascade regression tree algorithm is adopted to locate the feature points of the human face, where the feature points of the human face include eye feature points and mouth feature points.
Further, in step 3, the specific operation of identifying the eye state with the SVM classifier is as follows:
the SVM classifier is trained with the aspect ratio of the human eye and the accumulated difference of black pixels in the binarized eye region as its input features, and the trained SVM classifier is then used to classify and identify the eye state, which improves the accuracy of eye state recognition. The accumulated black-pixel difference F_black of the binarized eye region is obtained by accumulating the frame-to-frame differences in the number of black pixels against an adaptive threshold

T(t) = α · |D(t-1)|,   α ∈ [0, 1]

where n(t) is the number of black pixels in the t-th frame, Δn(t) is the difference in the number of black pixels between the t-th frame and the (t-1)-th frame, D(t-1) is the accumulated difference in the number of black pixels up to frame t-1 in "state 1", and α is a constant between 0 and 1.
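As an illustration of how these eye features could be computed per frame, the following Python sketch counts the black pixels of a binarized eye region and maintains the accumulated difference with the adaptive threshold T(t) = α·|D(t-1)|. The Otsu binarization and the exact accumulation rule are assumptions, since the patent gives the accumulation formula only as an image; the resulting value would be fed, together with the eye aspect ratio, into the SVM classifier.

```python
import cv2
import numpy as np

def black_pixel_count(eye_roi_gray):
    # Binarize the cropped eye region (Otsu threshold assumed) and count black pixels n(t).
    _, binary = cv2.threshold(eye_roi_gray, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return int(np.count_nonzero(binary == 0))

class BlackPixelAccumulator:
    """Tracks n(t), delta_n(t) and the accumulated difference D(t) with the
    adaptive threshold T(t) = alpha * |D(t-1)| described above."""
    def __init__(self, alpha=0.5):          # alpha is a constant in [0, 1]
        self.alpha = alpha
        self.prev_n = None                  # n(t-1)
        self.d = 0.0                        # accumulated difference D(t)

    def update(self, n_t):
        if self.prev_n is None:             # first frame: nothing to difference against
            self.prev_n = n_t
            return self.d
        delta_n = n_t - self.prev_n         # delta_n(t)
        threshold = self.alpha * abs(self.d)
        # Assumed rule: only frame-to-frame changes larger than the adaptive
        # threshold are accumulated; smaller changes reset the accumulator.
        self.d = self.d + delta_n if abs(delta_n) > threshold else 0.0
        self.prev_n = n_t
        return self.d
```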
Further, in step 3, the mouth aspect ratio is denoted MAR; when MAR ≤ 0.4 the mouth is in the closed state; when 0.4 < MAR ≤ 0.8 the mouth is in the normal speaking state; when MAR > 0.8 the mouth is in the yawning state.
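A minimal sketch of this threshold rule, with the MAR value assumed to have been computed already from the mouth feature points:

```python
def mouth_state_from_mar(mar: float) -> str:
    # Thresholds from the description above: <= 0.4 closed,
    # 0.4-0.8 normal speaking, > 0.8 yawning.
    if mar <= 0.4:
        return "closed"
    if mar <= 0.8:
        return "speaking"
    return "yawning"
```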
Further, in step 4, the eye fatigue parameters include the eye-closure frame ratio, the blink frequency and the longest continuous eye-closure time; the mouth fatigue parameters include the yawning frequency; and the head fatigue parameters include the nodding frequency. Preferably, the fatigue state is identified by a weighted summation of the eye fatigue parameters, the mouth fatigue parameters and the head fatigue parameters, with the weighted sum expressed as:
E_fatigue = V_ECR × W_1 + V_MECT × W_2 + V_BF × W_3 + V_NF × W_4 + V_YF × W_5
where E_fatigue is the weighted fatigue value, V_ECR is the eye-closure frame ratio, V_MECT is the longest continuous eye-closure time, V_BF is the blink frequency, V_NF is the nodding frequency, V_YF is the yawning frequency, and W_i are the weights corresponding to the different parameters.
Preferably, when the weighted fatigue value is less than 0.3 the driver is in the awake state; when the weighted fatigue value is greater than or equal to 0.3 and less than 0.7 the driver is in the fatigue state; and when the weighted fatigue value is greater than or equal to 0.7 the driver is in the severe fatigue state.
The invention also provides a fatigue driving recognition system based on multiple characteristics, which comprises:
the image acquisition and processing unit is used for acquiring a video single-frame image in real time and preprocessing the video single-frame image;
the face detection and tracking unit is used for carrying out face detection on the preprocessed video image by adopting an AdaBoost algorithm based on Haar-like characteristics and tracking the detected face in real time by adopting a target tracking algorithm based on a scale space;
the positioning and state recognition unit is used for positioning the feature points of the human face, respectively positioning the eye region and the mouth region according to the positioned feature points, recognizing the eye state by adopting an SVM classifier, and recognizing the mouth state by calculating the aspect ratio of the mouth;
the parameter calculation unit is used for calculating eye fatigue parameters and mouth fatigue parameters according to the eye state and the mouth state respectively and calculating head fatigue parameters according to the positioned feature point position information;
and the fatigue state identification unit is used for identifying and early warning the fatigue state of the driver according to the eye fatigue parameter, the mouth fatigue parameter and the head fatigue parameter.
Advantageous effects
Compared with the prior art, the fatigue driving identification method and system based on multiple characteristics provided by the invention preprocess the image, which not only filters out noise but also avoids the poor image quality and low detection precision caused by external environmental factors; the AdaBoost algorithm detects the face stably, quickly and efficiently, reducing the complexity of face detection; the scale-space-based facial target tracking algorithm adopts an adaptive high-confidence update strategy, so that when an error occurs in the target tracking stage the confidence of the target detection is low and the model is not updated, which effectively reduces the risk of tracker drift and improves tracking precision; and eye state recognition with an SVM classifier improves the accuracy of eye state recognition. The method therefore has high recognition accuracy and strong adaptability to the environment.
Drawings
In order to more clearly illustrate the technical solution of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only one embodiment of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on the drawings without creative efforts.
FIG. 1 is a flow chart of a method of identifying fatigue driving in an embodiment of the present invention;
FIG. 2 is a flow chart of face detection and facial target tracking according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the pixels of rectangular region D_0 and the corresponding integral-image computation in an embodiment of the present invention;
FIG. 4 is a flowchart of target position estimation in the flow of facial target tracking in an embodiment of the present invention;
FIG. 5 is a sample of a scale filter in an embodiment of the invention;
fig. 6 is a target size estimation flow chart in the face target tracking flow in the embodiment of the present invention;
FIG. 7 is a face feature point model in an embodiment of the invention;
FIG. 8 is a diagram illustrating the detection results of facial feature points from different angles according to an embodiment of the present invention;
fig. 9 is a schematic diagram of eye positioning based on feature points in an embodiment of the invention, where fig. 9(a) is a human face feature point model, and fig. 9(b) is a schematic diagram of eye positioning;
FIG. 10 is a schematic diagram of six key points of a human eye in an embodiment of the present invention, with FIG. 10(a) in an open-eye state and FIG. 10(b) in a closed-eye state;
FIG. 11 is a graph of EAR mean results in an embodiment of the present invention;
FIG. 12 shows the number of black pixels in the process of opening and closing the eyes of the human eye according to the embodiment of the present invention;
FIG. 13 is a diagram illustrating the difference between the number of black pixels in two consecutive frames;
FIG. 14 is a cumulative difference of the number of black pixels for the human eye in an embodiment of the invention;
FIG. 15 is a cumulative difference of the number of black pixels for an adaptive threshold human eye in an embodiment of the invention;
FIG. 16 is a schematic diagram of the 10 key points of the mouth in an embodiment of the present invention;
FIG. 17 is a graph showing the results of mouth MAR detection in the embodiment of the present invention;
FIG. 18 is a schematic illustration of the opening and closing process of an embodiment of the present invention;
FIG. 19 is a graph of the optimization results for the EAR threshold and the frame-number threshold K_eye in an embodiment of the present invention;
fig. 20 is a schematic view of a state of a mouth in an embodiment of the present invention;
fig. 21 is a diagram of head motion analysis in an embodiment of the present invention.
Detailed Description
The technical solutions in the present invention are clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As shown in fig. 1, the fatigue driving identification method based on multiple features provided by the present invention includes:
1. Acquire a single video frame in real time and preprocess it; the preprocessing comprises the following specific steps:
and (1.1) performing smooth denoising processing on the video single-frame image by adopting self-adaptive median filtering.
The purpose of denoising the video image is to improve the quality of the captured image while retaining the useful information carried in the original image. Filtering effectively solves the problem of image quality degraded by noise, increases the signal-to-noise ratio, and makes the image better suited to the application scenario. Adaptive median filtering dynamically changes the size of the filtering template starting from a preset template and judges whether the current pixel is noise; if it is, the current pixel value is replaced by the neighborhood median. The processing comprises two steps:

Step A: let A_1 = Z_med - Z_min and A_2 = Z_med - Z_max. If A_1 > 0 and A_2 < 0, go to step B; otherwise increase the size of the filtering template and denote the enlarged size by S_template. If S_template ≤ S_template_max, repeat step A; otherwise let Z_xy = Z_med and output Z_xy.

Step B: let B_1 = Z_xy - Z_min and B_2 = Z_xy - Z_max. If B_1 > 0 and B_2 < 0, output Z_xy; otherwise output Z_med.

Here S_template is the size of the filtering template matrix, the point (x, y) is the center of the filtering template matrix, S_xy denotes the filtering region centered on (x, y), S_template_max is the maximum window size allowed for the filtering template (filter window), Z_min is the minimum pixel value in the filter window, Z_max is the maximum pixel value in the filter window, Z_med is the median of the pixel values in the filter window, and Z_xy is the pixel value at the point (x, y).
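A direct (unoptimized) Python sketch of the adaptive median filter described by steps A and B, assuming a single-channel grayscale image; border pixels are handled by edge padding, which is an implementation choice not specified in the text.

```python
import numpy as np

def adaptive_median_filter(img, s_max=7):
    """Adaptive median filtering following steps A and B above (grayscale image)."""
    pad = s_max // 2
    padded = np.pad(img, pad, mode="edge")
    out = img.copy()
    h, w = img.shape
    for y in range(h):
        for x in range(w):
            s = 3                                  # initial template size
            while True:
                r = s // 2
                window = padded[y + pad - r:y + pad + r + 1,
                                x + pad - r:x + pad + r + 1]
                z_min, z_max = window.min(), window.max()
                z_med = np.median(window)
                z_xy = img[y, x]
                if z_min < z_med < z_max:          # step A satisfied
                    # step B: keep the pixel unless it is an impulse (min/max)
                    out[y, x] = z_xy if z_min < z_xy < z_max else z_med
                    break
                s += 2                             # enlarge the template
                if s > s_max:                      # template limit reached
                    out[y, x] = z_med
                    break
    return out
```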
In order to compare the smooth denoising effects of different filtering methods on images, salt and pepper noise with the intensity of 0.1 and Gaussian noise with the average value of 0.1 and the variance of 20 are respectively added to a test image, then the salt and pepper noise and the Gaussian noise images are respectively subjected to smooth denoising processing by using different filtering methods, and the processing results of different methods are compared and analyzed. The comparison shows that the denoising effect of the median filtering and the self-adaptive median filtering on the salt-pepper noise is obviously better than that of the other two methods, and the mean filtering has a better denoising effect on the Gaussian noise.
Table 1: comparison of algorithm indices for salt-and-pepper noise
In order to more objectively verify the denoising effect of each method, the mean square error MSE and the peak signal-to-noise ratio PSNR before and after image processing and the algorithm running time T of each filtering method are respectively calculated, the calculation formulas of MSE and PSNR are shown in formulas (1) and (2), and the result analysis is shown in tables 1 and 2.
MSE = (1 / (M × N)) Σ_{x=1}^{M} Σ_{y=1}^{N} [ f(x, y) - f*(x, y) ]²        (1)
Where f (x, y) represents a noisy image of size M × N, and f*(x, y) represents the filtered denoised image.
PSNR = 10 · log_10( MAX² / MSE )        (2)
where MAX is the maximum possible pixel value of the image.
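A short Python sketch of these two quality measures, assuming single-channel images of identical size:

```python
import numpy as np

def mse_psnr(original, denoised, max_val=255.0):
    """Mean squared error and peak signal-to-noise ratio, equations (1) and (2)."""
    diff = original.astype(np.float64) - denoised.astype(np.float64)
    mse = float(np.mean(diff ** 2))
    psnr = 10.0 * np.log10((max_val ** 2) / mse) if mse > 0 else float("inf")
    return mse, psnr
```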
Table 2: comparison of algorithm indices for Gaussian noise
Through comprehensive analysis, the three filtering methods can achieve a certain degree of denoising effect, but a filtering template needs to be set in advance, edge details and contours in an image are blurred when the image is filtered and denoised, and the image needs to be sharpened at a later stage to highlight edge information of the image. Therefore, in order to retain image detail information to the maximum extent while smoothing denoising, the invention uses the adaptive filtering algorithm to improve the image denoising capability.
(1.2) Perform illumination compensation on video images of differing illumination brightness using an illumination equalization algorithm based on a dynamic threshold.
The color information of the captured video image is easily affected by factors such as the illumination brightness and the color and position of the light source, which leaves the image unevenly illuminated. To detect the face accurately and extract its features from images captured under such uneven illumination, the image must first undergo illumination equalization. The image is converted from the RGB color space to the YCbCr color space using the standard conversion of equation (3), and the illumination equalization is performed in that space.
The processing is divided into two steps: detecting reference white points based on a dynamic threshold, and adjusting the image pixels. Selection of the reference white points: the image is first divided into M blocks of an appropriate aspect ratio (block size), and the averages M_b and M_r of Cb and Cr are calculated for each block; the average absolute differences D_b and D_r are then calculated according to equation (4):

D_b = (1/N) Σ_{i,j} | C_b(i, j) - M_b |,   D_r = (1/N) Σ_{i,j} | C_r(i, j) - M_r |        (4)

where N is the total number of pixels in the image block and C_b(i, j), C_r(i, j) are the Cb and Cr (chroma) values of pixel (i, j). For each block, if D_b and D_r are too small, the color distribution of that block is relatively uniform and it needs no further processing. The M_b, M_r, D_b and D_r values of the blocks to be processed are then summed and averaged to give M_b, M_r, D_b and D_r for the whole image, and the pixels whose chroma values satisfy relation (5), i.e. whose Cb and Cr values lie within the near-white ranges determined by M_b, D_b and M_r, D_r, form the set of candidate points in the near-white region of the image. Among these near-white pixels, the top 10% by brightness value (Y value) are selected as the reference white points.
Adjustment of the image: to keep the brightness of the whole image consistent, the gains R_gain, G_gain and B_gain of the three channels are obtained from the averages of the reference white points in the R, G and B channels and the maximum brightness value of the whole image (the maximum Y value):

R_gain = Y_max / R_avg,   G_gain = Y_max / G_avg,   B_gain = Y_max / B_avg        (6)

where R_avg, G_avg and B_avg are the averages of the reference white points in the R, G and B channels and Y_max is the maximum brightness value of the pixels in the image. The value of every pixel in the image is then adjusted as

R′ = R · R_gain,   G′ = G · G_gain,   B′ = B · B_gain        (7)

where R, G and B are the original pixel values of the image and R′, G′ and B′ are the adjusted pixel values.
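A simplified Python sketch of this dynamic-threshold illumination equalization is given below. For brevity it skips the per-block M_b/M_r/D_b/D_r screening and simply takes the brightest fraction of pixels as the reference white points before applying the channel gains of equations (6) and (7); the quantile selection and the use of OpenCV's YCrCb conversion are assumptions.

```python
import cv2
import numpy as np

def illumination_equalize(bgr, white_ratio=0.1):
    """Scale each channel by gain = Y_max / channel_mean of the reference white points."""
    ycrcb = cv2.cvtColor(bgr, cv2.COLOR_BGR2YCrCb)
    y = ycrcb[:, :, 0].astype(np.float64)
    # Take the brightest `white_ratio` fraction of pixels as reference white points.
    mask = y >= np.quantile(y, 1.0 - white_ratio)
    b_avg, g_avg, r_avg = (bgr[:, :, c][mask].mean() for c in range(3))
    y_max = y.max()
    gains = np.array([y_max / b_avg, y_max / g_avg, y_max / r_avg])  # BGR order
    return np.clip(bgr.astype(np.float64) * gains, 0, 255).astype(np.uint8)
```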
2. The preprocessed video image is subjected to face detection by adopting an AdaBoost algorithm based on Haar-like characteristics, and the detected face is tracked in real time by adopting a target tracking algorithm based on scale space, as shown in FIG. 2.
Haar-like features can be divided into several categories: linear features, edge features, point (center-surround) features and diagonal features. A Haar-like feature value is the difference between the sum of the gray values of all pixels in the white rectangle and the sum of the gray values of all pixels in the black rectangle, which reflects the gray-level variation of the image. Haar-like features can effectively extract the texture features of the image, and feature values at different positions and scales are obtained by translating and scaling the template.
Because the category, size and position of the Haar-like rectangular template all vary, even a small detection template or window contains a very large number of rectangular feature values: once the form of the features is fixed, the number of rectangular features inside a detection window of size 24 × 24 can reach the hundreds of thousands. Given this large number of features, computing them quickly is essential.
The integral graph algorithm can calculate the pixel sum of any rectangular area in the image only by traversing the image once, and the calculation efficiency of the image characteristic value is improved to a great extent. The main idea is as follows: the sum of pixels from the starting point to each point of each rectangular region of the image is calculated, the value of each region is calculated and is stored in an array as an element, when the pixel sum of a certain region needs to be calculated subsequently, the value of a target region can be obtained by directly using an array index, recalculation is not needed, and calculation is accelerated.
The value at any point (i, j) of the integral image is the sum of the gray values of all pixels in the rectangular area enclosed by the upper-left corner of the gray image and the current point. The integral image is computed as in equation (8):

I′(i, j) = Σ_{x ≤ i, y ≤ j} I(x, y)        (8)

where I(x, y) is the gray value at point (x, y). The integral image can also be computed by the iterative simplification

I′(i, j) = I′(i, j-1) + I′(i-1, j) - I′(i-1, j-1) + I(i, j)        (9)

with the boundary conditions I′(i, -1) = 0, I′(-1, j) = 0 and I′(-1, -1) = 0.
Once the integral image has been obtained, the feature value of a rectangular region depends only on the integral-image values at the corner points of the feature rectangle, so the time needed to compute a feature value is constant regardless of the scale of the rectangle. The difference between the pixel sums of two rectangular areas is obtained with simple additions and subtractions of the integral-image values at the corner points, so the feature value of any rectangular region can be computed quickly.
Take region D_0 in FIG. 3 as an example to illustrate the integral image algorithm:

integral at corner 1: I′_1 = Sum(A_0);  integral at corner 2: I′_2 = Sum(A_0) + Sum(B_0);
integral at corner 3: I′_3 = Sum(A_0) + Sum(C_0);  integral at corner 4: I′_4 = Sum(A_0) + Sum(B_0) + Sum(C_0) + Sum(D_0);

where Sum(N_1) denotes the sum of all pixels in region N_1. The sum of all pixels of region D_0 is therefore

Sum(D_0) = I′_1 + I′_4 - I′_2 - I′_3        (10)
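A small Python sketch of the integral image and the four-corner rectangle sum of equation (10); the cumulative-sum construction is equivalent to the iteration of equation (9).

```python
import numpy as np

def integral_image(gray):
    # I'(i, j) = sum of all pixels above and to the left of (i, j), inclusive.
    return gray.astype(np.int64).cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, top, left, bottom, right):
    """Pixel sum of the inclusive rectangle [top..bottom, left..right]
    using the four corner values, as in equation (10)."""
    total = ii[bottom, right]
    if top > 0:
        total -= ii[top - 1, right]
    if left > 0:
        total -= ii[bottom, left - 1]
    if top > 0 and left > 0:
        total += ii[top - 1, left - 1]
    return int(total)
```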
The AdaBoost algorithm is a classifier algorithm. Its principle is as follows: the Haar-like features of an image are computed quickly using the integral image; optimal weak classifiers are selected through training iterations and combined into strong classifiers by weighted voting; the strong classifiers obtained from training are then connected in series into a cascade classifier, which improves both the detection speed and the accuracy of the classifier. The algorithm trains a number of weak classifiers from the probability distribution of the positive and negative sample sets, updating the sample weights once per round; after T rounds, T weak classifiers have been obtained, and the strong classifier is finally obtained by weighted combination.
Given a training data set

T = {(x_i, y_i)},  i = 1, 2, ..., N_T        (11)

where x_i is a training image and y_i ∈ {-1, +1} is the correct class label of x_i: y_i = 1 means the image is a positive sample, i.e. a face image, and y_i = -1 means the image is a negative sample, i.e. it contains no face. The training procedure for the samples is as follows:

① Initialize the weight distribution of the training data so that every sample has the same weight:

D_1 = (w_11, ..., w_1i, ..., w_1N),   w_1i = 1 / N_T

where D_1 denotes the first iteration and w_1i is the weight of the i-th sample in the first iteration.

② For m = 1, 2, ..., T, where m is the iteration index, learn on the data with weight distribution D_m to obtain the weak classifier H_m(x): x → {-1, +1} with the lowest error; its classification error rate is

e_m = Σ_{i=1}^{N_T} w_mi · I( H_m(x_i) ≠ y_i )

③ The weight coefficient of the weak classifier in each iteration is

α_m = (1/2) · ln( (1 - e_m) / e_m )

④ Update the weight distribution of the training set:

D_{m+1} = (w_{m+1,1}, ..., w_{m+1,i}, ..., w_{m+1,N})

w_{m+1,i} = ( w_mi / Z_m ) · exp( -α_m y_i H_m(x_i) )

where Z_m is the normalization factor, Z_m = Σ_{i=1}^{N_T} w_mi · exp( -α_m y_i H_m(x_i) ).

⑤ By iterating in this way, the weak classifiers are combined to finally obtain the strong classifier:

H(x) = sign( Σ_{m=1}^{T} α_m H_m(x) )
because an AdaBoost face detection algorithm based on haar-like characteristics is packaged in an open source library OpenCV, the invention carries out face detection by utilizing a haarcascade _ front _ default.xml classifier file which is self-trained in OpenCV, and a CascadeClassifier is a cascade classifier class defined by the OpenCV, wherein a multi-scale detection method is packaged, an image to be detected is input, the face detection is carried out on the image to be detected by loading the xml classifier file for detecting the face, and a possible face area rectangular frame is output.
Table 3 shows the face-detection accuracy of the AdaBoost algorithm and of a threshold skin-color model, with and without interference from backgrounds of similar skin color. Analysis of Table 3 shows that when the video image contains a background of similar skin color, or other parts of the body with similar skin color, this interference makes the detection range of a face-detection algorithm based on a threshold skin-color model insufficiently accurate and can cause false detections. AdaBoost performs face classification and detection mainly from Haar features, so it can exclude the interference of similar skin colors; it has high computational efficiency and accuracy and can detect the face quickly without feature screening, which is why the AdaBoost algorithm is used for face detection.
Table 3: comparison of face-detection accuracy between the AdaBoost algorithm and the threshold skin-color model
Considering that the variation range of the face position of the driver is small in the actual driving process, if the face detection and positioning is performed on each frame of the video image, not only the time complexity is increased, but also the interrelation between the continuous frames cannot be fully utilized. Therefore, in order to better position the face in the subsequent video image and improve the accuracy and robustness of detection, after the face is detected for the first time, the detected face is tracked in real time by adopting a target tracking algorithm based on a scale space.
The Discriminative Scale Space Tracker (DSST) algorithm is an improvement on the MOSSE algorithm. Although MOSSE improves tracking accuracy while reducing computational complexity, greatly improving the performance of correlation-filter tracking, its input when solving the filter is the grayscale feature of the image, and such a low feature dimension cannot reflect characteristics such as the texture and edges of the target well. It also only estimates the translational motion of the center of the target region between frames and does not consider the scale change of the target during motion, so it cannot track the target well when its scale changes. To address these shortcomings of MOSSE, M. Danelljan, G. Häger, F. Khan et al. proposed a joint translation-scale tracking method using correlation filters in a three-dimensional scale space. DSST replaces the original grayscale feature with the HOG feature, so that the target can be described better. In addition, to better adapt to scale changes of the tracked target, a scale correlation filter is added, and the position change and the scale change are tracked by the two filters separately: a two-dimensional position filter (translation filter) evaluates the change in target position, a one-dimensional scale filter performs the target scale estimation, and the three-dimensional joint position-and-scale (translation-scale) filtering is used for target localization. The two filters are relatively independent and can therefore be trained and tested with different features and different ways of computing them.
(1) Position dependent filter
Filter training
A sample twice the size of the target is collected, d-dimensional features are extracted from each pixel of the sample, and the feature map is denoted f^l, l = 1, 2, ..., d. To construct the optimal correlation filter h, the following objective function is minimized over the feature dimensions l:

ε = ‖ Σ_{l=1}^{d} h^l ★ f^l - g ‖² + λ Σ_{l=1}^{d} ‖ h^l ‖²        (18)

where ★ denotes circular correlation, l indexes a feature dimension, and λ is the coefficient of the regularization term, set to 0.01; the λ term avoids a zero denominator when solving for the frequency-domain parameters of the filter and also controls the range of variation of the filter parameters (the smaller λ is, the larger that range). The desired correlation output g is a Gaussian function with parameterized standard deviation, and f^l, h^l and g all have the same dimensions and size.

Taking the Fourier transform of equation (18), setting the partial derivative to zero and solving gives the filter

H^l = ( Ḡ · F^l ) / ( Σ_{k=1}^{d} F̄^k · F^k + λ )        (19)

where capital letters denote the corresponding values after the discrete Fourier transform (DFT); that is, F^l is obtained by taking the two-dimensional DFT of each feature dimension of f, and G by taking the two-dimensional DFT of g.
For all training samples f_1, f_2, ..., f_t, to simplify the computation of equation (19), the numerator A_t^l and denominator B_t of the filter H_t^l are updated separately as

A_t^l = (1 - η) · A_{t-1}^l + η · Ḡ_t · F_t^l
B_t = (1 - η) · B_{t-1} + η · Σ_{k=1}^{d} F̄_t^k · F_t^k        (20)

where η is the learning rate (η = 0.025) and t is the number of samples. Substituting G and F into these equations gives the value of the filter template H; the simplified form of equation (19) is

H_t^l = A_t^l / ( B_t + λ )        (21)
estimation of target position
The target position estimation process is shown in FIG. 4. For the feature map z_t of the t-th frame image, the two-dimensional DFT of each dimension z^l is likewise computed to obtain Z_t^l, and the new target position is determined by the maximum of the correlation-filter response y_t obtained via the inverse DFT:

y_t = F^{-1}{ Σ_{l=1}^{d} Ā_{t-1}^l · Z_t^l / ( B_{t-1} + λ ) }        (22)

where A_{t-1}^l and B_{t-1} are the numerator and denominator of the filter updated in the previous frame.
(2) Scale filter
The model updating and the filter response solving process in the training process of the scale filter are consistent with the position filter.
Filter training
As shown in FIG. 5, scale sampling is performed with the target position as the center, and the scales are selected according to

a^n · P × a^n · R,   n ∈ { -⌊(S-1)/2⌋, ..., ⌊(S-1)/2⌋ }        (23)

where P × R is the target size in the current frame, a is the scale factor (a = 1.02) and S is the size of the scale filter (S = 33).

The target image is scaled according to equation (23), S samples of different scales are selected, and d-dimensional HOG features are extracted from each sample to form a pyramid with S layers. Taking these features as the training sample, the feature f^l of each dimension is a 1 × S vector; a one-dimensional DFT of each feature dimension of f gives F^l, and a one-dimensional DFT of g gives G, where g is the desired output response constructed from a Gaussian function, of size 1 × S. The correlation filter H is then obtained according to equation (21) and used to predict the output scale.
Size estimation
As shown in FIG. 6, in a new frame the two-dimensional position correlation filter is first used to determine the new candidate position of the target. The one-dimensional scale correlation filter then takes the current center position as the center point and obtains S candidate blocks of different scales; d-dimensional features are extracted from each block to form a new feature map z, and the DFT of each dimension gives Z^l. The response y is then obtained from equation (22); y is a vector of dimension 1 × S, and the scale corresponding to the maximum value in y is the final target scale.
Because the DSST algorithm requires the position in the initial frame to be marked manually and cannot track well when the target is occluded by a foreign object or lost, feedback from the tracking result must be used during target detection to decide whether to update the model. The peak and the fluctuation of the response map reveal, to some extent, the confidence of the tracking result, so two confidence indicators are introduced: the maximum response value F_max and the average peak-to-correlation energy (APCE). In general, the larger F_max is, the better the tracking result; APCE reflects the degree of fluctuation of the response map and the confidence of the detected target:

APCE = | F_max - F_min |² / mean( Σ_{w,h} ( F_{w,h} - F_min )² )

where F_max and F_min are the maximum and minimum values of the response and F_{w,h} is the value of the response map at position (w, h). When the detected target closely matches the correct target, the response map has a single sharp peak and is smooth in all other regions; the sharper the correlation peak, the larger APCE becomes and the higher the localization accuracy. If the target is occluded or lost, APCE drops significantly. If F_max and APCE of the current frame are both greater than the ratios β_1 and β_2 (β_1 = 0.7, β_2 = 0.45), the tracking result of the current frame is considered highly reliable and the model is updated; otherwise face detection must be performed on the current frame again.
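The following Python sketch computes APCE and applies a high-confidence update check. The use of the historical averages of F_max and APCE as the reference values for the ratios β_1 and β_2 is an assumption made for illustration; the text above only states that both quantities must exceed the corresponding ratios.

```python
import numpy as np

def apce(response):
    """Average peak-to-correlation energy of a correlation-filter response map."""
    f_max, f_min = float(response.max()), float(response.min())
    return (f_max - f_min) ** 2 / float(np.mean((response - f_min) ** 2))

def should_update(response, fmax_hist, apce_hist, beta1=0.7, beta2=0.45):
    # Decide whether the tracking result is reliable enough to update the model.
    f_max, a = float(response.max()), apce(response)
    if fmax_hist and apce_hist:
        ok = (f_max >= beta1 * np.mean(fmax_hist)
              and a >= beta2 * np.mean(apce_hist))
    else:
        ok = True                      # no history yet: accept the first result
    fmax_hist.append(f_max)
    apce_hist.append(a)
    return ok
```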
As shown in FIG. 2, the specific operation steps are as follows:

Step 2.21: take the face region and scale obtained by face detection as the initial position P_1 and scale S_1 of the target, and train the position correlation filter and the scale correlation filter on the face region to obtain the position model and the scale model.

Step 2.22: according to the target position P_{t-1} and scale S_{t-1} of the previous frame I_{t-1}, collect from the current frame I_t a feature sample whose size is twice that of the previous frame's target; using this feature sample and the position model of the previous frame, calculate the maximum response value of the position correlation filter according to equation (22) to obtain the new target position P_t.

Step 2.23: with the new target position P_t as the center point, use the one-dimensional scale correlation filter to obtain S candidate samples of different scales according to the scaling rule, and extract d-dimensional features from each candidate sample to obtain the feature sample of the current frame; then, using this feature sample and the scale model, calculate the response values of the 1 × S-dimensional scale correlation filter according to equation (22); the scale corresponding to the maximum response value is the final target scale S_t.

Step 2.24: if the maximum response value and the average peak-to-correlation energy of the correlation filters in the current frame both satisfy the update-strategy condition, extract the features f_t^trans and f_t^scale from the current frame I_t according to the position P_t and scale S_t and update the position model and the scale model according to equation (20); otherwise perform face detection again on the current frame I_t.
3. Locate the feature points of the human face with a cascaded regression tree algorithm, locate the eye region and the mouth region from the located feature points, identify the eye state with an SVM (support vector machine) classifier, and identify the mouth state by calculating the mouth aspect ratio.
The face key-point detection method based on the cascaded regression trees (ERT) algorithm learns local features of each key point, combines those features and detects the key points with linear regression. The ERT algorithm is a cascaded-regression-tree face key-point localization algorithm proposed by Kazemi and Sullivan; it uses a model of 68 labeled facial key points, as shown in FIG. 7, and provides a general framework based on gradient boosting for learning a cascade of regression trees that estimates the facial landmark positions directly from a sparse subset of pixel intensities. The algorithm comprises two stages: building the model through training, and fitting the model.
Firstly, establishing a model
The algorithm uses two layers of regression to build the mathematical model. The first-layer regression iterates as

S^(t+1) = S^(t) + r_t( I, S^(t) )

where the shape vector S = (x_1^T, x_2^T, ..., x_p^T)^T represents the coordinates of all p facial landmarks in image I, and x_i ∈ R² is the (x, y) coordinate of the i-th facial landmark in image I. S^(t) is the shape vector of feature-point coordinates predicted at the t-th iteration and S^(t+1) is the prediction of the (t+1)-th iteration. Each regressor r_t in the cascade predicts an update vector from the image: its input is the current training image and shape vector, and its output is the position update for all key points. In this layer of the cascade, every pass through a first-level regressor updates the positions of all key points once, giving more accurate positions.
The second layer of regression is the iteration inside the regressor r_t. Assume a training data set {(I_1, S_1), ..., (I_n, S_n)}, where n is the number of samples, I_i is a face image and S_i is the shape vector of the face key-point positions of image I_i. To learn the regression functions r_t of the cascade, triplets (I_{π_i}, Ŝ_i^(t), ΔS_i^(t)) are created from the training data, where I_{π_i} is a face image from the data set, Ŝ_i^(t) is the key-point shape vector predicted for the i-th triplet at the t-th iteration of the first-level cascaded regression, and ΔS_i^(t) is the difference between the true value and the prediction:

Ŝ_i^(t+1) = Ŝ_i^(t) + r_t( I_{π_i}, Ŝ_i^(t) )
ΔS_i^(t+1) = S_{π_i} - Ŝ_i^(t+1)

This process is iterated through the above equations until the cascade of T regressors r_0, r_1, ..., r_{T-1} has been learned.
For the training data {(I_{π_i}, Ŝ_i^(t), ΔS_i^(t))} and a learning rate 0 < ν < 1, the regression function r_t is learned with a gradient tree boosting algorithm using a sum-of-squared-error loss, as follows:

(1) Initialize

f_0( I, Ŝ ) = argmin_γ Σ_{i=1}^{N} ‖ ΔS_i^(t) - γ ‖²

and then iterate over k = 1, ..., K:

(2) Fit a regression tree to the residuals r_ik to obtain the weak regression function g_k( I, Ŝ ), where, for i = 1, ..., N, the residual r_ik is

r_ik = ΔS_i^(t) - f_{k-1}( I_{π_i}, Ŝ_i^(t) )

(3) Update with the obtained weak regression function:

f_k( I, Ŝ ) = f_{k-1}( I, Ŝ ) + ν · g_k( I, Ŝ )

(4) Repeat steps (2) and (3) until K iterations have been performed, giving f_K( I, Ŝ ).

(5) The learned regression function is r_t( I, Ŝ ) = f_K( I, Ŝ ).
Model fitting
Obtaining a regression model through K iterations, wherein the specific steps of model fitting are as follows:
(1) and initializing a feature point shape vector of each face image, wherein the initial shapes of all the images are the same.
(2) And establishing a feature pool, randomly selecting two points in the feature pool, and calculating the pixel difference of each image at the two points according to the shape of the feature points of the image.
(3) Construct the regression tree. A splitting threshold is generated randomly; if the pixel difference of an image is smaller than the threshold it is split to the left, otherwise to the right, and all images are split in this way into a left part and a right part. The process is repeated several times, and the optimal node θ is obtained by minimizing the squared error with the objective function

E( Q, θ ) = Σ_{s ∈ {l, r}} Σ_{i ∈ Q_{θ,s}} ‖ r_i - μ_{θ,s} ‖²

where θ is the candidate node, l and r denote the left and right subtrees, and μ_{θ,s} is the mean of the residuals produced by the current partition. After the optimal node is obtained, the coordinate values of the two feature points and the splitting threshold are stored. This step is then repeated for each node split until a leaf node is reached.
(4) Compute the residual of each leaf node. For every image the difference between its current shape and its real shape is calculated; the differences of all images falling into the same leaf node are averaged and the resulting residual is stored in that leaf node.
(5) Update the shape of each image. The current shape S is updated to the current shape plus the residual, i.e. S ← S + ΔS.
(6) Repeat steps (2) to (4) until the finally obtained feature-point shape vector represents the real shape.
Dlib is a cross-platform open-source library that provides implementations of many machine learning, deep learning and image processing algorithms. Because Dlib implements the ERT algorithm and ships a face key-point detector trained on the iBUG 300-W data set that can locate the 68 feature points on any face, the invention uses the Dlib implementation for face key-point detection. The experimental result is shown in fig. 8; it shows that the ERT algorithm is robust to different facial expressions and head orientations and locates the facial feature points well at different angles.
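As an illustration of this usage, the sketch below runs Dlib's pre-trained 68-point predictor on a single video frame. The model file name is the one commonly distributed with Dlib, and the use of OpenCV for the grayscale conversion is an assumption; the patent itself only states that the Dlib implementation is used.

# Sketch: locating the 68 facial landmarks with Dlib's ERT-based predictor.
# The model file name is the one distributed with Dlib (an assumption here).
import cv2
import dlib

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def face_landmarks(frame):
    """Return a list of 68 (x, y) landmark tuples for the first detected face."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = detector(gray, 0)          # upsample 0 times for speed
    if not faces:
        return None
    shape = predictor(gray, faces[0])  # fit the ERT cascade inside the face box
    return [(shape.part(i).x, shape.part(i).y) for i in range(68)]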
To locate the eyes simply and quickly, the invention locates the eye region from the positions of the eye feature points obtained by the facial key-point detection.
As shown in the face feature-point model of fig. 9 a), the position of each feature point is identified by its serial number; for example, the left eye corresponds to serial numbers 36-41 and the right eye to serial numbers 42-47. The left and right eye regions extracted from the serial numbers of the eye feature points are the rectangular regions shown in fig. 9 b). The positioning calculation rule is as follows:
The width W and height H of the localized eye region are computed from W_e and H_e according to equation (32), where W_e is the horizontal distance between eye feature points 36 and 39, and H_e is the average of the vertical distances between feature points 37 and 41 and between feature points 38 and 40.
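A minimal sketch of this eye-region localization for the left eye is shown below. Because the exact scaling coefficients of equation (32) are not reproduced in the text, the margin factors k_w and k_h are purely illustrative assumptions.

# Sketch: cropping the left-eye region from landmarks 36-41.
# The exact width/height scaling of equation (32) is not reproduced in the text,
# so the margin factors below are illustrative assumptions.
def left_eye_box(pts, k_w=1.4, k_h=2.0):
    """pts: list of 68 (x, y) landmarks. Returns (x, y, w, h) of the eye rectangle."""
    w_e = abs(pts[39][0] - pts[36][0])                     # horizontal span, points 36-39
    h_e = (abs(pts[41][1] - pts[37][1]) +
           abs(pts[40][1] - pts[38][1])) / 2.0             # mean vertical span, 37-41 and 38-40
    w, h = k_w * w_e, k_h * h_e                            # assumed scaling of W_e, H_e
    cx = (pts[36][0] + pts[39][0]) / 2.0
    cy = (pts[37][1] + pts[38][1] + pts[40][1] + pts[41][1]) / 4.0
    return int(cx - w / 2), int(cy - h / 2), int(w), int(h)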
In order to recognize the open/closed state of the eyes accurately and quickly, the eye aspect ratio (EAR) is calculated. The EAR varies little between individuals while the eyes are open, and it is invariant to uniform scaling of the image and to rotation of the face. For the 6 key points (P1-P6) detected on the left eye in the open and closed states shown in fig. 10, the eye aspect ratio is calculated as:
EAR = (||P2 − P6|| + ||P3 − P5||) / (2 · ||P1 − P4||)
wherein the numerator represents the euclidean distance between the eye vertical feature points and the denominator is the euclidean distance between the eye horizontal feature points.
Taking the left eye as an example, according to the six feature points, the euclidean distances between the vertical key points and between the horizontal key points can be calculated, and the calculation formula of the euclidean distances between the two points is as follows:
Dis(P_a, P_b) = sqrt((P_a.x − P_b.x)^2 + (P_a.y − P_b.y)^2)
where P_a.x and P_a.y are the x and y coordinates of point P_a. The horizontal and vertical Euclidean distances of the eye can then be expressed as
Eye_h = Dis(P1, P4) (35)
Eye_v = Mean(Dis(P2, P6), Dis(P3, P5)) (36)
Where Mean (A, B) represents taking the average of A and B. The aspect ratio of the eye at this time can be expressed as:
EAR = Eye_v / Eye_h (37)
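As a brief illustration, the EAR of one eye can be computed from its six landmarks as follows, with the point ordering assumed to follow fig. 10.

# Sketch: eye aspect ratio from the six eye landmarks P1..P6.
from math import dist  # Python 3.8+

def eye_aspect_ratio(p):
    """p: list of six (x, y) points P1..P6 ordered as in fig. 10."""
    eye_v = (dist(p[1], p[5]) + dist(p[2], p[4])) / 2.0   # mean vertical distance
    eye_h = dist(p[0], p[3])                              # horizontal distance
    return eye_v / eye_h

The per-frame feature of equation (38) is then simply the mean of the left-eye and right-eye values.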
According to equation (37), the aspect ratios of the left and right eyes were calculated for 200 consecutive video frames. The EAR value remains substantially constant while the eyes are open and drops to approximately zero when the eyes are closed. Since both eyes open and close essentially synchronously, the average EAR of the two eyes is taken as the feature for open/closed-eye recognition in order to identify the eye state more accurately:
EAR = Mean(EAR_left, EAR_right) (38)
Eye-state recognition is performed according to the above formula by calculating the mean EAR of the two eyes; the result is shown in fig. 11. When a blink occurs, the EAR value decreases rapidly to near 0 and then rises slowly back towards the normal open-eye EAR value. Based on this behaviour, the EAR value can be used as a feature for identifying the open/closed eye state, and blink detection can also be performed from it.
After the eye region has been located, a locally adaptive threshold algorithm is used to binarize the eye image; after a morphological opening operation and median filtering, the contour and details of the eye are presented more clearly. When the eye is closed, the large dark pupil region disappears, although dark regions such as eyelashes and eyelids may still remain. The number of black pixels in the binary image therefore drops sharply when the eyes are closed compared with when they are open. However, the number of black pixels also varies with the distance between the eyes and the camera: as the distance increases, the eye occupies a smaller area in the image and the number of black pixels decreases. Fig. 12 shows the number of black pixels in the eye region while the right eye opens and closes. A threshold could be set to distinguish open from closed eyes from the 57th frame onwards, but from the 109th frame, when the eye moves farther from the camera, the number of black pixels decreases regardless of whether the eye is open or closed, and the open/closed state can no longer be judged from such a threshold.
To reduce the influence of the eye-to-camera distance, the eye images are first normalized to the same size and the difference in the number of black pixels between two consecutive frames is calculated. An eye-closing action is normally observed over more than two consecutive frames, so when the difference remains below 0 for more than two frames, the consecutive differences are accumulated and a threshold on the accumulated difference is set to judge the open/closed state. However, as can be seen from fig. 13 and fig. 14, at frame 54 the difference is greater than 0 and is not accumulated, so the frame is wrongly recognized as an open-eye state.
Therefore, to solve this problem, the present invention accumulates the difference using an adaptive threshold method. Two states of "state 0" and "state 1" are defined, and when the difference value of the black pixels of the binarized image of the human eye region is smaller than 0, the state is changed from "state 0" to "state 1". In the state 1, if the difference is smaller than a threshold value T (t), accumulating the difference and keeping the state unchanged; if the difference is greater than the threshold T (t), no difference is accumulated and the state changes to "state 0".
The accumulated black-pixel difference F_black of the binarized eye-region image under the adaptive threshold is computed, for the t-th frame in "state 1", as
F_black(t) = D(t−1) + Δn(t), if Δn(t) < T(t); F_black(t) = 0 and the state returns to "state 0" otherwise, (39)
where n(t) is the number of black pixels of the t-th frame, Δn(t) is the difference in the number of black pixels between the t-th frame and the (t−1)-th frame, D(t−1) is the accumulated black-pixel difference of the (t−1)-th frame in "state 1", and T(t) is the adaptive threshold derived from D(t−1) and a constant α between 0 and 1, whose optimal value is determined by the accuracy of open/closed-eye detection.
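A minimal sketch of this two-state accumulation is given below. Since the exact form of the adaptive threshold T(t) in equation (39) is not reproduced in the text, T(t) = α · |D(t−1)| is used here purely as an illustrative assumption.

# Sketch of the two-state accumulation of black-pixel differences.
# The exact adaptive threshold T(t) of equation (39) is not reproduced in the text;
# T(t) = alpha * abs(D_prev) is an assumption used here for illustration.
class BlackPixelAccumulator:
    def __init__(self, alpha=0.5):
        self.alpha = alpha
        self.state = 0        # "state 0": nothing accumulated
        self.d_prev = 0.0     # accumulated difference D(t-1)

    def update(self, delta_n):
        """delta_n: black-pixel difference between frame t and t-1. Returns F_black(t)."""
        if self.state == 0:
            if delta_n < 0:                       # black pixels start dropping: enter "state 1"
                self.state = 1
                self.d_prev = delta_n
            else:
                self.d_prev = 0.0
        else:
            t_adaptive = self.alpha * abs(self.d_prev)   # assumed adaptive threshold T(t)
            if delta_n < t_adaptive:
                self.d_prev += delta_n            # keep accumulating in "state 1"
            else:
                self.state = 0                    # large positive jump: back to "state 0"
                self.d_prev = 0.0
        return self.d_prev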
Frame 54 can then be correctly identified as a closed-eye frame, because the adaptive threshold T(t) changes with the accumulated difference at frame t−1. Fig. 15 shows the accumulated black-pixel difference of the binary eye image computed with the adaptive threshold; it can be seen that this method identifies the closed-eye state well.
To recognize the open and closed states of the eyes more accurately, the eye aspect ratio and the accumulated black-pixel difference are used as input parameters of an SVM classifier, and the trained classifier is used to recognize the eye state in the image. The SVM is a supervised machine learning algorithm for two-class problems; its essence is to find the separating hyperplane with the largest margin from the classification sample points, so that the margin between the positive and negative training samples is maximized. The algorithm can be used for classification and regression analysis and copes well with small-sample, nonlinear and high-dimensional problems.
The method uses the SVM classifier for binary classification and mainly comprises five parts: data selection, data processing, feature parameter normalization, model training and testing.
(1) Data selection
2000 open-eye samples and 1000 closed-eye samples were selected from the 80 videos of the ZJU blink video data set; 2000 open-eye samples and 1000 closed-eye samples were selected from the NTHU driver fatigue detection video data set; and 2000 open-eye samples and 4000 closed-eye samples were collected by the authors, giving a total of 6000 open-eye and 6000 closed-eye sample images, with and without glasses, each sample containing a human face.
(2) Data processing
First, the face key points are located in each sample; then the eye aspect ratio and the accumulated black-pixel difference are calculated, i.e. two feature values are extracted from each sample.
Computing characteristic value EAR
Since the eye aspect ratio EAR is completely invariant to uniform scaling and rotation of the image, for each sample the mean aspect ratio of the two eyes is calculated directly according to equation (38), after the eye key points have been located, and is taken as the first feature value F_1 of the sample.
Calculating the cumulative difference of human eye black pixels
Because the number of black pixels in the eye region changes with the distance between the eyes and the camera, for each sample the right-eye region is located according to equation (32), the eye region is scaled to the same size, and the black-pixel count of the eye region is then calculated. For the sample data of different experimenters, the right-eye black-pixel count in the half-open-eye state of each experimenter is used as the reference value for the first frame: the accumulated black-pixel difference of the experimenter's first frame is its black-pixel count minus the black-pixel count of the half-open-eye state, and the accumulated black-pixel differences of the experimenter's remaining sample data are accumulated according to equation (39). The accumulated black-pixel difference is taken as the second feature value F_2 of each sample.
The open-eye and closed-eye samples are processed separately. For each closed-eye sample image, the two feature values obtained as above are written to a corresponding text file, with one sample per row and one feature value per column; each open-eye sample image is processed in the same way as the closed-eye samples.
(3) Feature parameter normalization
Because the two types of feature parameters extracted from each sample differ in magnitude, the parameter with the smaller values would contribute little during model training. To balance the weight of each feature parameter in training, the data of the two feature parameters are normalized:
y_i = 2(x_i − x_min) / (x_max − x_min) − 1 (40)
where y_i is the normalized value, lying in the interval [−1, 1]; x_i is the original feature value; x_max and x_min are respectively the maximum and minimum of the x_i; and N is the number of training samples.
After the text data of the sample feature values has been obtained in step (2), the feature values in the two files are read into a two-dimensional array in which each row is a sample and each column is a feature value, and the class label of each sample is stored in a label array. The maximum x_max and minimum x_min of each column of the array are computed, and for every column the normalized value y_i of each feature value x_i is calculated by equation (40); when the whole array has been processed, the resulting two-dimensional array contains the normalized values of all sample feature values.
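A short sketch of this column-wise normalization, under the reconstruction of equation (40) given above, might look as follows.

# Sketch: column-wise min-max normalisation of the feature columns to [-1, 1].
import numpy as np

def normalise_features(x):
    """x: (N, 2) array of raw feature values; returns the normalised array."""
    x = np.asarray(x, dtype=float)
    x_min = x.min(axis=0)
    x_max = x.max(axis=0)
    return 2.0 * (x - x_min) / (x_max - x_min) - 1.0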
(4) Model training and parameter optimization
The SVM classifier can be expressed as:
f(x) = sgn( Σ_(i=1)^(N) α_i y_i K(x, x_i) + b )
where N is the number of training samples; y_i ∈ {−1, 1} is the class label of training sample i, with 1 denoting closed eyes and −1 denoting open eyes; K(x, x_i) is the kernel function; the constant b is the bias term; and the coefficients α_i are obtained by solving a quadratic programming problem with linear constraints.
SVMs commonly use four kernel functions: the linear kernel (LINEAR), the polynomial kernel (POLY), the radial basis function kernel (RBF) and the sigmoid kernel. A suitable kernel must be chosen before classifier training; since the RBF kernel can handle a nonlinear relationship between the features and the class labels, the RBF kernel is adopted for model training. The RBF kernel involves two undetermined variables: the penalty coefficient C of the loss function and the kernel parameter γ, which controls the linear separability after the nonlinear problem has been mapped to a high-dimensional space. The choice of these two variables has a decisive effect on prediction accuracy.
To search for the optimal penalty coefficient C and kernel parameter γ and improve prediction accuracy, the parameters C and γ are optimized with K-fold cross-validation (K-CV). Of the 12000 groups of collected feature values, 8000 groups are divided evenly into 10 folds; each time, 9 folds are used as the training set and the remaining fold as the validation set. The feature values of the training and validation sets are normalized as in step (3) and the class labels are stored in the corresponding label arrays. The optimization finds that the model predicts and classifies best when C = 2.04 and γ = 0.9.
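The sketch below illustrates this training and parameter search. The patent does not name an SVM library, so the use of scikit-learn, the grid of candidate C and γ values and the accuracy scoring are all assumptions; only the RBF kernel and the 10-fold cross-validation come from the text.

# Sketch: training the RBF-kernel SVM and tuning C and gamma with 10-fold
# cross-validation. scikit-learn is an assumption; the patent does not name a library.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

def train_eye_classifier(features, labels):
    """features: (N, 2) normalised EAR / black-pixel features; labels: 1 closed, -1 open."""
    grid = GridSearchCV(
        SVC(kernel="rbf"),
        param_grid={"C": np.logspace(-2, 4, 13), "gamma": np.logspace(-3, 2, 11)},
        cv=10,                      # K-CV with 10 folds, as in the text
        scoring="accuracy",
    )
    grid.fit(features, labels)
    print("best C =", grid.best_params_["C"], "best gamma =", grid.best_params_["gamma"])
    return grid.best_estimator_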
(5) Experimental detection
(ii) evaluation of parameters in an experiment
In order to evaluate the performance of the training model for predicting the eye opening and closing state, the Accuracy (Accuracy), Precision (Precision) and Recall (Recall) are selected as evaluation parameters. For each sample of the test set, the results of the identification may appear as follows:
TP (True Positive): the test sample is predicted to be in the closed-eye state and is actually in the closed-eye state.
FP (False Positive): the test sample is predicted to be in the closed-eye state but is actually in the open-eye state.
TN (True Negative): the test sample is predicted to be in the open-eye state and is actually in the open-eye state.
FN (False Negative): the test sample is predicted to be in the open-eye state but is actually in the closed-eye state.
The three evaluation parameters were calculated as follows:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
experimental results and analysis
The remaining 4000 sets of data from the sample data were selected for testing the open-closed eye status, and the test results are shown in the following table.
Table 4 open/close eye state detection results
As can be seen from table 4, the proposed method identifies the open/closed eye state with high accuracy. Table 5 compares the recognition results of different algorithms; the experiments show that training the classifier on the fused features gives a higher recognition accuracy for the open/closed eye state than single-feature eye-state recognition methods.
TABLE 5 comparison of recognition results of different algorithms
According to the face feature-point positioning, the mouth feature points have serial numbers 48-67, so the mouth can be located and its state identified from these serial numbers, as shown in fig. 16.
The mouth state is judged by calculating the mouth aspect ratio (MAR). To make the MAR value more accurate, the 10 feature points marked P1-P10 in fig. 16 are used to calculate the MAR; the Euclidean distance is calculated as in expression (43).
The MAR is computed from P1-P10 as the ratio of the vertical distances between the upper- and lower-lip feature points to the horizontal distance between the mouth corners.
Under normal driving conditions the mouth is closed; when the driver talks to someone the lips open and close continuously with a small opening amplitude; when the driver yawns from fatigue the mouth opens widely and for a long time. To distinguish mouth states such as speaking and yawning, the states were simulated using the aspect-ratio-based method; the detection result is shown in fig. 17. From fig. 17, the mouth is closed when MAR ≤ 0.4, in a normal speaking state when 0.4 < MAR ≤ 0.8, and in a yawning state when MAR > 0.8. From this analysis, the MAR can be used as a feature to identify the mouth state.
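These thresholds translate directly into a small decision rule, sketched below.

# Sketch: mouth state from the mouth aspect ratio, using the thresholds of fig. 17.
def mouth_state(mar):
    if mar <= 0.4:
        return "closed"
    elif mar <= 0.8:
        return "speaking"
    else:
        return "yawning"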
4. Eye fatigue parameters and mouth fatigue parameters are calculated from the eye state and the mouth state respectively, and head fatigue parameters are calculated from the position information of the located feature points.
Fatigue parameters are extracted from the states of the eyes, mouth and head and then combined to establish a fatigue-state recognition model; the driver's fatigue state is judged from a multi-feature weighted sum. The main extracted parameters include the eye-closure frame ratio (ECR), blink frequency (BF), maximum continuous eye-closure time (MECT), yawning frequency (YF) and nodding frequency (NF).
4.1 extraction of eye fatigue information
When a person is fatigued, the blink frequency increases, the eye-closure time lengthens and yawning appears; in severe cases the person may doze off. According to research, a person normally blinks 10 to 25 times per minute and each eye closure lasts about 0.2 s. Based on these phenomena, the invention selects ECR based on the PERCLOS criterion, MECT and BF, the three eye indexes that best reflect the fatigue state, as the eye fatigue feature parameters.
(1) ECR based on PERCLOS criterion
The PERCLOS criterion is recognized as the most effective and reliable criterion for fatigue-driving detection; it calculates the percentage of time within a period during which the eyes are closed. Depending on how eye closure is defined, the criterion has three judgment standards: EM, P70 and P80. Among them P80, the proportion of time during which the eyelid covers more than 80% of the pupil area, is the most suitable for identifying fatigued driving. Because it is difficult to measure the eyelid coverage of the pupil accurately in practice, and the closed-eye state is already judged reliably as described above, the percentage of closed-eye frames in the total number of frames of the period (the eye-closure frame ratio, ECR) is taken as the eye feature parameter:
ECR = n_T / N_T × 100%
where n_T is the number of closed-eye frames in the time period and N_T is the total number of frames in the time period.
(2) Maximum duration of eye closure
Maximum eye-closure time (Max Eye Close Time, MECT): the duration from the moment the eyes are completely closed until they are completely open again, i.e. the time elapsed from t2 to t4 in fig. 18. In a fatigued state the eye-closure time often exceeds 1.5 s. If the video runs at f frames per second and the number of consecutive closed-eye frames in the time period is K_c, the continuous eye-closure time within one time period is:
t_close = K_c / f
if the continuous eye closing time in the time period exceeds the threshold value, the characteristic parameter is regarded as a fatigue state.
(3) Blink frequency
Blink frequency (BF): the number of blinks per unit time. One blink lasts from t1 to t4 in fig. 18. An awake person blinks on average about 10-25 times per minute; the number of blinks increases with fatigue but decreases with distraction or severe fatigue. The number of blinks in the time period is therefore counted, and if it falls outside the normal range this feature parameter indicates a fatigue state.
Blink detection can be performed from the EAR value. The EAR calculation results show that during one blink the EAR decreases until it approaches zero and then gradually rises back to the normal open-eye value. Let E_eye be the EAR threshold and K_eye the threshold on the number of consecutive frames with EAR < E_eye that counts as one blink. When the EAR drops below E_eye the eye begins to close, and when it rises back above E_eye towards the normal open-eye value the eye is fully open again. The number F_eye of consecutive frames with EAR < E_eye is counted; when the EAR returns to a value not less than E_eye and F_eye is greater than the set frame-count threshold K_eye, one blink is recorded.
To find the optimal thresholds E_eye and K_eye, experiments were carried out on the ZJU blink data set. The 80 videos in ZJU cover four types of clips: frontal video without glasses, frontal video with thin-framed glasses, frontal video with black-framed glasses, and upward-looking video without glasses, with 20 groups of videos of each type and one to six blinks per video, for a total of 255 blinks in the data set.
Based on the results in fig. 19, when extracting the blink-frequency eye fatigue parameter, one blink is recorded when the number of consecutive frames with EAR below the threshold E_eye exceeds the threshold K_eye and the EAR then rises back above the threshold; the number of blinks counted in the time period is the blink frequency.
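A minimal sketch of this blink-counting rule is given below; the specific values of E_eye and K_eye are placeholder assumptions, since the text only states that they were tuned on the ZJU data set.

# Sketch of EAR-based blink counting. E_eye and K_eye are the thresholds tuned on the
# ZJU data set in the text; the values used below are placeholder assumptions.
def count_blinks(ear_values, e_eye=0.2, k_eye=2):
    blinks = 0
    consecutive = 0                     # F_eye: consecutive frames with EAR < E_eye
    for ear in ear_values:
        if ear < e_eye:
            consecutive += 1
        else:
            if consecutive > k_eye:     # eye reopened after a long enough closure
                blinks += 1
            consecutive = 0
    return blinks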
Taking 60 s as a time period, the eye state within the period is statistically analysed to obtain the statistics of the eye fatigue features. The awake state is denoted by 0 and the fatigue state by 1; the maximum eye-closure time is denoted mect, the eye-closure frame ratio ecr and the number of blinks bf. The fatigue thresholds of the three eye fatigue feature values, obtained from experiments and the related literature, are shown in table 6 below:
TABLE 6 evaluation conditions for eye fatigue
4.2 mouth fatigue parameter extraction
When the driver is drowsy, he yawns repeatedly and the mouth stays open for about 6 seconds each time; at this point the driver should stop and rest and is not fit to continue driving. Based on this phenomenon, the number of yawns within a time period can be detected to assess whether the driver is fatigued. From the foregoing, one yawn is counted when the mouth aspect ratio MAR exceeds 0.7 for 15 consecutive frames. As shown in fig. 20, the time difference from t1 to t4 is the duration of one yawn, and whether a yawn occurs is detected when the mouth opening exceeds the threshold. With 0 denoting the normal state and 1 the fatigue state, the value of the mouth fatigue state is determined as follows:
V_YF = 1, if the number of yawns yf in the time period reaches N and the duration yt of a single yawn reaches t; V_YF = 0, otherwise,
where yf is the number of yawns, yt is the duration of one yawn, N is 3 and t is 4 s.
4.3 head fatigue parameter extraction
When a person is drowsy, his reactions slow down and his control of the head weakens, causing the head to droop. To stay awake he repeatedly raises his head again, so the head moves down and up repeatedly. When this occurs frequently the driver is fatigued and a traffic accident may happen at any time; detecting the nodding frequency during driving is therefore central to head-motion analysis and an important factor in fatigue-driving detection. The driver can be considered fatigued when the nodding frequency within a time period exceeds a certain threshold.
Based on the position information of the eye feature points, and considering both real-time performance and accuracy, the midpoint of the line connecting the centres of the two eyes is taken as the head-position detection point, and the nodding frequency in a time period is calculated from the change of the detection point's vertical coordinate y over time. Fig. 21 shows the relationship between the y value and the frame number while the driver is dozing.
The algorithm proceeds as follows: when the number of video frames is large, the y-coordinate sequence can be approximately fitted by a curve and its extreme points computed; the extreme points divide the curve into several monotone segments. The number of minimum points of the monotonically decreasing segments whose y value exceeds the initial position by more than 50 pixels within the time period, a value determined experimentally, is counted as the number of nods nf; if the curve has no minimum point, it is checked whether the curve decreases monotonically, in which case nf is set to 1, and otherwise to 0. The value of NF is given by formula (47):
V_NF = 1, if nf > N; V_NF = 0, otherwise (47)
If the number of nods nf in the time period is greater than the threshold N, the NF fatigue feature value is 1; otherwise it is 0. Experiments show that the fatigue-state detection accuracy is highest with N = 8.
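A simple sketch of this nod counting is shown below. A discrete local-minimum search over the per-frame y values stands in for the curve fitting described in the text; the 50-pixel offset and the threshold N = 8 come from the description above.

# Sketch: counting nods from the per-frame y coordinate of the head detection point.
def count_nods(y, offset=50, n_threshold=8):
    y0 = y[0]
    minima = [i for i in range(1, len(y) - 1)
              if y[i] < y[i - 1] and y[i] < y[i + 1] and y[i] > y0 + offset]
    if minima:
        nf = len(minima)
    else:
        # no local minimum: count one nod only if the whole curve decreases monotonically
        nf = 1 if all(b <= a for a, b in zip(y, y[1:])) else 0
    return nf, int(nf > n_threshold)    # (nod count, V_NF fatigue value)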
5. The driver's fatigue state is identified, and an early warning is given, according to the eye fatigue parameters, the mouth fatigue parameters and the head fatigue parameters.
According to the fatigue feature indexes of the eyes, mouth and head, weights are assigned according to their accuracy for fatigue judgment, and the weighted sum of the feature parameters is calculated as:
E_fatigue = V_ECR × W_1 + V_MECT × W_2 + V_BF × W_3 + V_NF × W_4 + V_YF × W_5 (48)
where E_fatigue is the weighted fatigue value, V_ECR is the fatigue value of the eye-closure frame ratio, V_MECT the fatigue value of the maximum continuous eye-closure time, V_BF the fatigue value of the blink frequency, V_NF the fatigue value of the nodding frequency, V_YF the fatigue value of the yawning frequency, and W_i are the weights corresponding to the different parameters, which satisfy
W_1 + W_2 + W_3 + W_4 + W_5 = 1.
carrying out experiment optimization through simulating fatigue, determining respective weight values of five fatigue characteristic parameters of eyes, mouths and heads, wherein the weight values of the corresponding characteristics are as follows: w1=0.2,W2=0.1,W3=0.2,W4=0.2,W5=0.3。
According to the weighted fatigue value, the state is divided into three grades: awake, fatigued and severely fatigued. The weighted values of the fatigue feature parameters are mapped to the fatigue grades, and the driver's driving state is judged from this correspondence, which is shown in table 7:
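The weighted fusion and grade mapping can be sketched as follows, using the optimal weights of equation (48) and the grade thresholds of 0.3 and 0.7 stated in claim 9.

# Sketch: weighted fusion of the five 0/1 fatigue values (equation (48)) and the
# grade mapping (awake < 0.3 <= fatigue < 0.7 <= severe fatigue, as in claim 9).
WEIGHTS = {"ECR": 0.2, "MECT": 0.1, "BF": 0.2, "NF": 0.2, "YF": 0.3}

def fatigue_grade(v):
    """v: dict with 0/1 fatigue values for keys ECR, MECT, BF, NF, YF."""
    e_fatigue = sum(WEIGHTS[k] * v[k] for k in WEIGHTS)
    if e_fatigue < 0.3:
        return e_fatigue, "awake"
    elif e_fatigue < 0.7:
        return e_fatigue, "fatigue"
    return e_fatigue, "severe fatigue"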
TABLE 7 fatigue value and fatigue grade corresponding relation table
To verify the performance of the method, verification experiments were carried out on a PC with a 64-bit operating system, using the Python programming language together with the OpenCV 2.4.13 and Dlib 18.17 libraries. The test data come from the NTHU driver fatigue detection video data set, which covers 5 different scenarios: daytime with glasses, daytime with sunglasses, daytime without glasses, night with glasses and night without glasses. Each scenario contains 16 groups of data, and each group contains awake, fatigued and severely fatigued states.
The driver's fatigue state is detected over 60 s periods. A total of 165 groups of data drawn from the 5 scenarios were used to find the optimal weight of each fatigue index, with each weight varied between 0.1 and 0.6. Table 8 shows the effect of different weight choices on the fatigue-state recognition accuracy; the data in table 8 show that the fatigue recognition rate is highest when the fatigue-index weights take the optimal values of formula (48). The fatigue-grade recognition accuracy is calculated as follows:
Accuracy = (number of videos whose fatigue grade is correctly recognized / total number of test videos) × 100%
TABLE 8 fatigue index weight optimization
With the optimal weights of the fatigue indexes selected, the fatigue state of the 75 videos in the remaining 5 groups of data of each scenario was recognized. Table 9 gives the fatigue recognition results under the different environments; table 10 gives the detailed feature-parameter values, fatigue values and corresponding recognition results for the 15 daytime-with-glasses videos.
TABLE 9 fatigue recognition results under different environments
TABLE 10 fatigue recognition results for wearing glasses in daytime
As can be seen from the tables above, the proposed fatigue recognition method is more accurate in the daytime than at night and less accurate when sunglasses are worn, but its overall recognition performance is good.
Table 11 lists the average per-frame running time of each module of the method. From table 11, the total running time is 159.5903 ms per frame, and once the face has been detected the running time is about 17.1003 ms per frame. When a face is falsely detected or the tracked target is lost, detection is immediately performed again on the next frame; even if misdetection occurs for 3-5 seconds within a time period, a processing speed of more than 30 frames per second can still be maintained, so the fatigue recognition method has good real-time performance.
TABLE 11 average run time of modules
The invention also provides a fatigue driving recognition system based on multiple characteristics, which comprises:
the image acquisition and processing unit is used for acquiring a video single-frame image in real time and preprocessing the video single-frame image; the face detection and tracking unit is used for carrying out face detection on the preprocessed video image by adopting an AdaBoost algorithm based on Haar-like characteristics and tracking the detected face in real time by adopting a target tracking algorithm based on a scale space; the positioning and state recognition unit is used for positioning the feature points of the human face, respectively positioning the eye region and the mouth region according to the positioned feature points, recognizing the eye state by adopting an SVM classifier, and recognizing the mouth state by calculating the aspect ratio of the mouth; the parameter calculation unit is used for calculating eye fatigue parameters and mouth fatigue parameters according to the eye state and the mouth state respectively and calculating head fatigue parameters according to the positioned feature point position information; and the fatigue state identification unit is used for identifying and early warning the fatigue state of the driver according to the eye fatigue parameter, the mouth fatigue parameter and the head fatigue parameter.
The above disclosure is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of changes or modifications within the technical scope of the present invention, and shall be covered by the scope of the present invention.

Claims (10)

1. A fatigue driving identification method based on multiple characteristics is characterized by comprising the following steps:
step 1: acquiring a video single-frame image in real time, and preprocessing the video single-frame image;
step 2: performing face detection on the preprocessed video image by adopting an AdaBoost algorithm based on Haar-like characteristics, and tracking the detected face in real time by adopting a target tracking algorithm based on a scale space;
step 3: positioning the feature points of the human face, respectively positioning an eye region and a mouth region according to the positioned feature points, identifying the eye state by adopting an SVM (support vector machine) classifier, and identifying the mouth state by calculating the aspect ratio of the mouth;
step 4: respectively calculating eye fatigue parameters and mouth fatigue parameters according to the eye state and the mouth state, and calculating head fatigue parameters according to the position information of the positioned feature points;
step 5: identifying and early warning the fatigue state of the driver according to the eye fatigue parameters, the mouth fatigue parameters and the head fatigue parameters.
2. The fatigue driving identification method according to claim 1, wherein in the step 1, the video single-frame image preprocessing process comprises:
step 1.1: carrying out smooth denoising processing on a video single-frame image;
step 1.2: and carrying out illumination compensation processing on the video image subjected to the smooth denoising processing.
3. The fatigue driving identification method according to claim 2, wherein in the step 1.1, the video image is subjected to smoothing and denoising processing by using adaptive median filtering.
4. The fatigue driving identification method according to claim 2, wherein in step 1.2, an illumination equalization algorithm based on a dynamic threshold is adopted to perform illumination compensation processing on the video images with different illumination shades.
5. The fatigue driving recognition method according to claim 1, wherein in the step 2, the specific operation steps of the AdaBoost algorithm for face detection are as follows:
step 2.11: calculating Haar-like characteristics of the image by using the integral graph;
step 2.12: for the Haar-like characteristics, selecting an optimal weak classifier through training iteration, and constructing the weak classifier into a strong classifier according to a weighted voting mode;
step 2.13: then connecting a plurality of strong classifiers obtained by training in series to form a cascade classifier with a cascade structure;
step 2.14: carrying out face detection on the image by adopting the cascade classifier.
6. The fatigue driving recognition method according to claim 1 or 5, wherein in the step 2, a target tracking algorithm based on a scale space is adopted to track the detected face in real time, and the specific operation steps are as follows:
step 2.21: taking the face region and scale obtained by the face detection as the initial target position P_1 and scale S_1, and training a position correlation filter and a scale correlation filter on the face region to obtain a position model and a scale model;
step 2.22: according to the target position P_(t-1) and scale S_(t-1) of the previous frame I_(t-1), collecting in the current frame I_t a feature sample of twice the size of the previous frame's target, and calculating the maximum response value of the position correlation filter from this feature sample and the position model of the previous frame to obtain the new target position P_t;
step 2.23: taking the determined new target position P_t as the centre point, obtaining S candidate samples of different scales according to a scaling rule with a one-dimensional scale correlation filter, extracting d-dimensional features from each candidate sample to obtain the feature sample of the current frame, and then calculating the response value of the 1 × S-dimensional scale correlation filter from this feature sample and the scale model, the scale corresponding to the maximum response value being the final target scale S_t;
step 2.24: if the maximum response value and the average peak correlation energy of the correlation filter in the current frame both satisfy the update-strategy condition, extracting the features f_t^trans and f_t^scale in the current frame I_t according to the position P_t and scale S_t and updating the position model and the scale model; otherwise, performing face detection again in the current frame I_t;
the update-strategy condition is that the maximum response value and the average peak correlation energy are respectively greater than the ratios β_1 and β_2, with β_1 being 0.7 and β_2 being 0.45;
preferably, the response value of the position or scale correlation filter is calculated as
y_t = F^(-1)( Σ_(l=1)^(d) A^l · Z^l / (B + λ) )
wherein F^(-1)(·) is the inverse discrete Fourier transform (DFT), y_t is the obtained response value, d-dimensional features are extracted from each pixel of the feature sample and the feature map of the l-th dimension is denoted f^l with l = 1, 2, ..., d, λ is the coefficient of the regularization term, A^l and B are respectively the numerator and the denominator of the filter updated in the previous frame, and Z^l is the two-dimensional DFT of each dimension of the feature map of the current frame image.
7. The fatigue driving recognition method according to claim 1, wherein in the step 3, a cascade regression tree-based algorithm is adopted to locate the feature points of the human face, wherein the feature points of the human face include eye feature points and mouth feature points.
8. The fatigue driving recognition method according to claim 1 or 7, wherein in the step 3, the specific operation of recognizing the eye state by using the SVM classifier is:
training the SVM classifier by taking the eye aspect ratio and the accumulated black-pixel difference of the binarized eye-region image as its input features, and then classifying and recognizing the eye state with the trained classifier; the accumulated black-pixel difference F_black of the binarized eye-region image is computed, for the t-th frame in "state 1", as F_black(t) = D(t−1) + Δn(t) when Δn(t) is smaller than the adaptive threshold T(t), and is reset to 0 otherwise,
where n(t) is the number of black pixels of the t-th frame, Δn(t) is the difference in the number of black pixels between the t-th frame and the (t−1)-th frame, D(t−1) is the accumulated black-pixel difference of the (t−1)-th frame in "state 1", and α is a constant between 0 and 1.
9. The fatigue driving identification method according to claim 1, wherein in the step 4, the eye fatigue parameters include a ratio of eye closure frame number, a blinking frequency, and a maximum duration eye closure time, the mouth fatigue parameters include a yawning frequency, and the head fatigue parameters include a nodding frequency; preferably, the fatigue state is identified by performing weighted summation on the eye fatigue parameter, the mouth fatigue parameter and the head fatigue parameter, and the specific weighted summation expression is as follows:
E_fatigue = V_ECR × W_1 + V_MECT × W_2 + V_BF × W_3 + V_NF × W_4 + V_YF × W_5
wherein E_fatigue is the weighted fatigue value, V_ECR is the fatigue value of the eye-closure frame ratio, V_MECT the fatigue value of the maximum continuous eye-closure time, V_BF the fatigue value of the blink frequency, V_NF the fatigue value of the nodding frequency, V_YF the fatigue value of the yawning frequency, and W_i are the weights corresponding to the different parameters, which satisfy W_1 + W_2 + W_3 + W_4 + W_5 = 1;
preferably, when the weighted fatigue value is less than 0.3, the state is the waking state; when the weighted fatigue value is more than or equal to 0.3 and less than 0.7, the state is a fatigue state; when the weighted fatigue value is 0.7 or more, the fatigue state is severe.
10. A multi-feature based fatigue driving recognition system, comprising:
the image acquisition and processing unit is used for acquiring a video single-frame image in real time and preprocessing the video single-frame image;
the face detection and tracking unit is used for carrying out face detection on the preprocessed video image by adopting an AdaBoost algorithm based on Haar-like characteristics and tracking the detected face in real time by adopting a target tracking algorithm based on a scale space;
the positioning and state recognition unit is used for positioning the feature points of the human face, respectively positioning the eye region and the mouth region according to the positioned feature points, recognizing the eye state by adopting an SVM classifier, and recognizing the mouth state by calculating the aspect ratio of the mouth;
the parameter calculation unit is used for calculating eye fatigue parameters and mouth fatigue parameters according to the eye state and the mouth state respectively and calculating head fatigue parameters according to the positioned feature point position information;
and the fatigue state identification unit is used for identifying and early warning the fatigue state of the driver according to the eye fatigue parameter, the mouth fatigue parameter and the head fatigue parameter.
CN202010338222.XA 2020-04-26 2020-04-26 Fatigue driving identification method and system based on multiple characteristics Pending CN111582086A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010338222.XA CN111582086A (en) 2020-04-26 2020-04-26 Fatigue driving identification method and system based on multiple characteristics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010338222.XA CN111582086A (en) 2020-04-26 2020-04-26 Fatigue driving identification method and system based on multiple characteristics

Publications (1)

Publication Number Publication Date
CN111582086A true CN111582086A (en) 2020-08-25

Family

ID=72114102

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010338222.XA Pending CN111582086A (en) 2020-04-26 2020-04-26 Fatigue driving identification method and system based on multiple characteristics

Country Status (1)

Country Link
CN (1) CN111582086A (en)

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111986443A (en) * 2020-08-31 2020-11-24 上海博泰悦臻网络技术服务有限公司 Fatigue driving monitoring device and method
CN112069359A (en) * 2020-09-01 2020-12-11 上海熙菱信息技术有限公司 Method for dynamically filtering abnormal data of comparison result of snapshot object
CN112183220A (en) * 2020-09-04 2021-01-05 广州汽车集团股份有限公司 Driver fatigue detection method and system and computer storage medium
CN112183215A (en) * 2020-09-02 2021-01-05 重庆利龙科技产业(集团)有限公司 Human eye positioning method and system combining multi-feature cascade SVM and human eye template
CN112528792A (en) * 2020-12-03 2021-03-19 深圳地平线机器人科技有限公司 Fatigue state detection method, fatigue state detection device, fatigue state detection medium, and electronic device
CN112528767A (en) * 2020-11-26 2021-03-19 天津大学 Machine vision-based construction machinery operator fatigue operation detection system and method
CN112528843A (en) * 2020-12-07 2021-03-19 湖南警察学院 Motor vehicle driver fatigue detection method fusing facial features
CN113040757A (en) * 2021-03-02 2021-06-29 江西台德智慧科技有限公司 Head posture monitoring method and device, head intelligent wearable device and storage medium
CN113076884A (en) * 2021-04-08 2021-07-06 华南理工大学 Cross-mode eye state identification method from near infrared light to visible light
CN113197573A (en) * 2021-05-19 2021-08-03 哈尔滨工业大学 Film watching impression detection method based on expression recognition and electroencephalogram fusion
CN113240885A (en) * 2021-04-27 2021-08-10 宁波职业技术学院 Method for detecting fatigue of vehicle-mounted driver
CN113780164A (en) * 2021-09-09 2021-12-10 福建天泉教育科技有限公司 Head posture recognition method and terminal
CN113838265A (en) * 2021-09-27 2021-12-24 科大讯飞股份有限公司 Fatigue driving early warning method and device and electronic equipment
CN113978475A (en) * 2021-09-22 2022-01-28 东风汽车集团股份有限公司 Control method and system for automatic driving intervention during fatigue driving of driver
CN115641542A (en) * 2022-12-23 2023-01-24 腾讯科技(深圳)有限公司 Data processing method and device and storage medium
CN117523521A (en) * 2024-01-04 2024-02-06 山东科技大学 Vehicle detection method based on Haar features and improved HOG features

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102622600A (en) * 2012-02-02 2012-08-01 西南交通大学 High-speed train driver alertness detecting method based on face image and eye movement analysis
CN106372621A (en) * 2016-09-30 2017-02-01 防城港市港口区高创信息技术有限公司 Face recognition-based fatigue driving detection method
CN107578008A (en) * 2017-09-02 2018-01-12 吉林大学 Fatigue state detection method based on blocking characteristic matrix algorithm and SVM
CN110210382A (en) * 2019-05-30 2019-09-06 上海工程技术大学 A kind of face method for detecting fatigue driving and device based on space-time characteristic identification
CN110334600A (en) * 2019-06-03 2019-10-15 武汉工程大学 A kind of multiple features fusion driver exception expression recognition method
CN110532887A (en) * 2019-07-31 2019-12-03 郑州大学 A kind of method for detecting fatigue driving and system based on facial characteristics fusion

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102622600A (en) * 2012-02-02 2012-08-01 西南交通大学 High-speed train driver alertness detecting method based on face image and eye movement analysis
CN106372621A (en) * 2016-09-30 2017-02-01 防城港市港口区高创信息技术有限公司 Face recognition-based fatigue driving detection method
CN107578008A (en) * 2017-09-02 2018-01-12 吉林大学 Fatigue state detection method based on blocking characteristic matrix algorithm and SVM
CN110210382A (en) * 2019-05-30 2019-09-06 上海工程技术大学 A kind of face method for detecting fatigue driving and device based on space-time characteristic identification
CN110334600A (en) * 2019-06-03 2019-10-15 武汉工程大学 A kind of multiple features fusion driver exception expression recognition method
CN110532887A (en) * 2019-07-31 2019-12-03 郑州大学 A kind of method for detecting fatigue driving and system based on facial characteristics fusion

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
WON OH LEE等: "Blink detection robust to various facial poses", vol. 193, no. 2, pages 356 - 372, XP027452186, DOI: 10.1016/j.jneumeth.2010.08.034 *
刘明周等: "基于面部几何特征及手部运动特征的驾驶员疲劳检测", vol. 55, no. 2, pages 19 - 26 *
周海英等: "改进的核相关自适应目标跟踪算法及其实验验证", 《科学技术与工程》, vol. 18, no. 14 *
居超等: "一种抗遮挡尺度自适应核相关滤波器跟踪算法", 《上 海 理 工 大 学 学 报》, vol. 40, no. 5 *
张国山等: "基于位置修正机制和模型更新策略的跟踪算法", 《信息与控制》, vol. 49, no. 2 *
陈忠等: "列车司机疲劳驾驶监测中的人脸定位方法研究", 《铁道科学与工程学报》, vol. 16, no. 12 *

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111986443A (en) * 2020-08-31 2020-11-24 上海博泰悦臻网络技术服务有限公司 Fatigue driving monitoring device and method
CN112069359A (en) * 2020-09-01 2020-12-11 上海熙菱信息技术有限公司 Method for dynamically filtering abnormal data of comparison result of snapshot object
CN112069359B (en) * 2020-09-01 2024-03-19 上海熙菱信息技术有限公司 Method for dynamically filtering abnormal data of snapshot object comparison result
CN112183215A (en) * 2020-09-02 2021-01-05 重庆利龙科技产业(集团)有限公司 Human eye positioning method and system combining multi-feature cascade SVM and human eye template
CN112183220A (en) * 2020-09-04 2021-01-05 广州汽车集团股份有限公司 Driver fatigue detection method and system and computer storage medium
CN112183220B (en) * 2020-09-04 2024-05-24 广州汽车集团股份有限公司 Driver fatigue detection method and system and computer storage medium thereof
CN112528767A (en) * 2020-11-26 2021-03-19 天津大学 Machine vision-based construction machinery operator fatigue operation detection system and method
CN112528792B (en) * 2020-12-03 2024-05-31 深圳地平线机器人科技有限公司 Fatigue state detection method, device, medium and electronic equipment
CN112528792A (en) * 2020-12-03 2021-03-19 深圳地平线机器人科技有限公司 Fatigue state detection method, fatigue state detection device, fatigue state detection medium, and electronic device
CN112528843A (en) * 2020-12-07 2021-03-19 湖南警察学院 Motor vehicle driver fatigue detection method fusing facial features
CN113040757B (en) * 2021-03-02 2022-12-20 江西台德智慧科技有限公司 Head posture monitoring method and device, head intelligent wearable device and storage medium
CN113040757A (en) * 2021-03-02 2021-06-29 江西台德智慧科技有限公司 Head posture monitoring method and device, head intelligent wearable device and storage medium
CN113076884A (en) * 2021-04-08 2021-07-06 华南理工大学 Cross-mode eye state identification method from near infrared light to visible light
CN113240885A (en) * 2021-04-27 2021-08-10 宁波职业技术学院 Method for detecting fatigue of vehicle-mounted driver
CN113197573A (en) * 2021-05-19 2021-08-03 哈尔滨工业大学 Film watching impression detection method based on expression recognition and electroencephalogram fusion
CN113780164B (en) * 2021-09-09 2023-04-28 福建天泉教育科技有限公司 Head gesture recognition method and terminal
CN113780164A (en) * 2021-09-09 2021-12-10 福建天泉教育科技有限公司 Head posture recognition method and terminal
CN113978475A (en) * 2021-09-22 2022-01-28 东风汽车集团股份有限公司 Control method and system for automatic driving intervention during fatigue driving of driver
CN113838265B (en) * 2021-09-27 2023-05-30 科大讯飞股份有限公司 Fatigue driving early warning method and device and electronic equipment
CN113838265A (en) * 2021-09-27 2021-12-24 科大讯飞股份有限公司 Fatigue driving early warning method and device and electronic equipment
CN115641542A (en) * 2022-12-23 2023-01-24 腾讯科技(深圳)有限公司 Data processing method and device and storage medium
CN117523521A (en) * 2024-01-04 2024-02-06 山东科技大学 Vehicle detection method based on Haar features and improved HOG features
CN117523521B (en) * 2024-01-04 2024-04-02 山东科技大学 Vehicle detection method based on Haar features and improved HOG features

Similar Documents

Publication Publication Date Title
CN111582086A (en) Fatigue driving identification method and system based on multiple characteristics
Zhang et al. Driver fatigue detection based on eye state recognition
CN106682578B (en) Weak light face recognition method based on blink detection
KR101653278B1 (en) Face tracking system using colar-based face detection method
Han et al. Driver drowsiness detection based on novel eye openness recognition method and unsupervised feature learning
Salve et al. Iris recognition using SVM and ANN
CN111460950A (en) Cognitive distraction method based on head-eye evidence fusion in natural driving conversation behavior
CN106599785A (en) Method and device for building human body 3D feature identity information database
Shakya et al. Human behavior prediction using facial expression analysis
Naz et al. Driver fatigue detection using mean intensity, SVM, and SIFT
Rajevenceltha et al. A novel approach for drowsiness detection using local binary patterns and histogram of gradients
Sadeghi et al. Modelling and segmentation of lip area in face images
Faraji et al. Drowsiness detection based on driver temporal behavior using a new developed dataset
D'orazio et al. A neural system for eye detection in a driver vigilance application
Panicker et al. Open-eye detection using iris–sclera pattern analysis for driver drowsiness detection
Monwar et al. Eigenimage based pain expression recognition
Ananthakumar Efficient face and gesture recognition for time sensitive application
Campadelli et al. Localization of facial features and fiducial points
Karungaru et al. Face recognition in colour images using neural networks and genetic algorithms
CN115100704A (en) Face recognition device and method for resisting spoofing attack by combining thermal infrared and visible light
Sheikh Robust recognition of facial expressions on noise degraded facial images
Dornaika et al. Driver drowsiness detection in facial images
CN114757967A (en) Multi-scale anti-occlusion target tracking method based on manual feature fusion
Akinci et al. A video-based eye pupil detection system for diagnosing bipolar disorder
CN113408389A (en) Method for intelligently recognizing drowsiness action of driver

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination