CN115050012A - Method for detecting fatigue driving of driver wearing mask based on lightweight model - Google Patents


Info

Publication number
CN115050012A
CN115050012A (application CN202210660238.1A)
Authority
CN
China
Prior art keywords
driver
fatigue driving
feature
network
method comprises
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210660238.1A
Other languages
Chinese (zh)
Inventor
刘强
谢谦
郑国鸿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University filed Critical Sun Yat Sen University
Priority to CN202210660238.1A priority Critical patent/CN115050012A/en
Publication of CN115050012A publication Critical patent/CN115050012A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/50: Context or environment of the image
    • G06V20/59: Context or environment of the image inside of a vehicle, e.g. relating to seat occupancy, driver state or inner lighting conditions
    • G06V20/597: Recognising the driver's state or behaviour, e.g. attention or drowsiness
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16: Human faces, e.g. facial parts, sketches or expressions
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/18: Eye characteristics, e.g. of the iris

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Ophthalmology & Optometry (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention relates to the technical field of fatigue driving detection, and in particular to a method for detecting fatigue driving of a driver wearing a mask based on a lightweight model. The method comprises the following steps: S1: acquire a face image of the driver and preprocess it; S2: pass the face image data into a lightweight GAN to enhance image quality; S3: pass the driver face image data into a lightweight target detection network, extract features with an improved GhostNet backbone feature network, fuse the features by combining the SPP and PANet structures, and finally classify and regress with a Yolo Head to obtain the states and position coordinates of the driver's eyes, pupils and head; S4: judge whether the driver is in a fatigue driving state by combining the driver's percentage of eye closure, gaze movement speed and nodding frequency. Compared with the traditional approach of simply increasing the number of network layers, the method integrates multiple identical backbones with fewer layers each and applies channel shuffling and splitting, improving both detection speed and accuracy.

Description

Method for detecting fatigue driving of driver wearing mask based on lightweight model
Technical Field
The invention relates to the technical field of fatigue driving detection, in particular to a method for detecting fatigue driving of a driver wearing a mask based on a lightweight model.
Background
Research shows that more than 20% of traffic accidents each year are caused by fatigue driving, which has become one of the main causes of driving accidents.
Vision-based driver fatigue detection is currently the most efficient and objective way to analyze a driver's fatigue state. It is non-subjective, breaking with the past in which fatigue could only be sensed through the driver's own physiology, and it judges the fatigue state more accurately and objectively. Assessing the driver's fatigue state more accurately and directly can, to a certain extent, reduce the traffic accidents caused by driver fatigue.
Most existing fatigue driving detection is built on facial feature points. For example, Chinese patent application CN2021115482505 provides a multi-index fatigue driving detection method based on image recognition: 68 facial feature points are obtained with a Histogram of Oriented Gradients (HOG) face detection algorithm and a CE-CLM model; relevant feature-point information is then selected and, through coordinate-system conversion and calculation, the driver's eye opening/closing state, gaze direction and head-turning information are obtained; fatigue criteria and thresholds are set and compared against these indices to judge the driver's fatigue level. However, this method relies too heavily on face localization and cannot stay robust when the face is occluded by, for example, bangs, sunglasses or a mask.
Chinese patent application CN2021115900351 provides a driver fatigue detection method and system that can detect fatigue while the driver wears a mask: by calculating the degree of opening and closing of the driver's eyes and mouth, it can largely judge whether the driver is in a fatigue driving state and issue an effective early warning. However, the method feeds the masked face image into a GAN to generate a mask-free face image, and the GAN model has a large number of parameters and a long training time, so it cannot be readily deployed on vehicle-mounted mobile detection equipment. Moreover, judging fatigue only from the degree of eye and mouth opening and closing leads to low recognition accuracy.
Disclosure of Invention
The invention provides a method for detecting fatigue driving of a driver wearing a mask based on a lightweight model, aiming to solve the above problems in the prior art.
To achieve this, the technical solution provided by the invention is as follows:
A method for detecting fatigue driving of a driver wearing a mask based on a lightweight model, characterized by comprising the following steps:
s1: acquiring a face image of a driver in a driving state, and preprocessing the face image;
s2: transmitting the facial image data of the driver into a lightweight GAN, and enhancing the image quality;
s3: the enhanced driver face image data are transmitted into a lightweight target detection network, feature extraction is carried out through an improved GhostNet main feature network, feature fusion is carried out by combining an SPP structure and a PANet structure, and finally, a Yolo Head is used for classification and regression to obtain the state and position coordinates of the eyes, the pupils and the Head of a driver;
s4: and judging whether the driver is in a fatigue driving state or not by integrating the percentage of eyes closed, the moving speed of the sight line and the nodding frequency of the driver.
Preferably, in step S2, the lightweight GAN includes a generator constructed based on ShuffleNetV2 and a discriminator constructed based on PatchGAN, and performs low-light improvement and deblurring on the input video stream to generate well-lit, sharp images.
Preferably, the driver face images are preprocessed in step S1 and uniformly resized to 224 × 224 × 3;
in step S3, features are extracted through the improved GhostNet backbone feature network as follows:
the input 224 × 224 × 3 image passes through an ordinary 16-channel 1 × 1 convolution block to obtain a 7 × 7 × 160 feature layer; a 1 × 1 convolution block then adjusts the number of channels to obtain a 7 × 7 × 960 feature layer; finally, global average pooling and a 1 × 1 convolution give a 1 × 1 × 1280 feature layer for fully connected classification. Two identical GhostNet trunks are combined and connected, the output of each stage of the first trunk serving as part of the input that flows to the parallel stage of the next trunk through adjacent high-level combination; depthwise separable convolutions generate the redundant feature maps, and channel shuffling and splitting reduce the computation. The backbone feature extraction network yields three effective feature layers of sizes 76 × 76 × 256, 38 × 38 × 512 and 19 × 19 × 1024.
Preferably, in step S3, each eye is detected separately, and the PERCLOS value is calculated as the proportion of time per unit time during which the driver's eyelids cover more than 80% of the pupil area.
Preferably, in step S3, the gaze movement speed is calculated from the change in distance between the pupil-centre coordinates and the eye-centre coordinates, and the nodding frequency is calculated by counting the number of nods per unit time.
Preferably, the SPP structure is applied to the three effective feature layers to enlarge the receptive field and separate out salient features, and PANet performs repeated feature extraction; the three resulting feature layers have sizes 76 × 76 × 33, 38 × 38 × 33 and 19 × 19 × 33;
finally the layers are passed to the Yolo Head for prediction and decoding, transforming them into 76 × 76 × 11, 38 × 38 × 11 and 19 × 19 × 11, where the 11 channels of the last dimension represent x_offset, y_offset, h, w, the confidence and the classification result; adding the x_offset and y_offset of each grid point gives the centre of the prediction box, and combining the prior (anchor) boxes with h and w gives its width and height, yielding the states and position coordinates of the driver's eyes, pupils and head.
Preferably, in steps S2 and S3, the datasets used by the lightweight target detection network come from the GoPro dataset, the ZJU dataset, the YawDD dataset and a self-made dataset shot in real conditions.
Preferably, the method further comprises the following step:
s5: if the driver is detected to be in a fatigued state, a warning signal is issued.
The invention has the beneficial effects that:
1. Compared with the traditional approach of labeling 68 facial feature points with the Dlib library, the method uses a target detection network that skips the face-localization step and directly detects the states and position coordinates of the eyes, pupils and head. This simplifies the detection pipeline and avoids missed detections caused by face loss when the driver wears a mask, turns the head or lowers the head;
2. The lightweight GAN and GhostNet backbones, together with depthwise separable convolutions, greatly reduce the number of network parameters, making the model small and highly real-time and thus suitable for deployment on vehicle-mounted mobile equipment;
3. For human eye and pupil detection, the invention introduces a stacked vignetting enhancement method at the data level and integrates several identical trunks at the network level, improving the detection of eyes and pupils; single-eye detection solves the problem that the eyes cannot be detected when the face turns sideways;
4. The GAN is built for driving environments with dark, blurred images, such as low illumination and bumpy roads, improving image quality and thereby the robustness of the model.
Drawings
FIG. 1 is a flow chart of the operation of the present invention;
FIG. 2 is a schematic diagram of a target detection network;
FIG. 3 is a schematic diagram of a cascading vignetting enhancement method in a target detection network;
FIG. 4 is a schematic diagram of eye state and position detection in the present invention;
FIG. 5 is a schematic diagram of the pupil state and position detection in the present invention;
fig. 6 is a schematic diagram of head state and position detection in the present invention.
Detailed Description
The invention will be further elucidated with reference to accompanying Figures 1 to 6:
A method for detecting fatigue driving of a driver wearing a mask based on a lightweight model comprises the following steps:
s1: acquiring a face image of a driver in a driving state, and preprocessing the face image;
s2: transmitting the driver face image data into a lightweight generated countermeasure Network (GAN for short) to enhance the image quality;
s3: the enhanced driver face image data is transmitted into a lightweight target detection Network, feature extraction is carried out through an improved GhostNet main feature Network, feature fusion is carried out by combining a Spatial Pyramid Pooling (SPP) structure and a Path Aggregation Network (PANET) structure, and finally a detection Head (Yolo Head) is used for classification and regression to obtain the state and position coordinates of the eyes, the pupils and the Head of a driver;
s4: judging whether the driver is in a fatigue driving state or not by integrating the Percentage of closed eyes of the driver (PERCLOS), the over Time, the sight line moving speed and the head-nodding frequency;
s5: if the driver is detected to be in a fatigue state, a warning signal is sent out.
The above-mentioned overall detection process is shown in FIG. 1.
In step S1, an auto-focus camera with a frame rate of 60 fps is used, in contrast to an ordinary fixed-focus 30 fps camera. The auto-focus function keeps the driver's face in focus as far as possible during driving, providing the network with sharp images and details; the higher frame rate supplies roughly twice as many real-time face images per unit time, capturing more facial detail, reducing the chance of motion smear and improving recognition accuracy.
In step S2, the lightweight GAN (Generative Adversarial Network) consists of a generator built on ShuffleNetV2 and a discriminator built on PatchGAN, so that the GAN keeps good performance while its computation is reduced; it performs low-light improvement and deblurring on the input video stream and generates well-lit, sharp images, thereby improving image quality.
In step S2, the training process of the lightweight GAN is as follows: the generator receives low-light and blurred image samples and generates well-lit, sharp images; the discriminator receives the generated images and judges them real or fake; the iteration repeats until the generator and discriminator losses converge to a minimum. The datasets used by the lightweight target detection network come from the GoPro dataset, the ZJU dataset, the YawDD dataset and a self-made dataset shot in real conditions.
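The alternating update scheme described here can be sketched as follows in PyTorch. The Generator and PatchDiscriminator below are deliberately minimal stand-ins for the ShuffleNetV2-based generator and PatchGAN discriminator, whose exact architectures the patent does not specify; only the training loop is the point.

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Stand-in for the ShuffleNetV2-based generator: maps a dark or blurred
    frame to an enhanced frame of the same size."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 3, 3, padding=1), nn.Sigmoid())

    def forward(self, x):
        return self.net(x)

class PatchDiscriminator(nn.Module):
    """Stand-in for the PatchGAN discriminator: one real/fake logit per
    overlapping image patch instead of a single global score."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 1, 4, stride=2, padding=1))

    def forward(self, x):
        return self.net(x)

G, D = Generator(), PatchDiscriminator()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(degraded, clean):
    """One iteration: D learns real vs. generated, then G learns to fool D."""
    fake = G(degraded)

    # Discriminator update on real and (detached) generated images.
    d_real, d_fake = D(clean), D(fake.detach())
    loss_d = bce(d_real, torch.ones_like(d_real)) + bce(d_fake, torch.zeros_like(d_fake))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator update: push D's verdict on the fake towards 'real'.
    d_fake = D(fake)
    loss_g = bce(d_fake, torch.ones_like(d_fake))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()
```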
In step S3, the network takes frames from the input video stream and performs feature extraction on each picture in the backbone network; the features extracted in this process, i.e. the feature set of the frame, are referred to as feature layers. The images obtained in steps S1 and S2 are normalized to 224 × 224 × 3. The backbone feature extraction network of the target detection network is constructed based on GhostNet, as shown in Fig. 2: with a 224 × 224 × 3 input, an ordinary 16-channel 1 × 1 convolution block leads to a 7 × 7 × 160 feature layer; a 1 × 1 convolution block then adjusts the number of channels to obtain a 7 × 7 × 960 feature layer; finally, global average pooling and a 1 × 1 convolution give a 1 × 1 × 1280 feature layer for fully connected classification. The number of layers in each single sub-backbone is reduced, but several identical backbones are integrated: two identical GhostNet trunks are combined and connected, the output of each stage of the first trunk serving as part of the input that flows to the parallel stage of the next trunk through Adjacent High-Level Combination (AHLC). Depthwise separable convolutions generate the redundant feature maps, and channel shuffling and splitting reduce the computation; that is, at each backbone stage 1/3 of the channels are deliberately dropped and the order of the remaining channels is shuffled. The backbone feature extraction network yields three effective feature layers of sizes 76 × 76 × 256, 38 × 38 × 512 and 19 × 19 × 1024. The SPP structure then enlarges the receptive field and separates out the salient features, and PANet performs repeated feature extraction, i.e. feature fusion and enhancement. The three resulting feature layers have sizes 76 × 76 × 33, 38 × 38 × 33 and 19 × 19 × 33. They are finally passed to the Yolo Head for prediction and decoding, which transforms them into 76 × 76 × 11, 38 × 38 × 11 and 19 × 19 × 11; the 11 channels of the last dimension decompose as 4 + 1 + 6, representing x_offset, y_offset, h, w, the confidence and the classification result. The classification result covers six labels, including open eyes, closed eyes, pupil, head up and head down. Adding the x_offset and y_offset of each grid point gives the centre of the prediction box, and combining the prior (anchor) boxes with h and w gives its width and height, yielding the states and position coordinates of the driver's eyes, pupils and head.
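To make the decoding step concrete, the sketch below decodes one Yolo-Head output of shape S × S × 11 under the channel layout just described. The function name, the anchor sizes and the sigmoid/exp conventions are illustrative assumptions (the patent only states that offsets are added to the grid points and that h and w scale the prior box):

```python
import torch

def decode_head(pred, anchor_w, anchor_h):
    """Decode one S x S x 11 Yolo-Head output. Channel layout per the text:
    x_offset, y_offset, h, w, confidence, then six class scores."""
    S = pred.shape[0]
    gy, gx = torch.meshgrid(torch.arange(S), torch.arange(S), indexing="ij")

    # Grid point + predicted offset gives the box centre (in grid units).
    cx = gx + torch.sigmoid(pred[..., 0])
    cy = gy + torch.sigmoid(pred[..., 1])

    # Prior (anchor) box rescaled by the predicted h and w terms.
    bh = anchor_h * torch.exp(pred[..., 2])
    bw = anchor_w * torch.exp(pred[..., 3])

    conf = torch.sigmoid(pred[..., 4])                    # objectness confidence
    label = pred[..., 5:].softmax(dim=-1).argmax(dim=-1)  # one of six state labels
    return cx / S, cy / S, bw, bh, conf, label            # centres normalised to [0, 1]

cx, cy, bw, bh, conf, label = decode_head(torch.randn(19, 19, 11), 0.2, 0.3)
```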
It should be noted that reducing the number of layers of each single sub-backbone while integrating several identical backbones, generating the redundant feature maps with depthwise separable convolutions, and cutting computation through channel shuffling and splitting (i.e. deliberately dropping 1/3 of the channels and shuffling the rest at each backbone stage) greatly reduces the model size compared with a traditional target detection framework while maintaining, and even slightly improving, performance: the parameter count drops by about 63%. The lighter model can therefore be deployed on mobile and embedded devices and meets the real-time requirement of fatigue detection. Meanwhile, the facial features of the target are handled uniformly: a single framework and detection method covers a variety of facial features such as the eyes, mouth, nose and head pose, making full use of the face information, improving detection efficiency, simplifying the detection pipeline and leaving room for upgrades.
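A minimal PyTorch sketch of two of the backbone ingredients named above, assuming the standard GhostNet formulation: a Ghost-style module that derives part of its output channels with a cheap depthwise convolution, plus channel shuffling and a split that drops 1/3 of the channels. The dimensions and module interface are illustrative, not taken from the patent.

```python
import torch
import torch.nn as nn

class GhostModule(nn.Module):
    """Ghost-style block: a small dense convolution produces the 'intrinsic'
    feature maps, then a cheap depthwise convolution derives the remaining
    ('redundant') maps, which are concatenated to form the output."""
    def __init__(self, in_ch, out_ch, ratio=2):
        super().__init__()
        init_ch = out_ch // ratio            # channels from the dense conv
        cheap_ch = out_ch - init_ch          # channels from the cheap conv
        self.primary = nn.Sequential(
            nn.Conv2d(in_ch, init_ch, 1, bias=False),
            nn.BatchNorm2d(init_ch), nn.ReLU(inplace=True))
        self.cheap = nn.Sequential(          # depthwise: one 3x3 filter per channel
            nn.Conv2d(init_ch, cheap_ch, 3, padding=1, groups=init_ch, bias=False),
            nn.BatchNorm2d(cheap_ch), nn.ReLU(inplace=True))

    def forward(self, x):
        y = self.primary(x)
        return torch.cat([y, self.cheap(y)], dim=1)

def channel_shuffle(x, groups=3):
    """Permute channels across groups so information mixes between branches."""
    b, c, h, w = x.shape
    x = x.view(b, groups, c // groups, h, w).transpose(1, 2)
    return x.reshape(b, c, h, w)

feat = torch.randn(1, 96, 56, 56)
mixed = channel_shuffle(feat, groups=3)
kept, _ = torch.split(mixed, [64, 32], dim=1)  # drop 1/3 of the channels
out = GhostModule(64, 128)(kept)               # -> shape (1, 128, 56, 56)
```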
In step S3, the backbone network uses the Mish activation function:
Mish(x) = x · tanh(ln(1 + e^x)) (1)
The other modules use the Leaky ReLU activation function:
Leaky ReLU(x) = max(αx, x) (2)
where α = 0.01.
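Both activations are standard; the snippet below is a direct transcription of equations (1) and (2) (PyTorch also ships them built in as F.mish and F.leaky_relu):

```python
import torch
import torch.nn.functional as F

def mish(x):
    # Mish(x) = x * tanh(ln(1 + e^x)); softplus(x) = ln(1 + e^x)
    return x * torch.tanh(F.softplus(x))

def leaky_relu(x, alpha=0.01):
    # Leaky ReLU(x) = max(alpha * x, x), with alpha = 0.01 as in the text
    return torch.maximum(alpha * x, x)

x = torch.linspace(-3.0, 3.0, steps=7)
print(mish(x))
print(leaky_relu(x))
```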
The target detection network outputs the states and position coordinates of the driver's eyes, pupils and head. Each eye is detected separately to obtain its state (more than 80% closed, or open); the PERCLOS value is calculated as the proportion of time per unit time during which the driver's eyelids cover more than 80% of the pupil area, implemented in the system as the ratio of frames with the eyes more than 80% closed to the total number of frames. The gaze movement speed is computed from the change in distance between the pupil-centre and eye-centre coordinates, and the nodding frequency by counting the number of nods per unit time. Because the target detection network detects the two eyes independently, the loss of targets in side-face poses that plagues traditional feature-point localization is avoided: eye-feature acquisition is unaffected by viewing angle, the miss and false-alarm rates are reduced, and system stability is improved.
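A minimal sketch of the three indicators, assuming per-frame state labels ('closed', 'head_up', 'head_down') and centre coordinates coming from the detector; the window conventions and label strings are illustrative:

```python
import math

def perclos(eye_states):
    """PERCLOS (P80): fraction of frames in the window whose eye state is
    'closed', i.e. the eyelids cover more than 80% of the pupil area."""
    return sum(s == "closed" for s in eye_states) / len(eye_states)

def gaze_speed(pupil_xy, eye_xy, prev_offset, dt):
    """Gaze movement speed from the change in pupil-centre-to-eye-centre
    distance between two frames dt seconds apart."""
    offset = math.dist(pupil_xy, eye_xy)
    return abs(offset - prev_offset) / dt, offset  # speed, new offset

def nod_frequency(head_states, fps):
    """Nods per second: count head-up -> head-down transitions in the window."""
    nods = sum(a == "head_up" and b == "head_down"
               for a, b in zip(head_states, head_states[1:]))
    return nods / (len(head_states) / fps)

# 60 fps window: 2 s of frames, eyes closed in a quarter of them
states = ["open"] * 90 + ["closed"] * 30
print(perclos(states))  # 0.25
```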
In step S4, the target detection network skips face localization and searches directly for the fatigue-feature targets in the occluded face, as shown in Figs. 4, 5 and 6: Fig. 4 shows recognition of the eye-closing action, Fig. 5 recognition of pupil movement, and Fig. 6 recognition of the nodding (head up/down) motion. The system judges whether the driver is in a fatigue state according to a fatigue judgment rule, namely a four-level rule formulated from the eye state, the gaze movement speed and the head movement state, with the levels not fatigued, mildly fatigued, fatigued and severely fatigued, as shown in the following table:
[Table in the original filing (image only): the four-level fatigue judgment rule combining eye state, gaze movement speed and head movement state.]
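Since the table survives only as an image, the numeric cut-offs in the sketch below are hypothetical placeholders; only the four-level structure and the three inputs follow the description:

```python
def fatigue_level(perclos, gaze_speed, nods_per_minute):
    """Four-level judgment per the description. All numeric thresholds are
    hypothetical placeholders: the real cut-offs are in the table image."""
    score = 0
    if perclos > 0.15:            # placeholder PERCLOS cut-off
        score += 1
    if perclos > 0.40:            # placeholder: prolonged eye closure
        score += 1
    if gaze_speed < 5.0:          # placeholder: sluggish gaze, units px/s
        score += 1
    if nods_per_minute > 6:       # placeholder nodding-frequency cut-off
        score += 1
    levels = ["not fatigued", "mildly fatigued", "fatigued", "severely fatigued"]
    return levels[min(score, 3)]

print(fatigue_level(perclos=0.2, gaze_speed=3.0, nods_per_minute=8))  # 'severely fatigued'
```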
in S5, when it is detected that the driver is in a tired state, a warning sound is emitted through a speaker of the in-vehicle portable device until the driver is awake. When the driver is awake, the warning sound stops in time, and the driver is reminded to stop at the side or search a service area and a rest area nearby for rest in time.
As shown in Fig. 3, to address the target detection network's weak performance on small targets, namely the human eyes and pupils, the method adopts stacked vignetting (halation) enhancement: splicing with random scaling, random cropping and random arrangement is added, and six pictures are weighted, fused and arranged. Specifically, six pictures are read at a time and individually flipped, scaled and colour-gamut shifted, then placed in six different positions; by processing the dataset in this way and boosting the contrast around the eyes to make them stand out, detection of the eye and pupil targets is greatly improved.
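A sketch of this six-picture augmentation with PIL, under the assumption of a 3 × 2 tiling (the text only says "six different directions"); the eye-region contrast boost requires eye annotations and is approximated here by a global contrast jitter:

```python
import random
from PIL import Image, ImageEnhance, ImageOps

def stacked_mosaic(paths, cell=224):
    """Read six pictures, flip/scale/colour-shift each at random, and tile
    them into a 3 x 2 mosaic; arrangement and jitter ranges are assumptions."""
    canvas = Image.new("RGB", (cell * 3, cell * 2))
    for i, path in enumerate(random.sample(paths, 6)):
        img = Image.open(path).convert("RGB")
        if random.random() < 0.5:                    # random horizontal flip
            img = ImageOps.mirror(img)
        scale = random.uniform(0.8, 1.2)             # random zoom
        img = img.resize((int(cell * scale), int(cell * scale)))
        img = ImageOps.fit(img, (cell, cell))        # crop back to cell size
        img = ImageEnhance.Color(img).enhance(random.uniform(0.7, 1.3))     # colour-gamut shift
        img = ImageEnhance.Contrast(img).enhance(random.uniform(0.9, 1.3))  # contrast jitter
        canvas.paste(img, ((i % 3) * cell, (i // 3) * cell))
    return canvas
```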

Claims (8)

1. A method for detecting fatigue driving of a driver wearing a mask based on a lightweight model, characterized by comprising the following steps:
s1: acquiring a face image of a driver in a driving state, and preprocessing the face image;
s2: transmitting the facial image data of the driver into a lightweight GAN, and enhancing the image quality;
s3: the enhanced driver face image data are transmitted into a lightweight target detection network, feature extraction is carried out through an improved GhostNet main feature network, feature fusion is carried out by combining an SPP structure and a PANet structure, and finally, a Yolo Head is used for classification and regression to obtain the state and position coordinates of the eyes, the pupils and the Head of a driver;
s4: and judging whether the driver is in a fatigue driving state or not by integrating the percentage of eyes closed, the moving speed of the sight line and the nodding frequency of the driver.
2. The method for detecting fatigue driving of a driver wearing a mask according to claim 1, wherein: in step S2, the lightweight GAN includes a generator constructed based on ShuffleNetV2 and a discriminator constructed based on PatchGAN, and performs low-light improvement and deblurring on the input video stream to generate well-lit, sharp images.
3. The method for detecting fatigue driving of a driver wearing a mask according to claim 1, wherein:
the driver face images are preprocessed in step S1 and uniformly resized to 224 × 224 × 3;
in step S3, features are extracted through the improved GhostNet backbone feature network as follows:
the input 224 × 224 × 3 image passes through an ordinary 16-channel 1 × 1 convolution block to obtain a 7 × 7 × 160 feature layer; a 1 × 1 convolution block then adjusts the number of channels to obtain a 7 × 7 × 960 feature layer; finally, global average pooling and a 1 × 1 convolution give a 1 × 1 × 1280 feature layer for fully connected classification. Two identical GhostNet trunks are combined and connected, the output of each stage of the first trunk serving as part of the input that flows to the parallel stage of the next trunk through adjacent high-level combination; depthwise separable convolutions generate the redundant feature maps, and channel shuffling and splitting reduce the computation. The backbone feature extraction network yields three effective feature layers of sizes 76 × 76 × 256, 38 × 38 × 512 and 19 × 19 × 1024.
4. The method for detecting fatigue driving of a driver wearing a mask according to claim 1, wherein: in step S3, each eye is detected separately, and the PERCLOS value is calculated as the proportion of time per unit time during which the driver's eyelids cover more than 80% of the pupil area.
5. The method for detecting fatigue driving of a driver wearing a mask according to claim 1, wherein: in step S3, the gaze movement speed is calculated from the change in distance between the pupil-centre coordinates and the eye-centre coordinates, and the nodding frequency is calculated by counting the number of nods per unit time.
6. The method for detecting fatigue driving of a driver wearing a mask according to claim 3, wherein: the SPP structure is applied to the three effective feature layers to enlarge the receptive field and separate out salient features, and PANet performs repeated feature extraction; the three resulting feature layers have sizes 76 × 76 × 33, 38 × 38 × 33 and 19 × 19 × 33;
finally the layers are passed to the Yolo Head for prediction and decoding, transforming them into 76 × 76 × 11, 38 × 38 × 11 and 19 × 19 × 11, where the 11 channels of the last dimension represent x_offset, y_offset, h, w, the confidence and the classification result; adding the x_offset and y_offset of each grid point gives the centre of the prediction box, and combining the prior (anchor) boxes with h and w gives its width and height, yielding the states and position coordinates of the driver's eyes, pupils and head.
7. The method for detecting fatigue driving of a driver wearing a mask according to claim 1, wherein: in steps S2 and S3, the datasets used by the lightweight target detection network come from the GoPro dataset, the ZJU dataset, the YawDD dataset and a self-made dataset shot in real conditions.
8. The method for detecting fatigue driving of a driver wearing a mask according to claim 1, further comprising the following step:
s5: if the driver is detected to be in a fatigued state, a warning signal is issued.
CN202210660238.1A 2022-06-13 2022-06-13 Method for detecting fatigue driving of driver wearing mask based on lightweight model Pending CN115050012A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210660238.1A CN115050012A (en) 2022-06-13 2022-06-13 Method for detecting fatigue driving of driver wearing mask based on lightweight model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210660238.1A CN115050012A (en) 2022-06-13 2022-06-13 Method for detecting fatigue driving of driver wearing mask based on lightweight model

Publications (1)

Publication Number Publication Date
CN115050012A true CN115050012A (en) 2022-09-13

Family

ID=83160847

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210660238.1A Pending CN115050012A (en) 2022-06-13 2022-06-13 Method for detecting fatigue driving of driver wearing mask based on lightweight model

Country Status (1)

Country Link
CN (1) CN115050012A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115578719A (en) * 2022-10-13 2023-01-06 中国矿业大学 YM _ SSH-based fatigue state detection method for lightweight target detection
CN115578719B (en) * 2022-10-13 2024-05-17 中国矿业大学 YM_SSH-based fatigue state detection method for lightweight target detection
CN117351648A (en) * 2023-10-08 2024-01-05 海南大学 Driver fatigue monitoring and early warning method and system

Similar Documents

Publication Publication Date Title
CN108830252B (en) Convolutional neural network human body action recognition method fusing global space-time characteristics
CN109819208B (en) Intensive population security monitoring management method based on artificial intelligence dynamic monitoring
CN115050012A (en) Method for detecting fatigue driving of driver wearing mask based on lightweight model
CN103020992B (en) A kind of video image conspicuousness detection method based on motion color-associations
Niu et al. View-invariant human activity recognition based on shape and motion features
CN111832413B (en) People flow density map estimation, positioning and tracking method based on space-time multi-scale network
CN109614882A (en) A kind of act of violence detection system and method based on human body attitude estimation
CN111062292B (en) Fatigue driving detection device and method
CN103824070A (en) Rapid pedestrian detection method based on computer vision
CN103079034A (en) Perception shooting method and system
CN111126223B (en) Video pedestrian re-identification method based on optical flow guide features
KR101872811B1 (en) Apparatus and method for action pattern recognition, and method for generating of action pattern classifier
CN106886778B (en) License plate character segmentation and recognition method in monitoring scene
CN111985348B (en) Face recognition method and system
CN106295583B (en) Method and device for reminding user of driving mode
CN106529494A (en) Human face recognition method based on multi-camera model
CN106056624A (en) Unmanned aerial vehicle high-definition image small target detecting and tracking system and detecting and tracking method thereof
CN110378234A (en) Convolutional neural networks thermal imagery face identification method and system based on TensorFlow building
CN114973412A (en) Lip language identification method and system
CN113657195A (en) Face image recognition method, face image recognition equipment, electronic device and storage medium
KR20130106640A (en) Apparatus for trace of wanted criminal and missing person using image recognition and method thereof
CN115719457A (en) Method for detecting small target in unmanned aerial vehicle scene based on deep learning
CN116883883A (en) Marine ship target detection method based on generation of anti-shake of countermeasure network
CN115035159A (en) Video multi-target tracking method based on deep learning and time sequence feature enhancement
CN115116137A (en) Pedestrian detection method based on lightweight YOLO v5 network model and space-time memory mechanism

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination