GB2621863A - Pose classification and in-cabin monitoring methods and associated systems - Google Patents
- Publication number
- GB2621863A GB2621863A GB2212345.9A GB202212345A GB2621863A GB 2621863 A GB2621863 A GB 2621863A GB 202212345 A GB202212345 A GB 202212345A GB 2621863 A GB2621863 A GB 2621863A
- Authority
- GB
- United Kingdom
- Prior art keywords
- face
- key point
- head
- class
- image data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/50—Context or environment of the image
- G06V20/59—Context or environment of the image inside of a vehicle, e.g. relating to seat occupancy, driver state or inner lighting conditions
- G06V20/597—Recognising the driver's state or behaviour, e.g. attention or drowsiness
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
Abstract
Method for determining a vehicle occupant's face or head pose, comprising: extracting key point coordinate data 18 indicative of a body part from an image 14; determining a head or face region in the image from the key point data and cropping the face or head region from the image 20 to obtain a cropped face image 22; classifying 24 the cropped face or head image to determine a pose class 26. The vehicle occupant may be a vehicle driver. Key point data may be extracted by a machine learning model (convolutional neural network) which also provides a key point label selected from the following group: hands, elbows, shoulders, nose, eyes, ears, mouth and chin. The face/head pose class may be determined by a machine learning model (artificial neural network). A body pose class 30 may be obtained by classifying the key point data 18 by a tree-based machine learning model 28. A control means may use the duration that a face/head classification 26 and body pose classification 30 are displayed to determine a control signal for the vehicle.
Description
DESCRIPTION
Pose classification and in-cabin monitoring methods and associated systems
TECHNICAL FIELD
The invention relates to computer-implemented methods for classifying face/head poses and in-cabin monitoring of vehicle occupants.
BACKGROUND
Driver assistance systems for vehicle drivers have become more sophisticated recently and increasingly rely on machine learning models, feature extraction and classification for monitoring the driver or other occupants of the vehicle. The systems are able to detect certain states, such as fatigue, drowsiness or consciousness of the driver, for example, and may generate control signals to the vehicle. These may range from a simple warning message to the driver, or broadcasting a hazardous state to other traffic participants, up to an emergency braking procedure to prevent collisions with other vehicles or persons.
US 10 089 543 B2 discloses a computer-implemented method for detecting a head pose in a vehicle that includes receiving images of a vehicle occupant located in the vehicle from an imaging device and selecting facial feature points from a plurality of facial feature points extracted from the images. The method includes calculating a head pose point based on normalizing the selected facial feature points, determining the head pose based on a change in position of the head pose point over a period of time T, and controlling one or more vehicle systems of the vehicle based on the head pose.
Long Chen et al., "Driver Fatigue Detection Based on Facial Key Points and LSTM", Security and Communication Networks, Volume 2021, Article ID 5383573, 9 pages, https://doi.org/10.1155/2021/5383573, published 14 June 2021, discloses a fatigue state recognition algorithm based on a multitask convolutional neural network (MTCNN) to detect the human face; subsequently an open-source software library, such as DLIB, is used to locate facial key points to extract a fatigue feature vector for each frame.
EP 3 690 729 A1 discloses a method for warning by detecting an abnormal state of a driver of a vehicle based on deep learning. The method includes steps of a driver state detecting device inputting an interior image of the vehicle into a drowsiness detecting network, to detect a facial part of the driver, detect an eye part from the facial part, detect a blinking state of an eye to determine a drowsiness state, and inputting the interior image into a pose matching network, to detect body key points of the driver and determine whether the body key points match one of preset driving postures, to determine the abnormal state.
SUMMARY OF THE INVENTION
It is the object of the invention to improve in-cabin monitoring systems for vehicles.
The object is achieved by the subject-matter of the independent claims. Preferred embodiments are subject-matter of the dependent claims.
The invention provides a computer-implemented method for determining a face/head pose class of a vehicle occupant from unlabeled image data that at least partially includes the vehicle occupant, the method comprising: a) capturing image data that at least partially include the vehicle occupant; b) extracting key point data from the image data, wherein the key point data include at least one key point, that is indicative of a specific body part of the vehicle occupant, and key point coordinates for each key point; c) determining from the key point data obtained in step b) a face/head region within the image data, wherein the face/head region includes a face portion and/or head portion of the vehicle occupant, and cropping the face/head region from the image data in order to obtain cropped image data that only include the face/head region; d) determining a face/head pose class by classifying the cropped image data into one output face/head pose class of a predetermined set of face/head pose classes.
Preferably, the vehicle occupant is a driver of the vehicle.
Preferably, in step b) the key point data is extracted by a machine learning model that includes a convolutional neural network.
Preferably, in step b) each extracted key point is labelled with a body part label chosen from a group comprising or consisting of hands, elbows, shoulders, nose, eyes, ears, mouth, chin.
Preferably, in step d) the face/head pose class is determined by a machine learning model that includes an artificial neural network that is trained to determine a probability score for each face/head pose class of the predetermined set, and the face/head pose class with the highest probability is selected as the output face/head pose class.
Preferably, the method includes the step: e) determining a body pose class by classifying the key point data into one output body pose class of a predetermined set of body pose classes.
Preferably, in step e) the body pose class is determined by a tree-based machine learning model that is configured to determine the output body pose class based only on the key point data.
The invention provides an in-cabin monitoring method for monitoring at least one vehicle occupant within a vehicle, the method comprising: a) performing a previously described method; b) evaluating the output face/head pose class and optionally the output body pose class and generating a control signal for the vehicle based on the evaluation of the respective output pose class.
Preferably, in step b) the evaluation includes a time measurement of how long a specific output face/head pose class and optionally output body pose class are displayed by the vehicle occupant, and generating the control signal also based on the time measurement.
Preferably, the control signal brings the vehicle into a safe state by slowly driving the vehicle to a side or (hard) shoulder of the road.
Preferably, the control signal causes the vehicle to perform an emergency braking procedure.

The invention provides a computer program comprising instructions which, when the program is executed by a computer, cause the computer to carry out a previously described method.
The invention provides a computer-readable medium having stored thereon or a data carrier signal carrying the computer program.
The invention provides a classification device that is configured for determining an output face/head pose class of a vehicle occupant from unlabeled image data that at least partially includes the vehicle occupant, the device comprising: a) an imaging device configured for capturing image data that at least partially include the vehicle occupant; b) a key point extraction means that is configured for extracting key point data from the image data, wherein the key point data include at least one key point, that is indicative of a specific body part of the vehicle occupant, and key point coordinates for each key point; c) an image data cropping means that is configured for determining from the key point data obtained in step b) a face/head region within the image data, wherein the face/head region includes a face portion and/or head portion of the vehicle occupant, and cropping the face/head region from the image data in order to obtain cropped image data that only include the face/head region; d) a face/head pose classification means configured for determining a face/head pose class by classifying the cropped image data into one output face/head pose class of a predetermined set of face/head pose classes.
Preferably, the key point extraction means include a machine learning model that includes a convolutional neural network that is configured for extracting the key point data.
Preferably, the key point extraction means are configured to label each extracted key point with a body part label chosen from a group comprising or consisting of hands, elbows, shoulders, nose, eyes, ears, mouth, chin.
Preferably, the face/head classification means are configured to determine the face/head pose class by using a machine learning model that includes an artificial neural network that is trained to determine a probability score for each face/head pose class of the predetermined set, and the face/head pose class with the highest probability is selected as the output face/head pose class.
Preferably, the device comprises e) a body pose classification means configured for determining a body pose class by classifying the key point data into one output body pose class of a predetermined set of body pose classes.
Preferably, the body pose classification means are configured to determine the body pose class by using a tree-based machine learning model that is configured to determine the output body pose class based only on the key point data.
The invention provides an in-cabin vehicle monitoring system configured for monitoring at least one vehicle occupant within a vehicle, the device comprising: a) means for performing a preferred classification method; b) control means configured for evaluating the output face/head pose class and optionally the output body pose class and generating a control signal for the vehicle based on the evaluation of the respective output pose class.
Preferably, the control means are configured for performing a time measurement of how long a specific output face/head pose class and optionally output body pose class are displayed by the vehicle occupant, and for generating the control signal also based on the time measurement.
Preferably, the control signal brings the vehicle into a safe state by slowly driving the vehicle to a side or (hard) shoulder of the road.
Preferably, the control signal causes the vehicle to perform an emergency braking procedure.
This solution can be used to detect the state of the driver of a vehicle based on his body and face/head pose. The body and head poses are extracted from an image, e.g., from a near-infrared (NIR) or RGB camera. The idea includes three different machine learning (ML) models.
One model is configured to detect body key points of the driver inside the image. The key point model can be based on a convolutional neural network (CNN) that is trained to output key point confidence maps and part affinity fields. In a postprocessing step, the confidence maps and part affinity fields may be used to calculate the body key point coordinates relative to the input image size. The output of the key point model typically comprises the coordinates of certain body key points, e.g., hands, elbows, shoulders, and also facial/head key points for nose, eyes, ears and neck.
An example of a key point extraction model is the Qualcomm pose estimation model (TensorFlow) that is available on GitHub: https://github.com/quic/aimet-modelzoo/blob/develop/zoo_tensorflow/Docs/PoseEstimation.md. The model takes input data from images of size 224x400x3, normalized by (x/256)-0.5, i.e., pixel values from -0.5 to 0.5. Examples of suitable training parameters include optimizer: Adam, learning rate: 0.001, mini batch size: 16, epochs: 10.
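The normalization scheme described above can be sketched as follows; the function name and dummy frame are illustrative, only the (x/256)-0.5 formula and the 224x400x3 input size come from the text:

```python
import numpy as np

def normalize_input(image: np.ndarray) -> np.ndarray:
    """Map raw 8-bit pixel values into roughly [-0.5, 0.5),
    following the (x/256) - 0.5 scheme described above."""
    return image.astype(np.float32) / 256.0 - 0.5

# A dummy frame with the 224x400x3 input size mentioned above.
frame = np.zeros((224, 400, 3), dtype=np.uint8)
out = normalize_input(frame)
```

A pixel value of 0 maps to -0.5 and a value of 255 maps to just under 0.5, so the whole 8-bit range stays inside the stated interval.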
The model is preferably configured as a two-branch multi-stage CNN, as is known from Cao et al., "Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields", arXiv:1611.08050v2. Each stage in the first branch predicts confidence maps, and each stage in the second branch predicts affinity fields. After each stage, the predictions from the two branches, along with the image features, are concatenated for the next stage. This architecture is able to simultaneously predict detection confidence maps and affinity fields that encode part-to-part association.
Each branch is preferably configured as an iterative prediction architecture, which can refine predictions over successive stages, preferably with intermediate supervision at each stage. The image is first analyzed by a convolutional network generating a set of feature maps F that is input to the first stage of each branch. At the first stage, the network produces a set of detection confidence maps and a set of part affinity fields. In each subsequent stage, the predictions from both branches in the previous stage, along with the original image features, are concatenated and used to produce refined predictions.
The model is preferably trained with the COCO dataset (https://cocodataset.org/) that includes images labeled with body and face key points. The key point extraction model generates part affinity fields and heatmaps. These are postprocessed by applying non-maximum suppression to the heatmaps and assigning each detected key point in a heatmap to a specific person by applying the Hungarian algorithm to the part affinity fields. The result of the key point extraction model are (2D) key point coordinates (e.g., nose, eyes, etc.), i.e., the positions of the key points within the input image.
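The non-maximum suppression step on the heatmaps can be illustrated with a minimal sketch; the 3x3 neighbourhood and the 0.1 confidence threshold are illustrative assumptions, and the person-assignment step via the Hungarian algorithm is omitted:

```python
import numpy as np

def heatmap_peaks(heatmap: np.ndarray, threshold: float = 0.1):
    """Simple non-maximum suppression: keep pixels that are local maxima
    within their 3x3 neighbourhood and above a confidence threshold.
    Returns (row, col, score) tuples, i.e. candidate key point positions."""
    h, w = heatmap.shape
    peaks = []
    for y in range(h):
        for x in range(w):
            v = heatmap[y, x]
            if v <= threshold:
                continue  # suppress low-confidence responses
            window = heatmap[max(0, y - 1):y + 2, max(0, x - 1):x + 2]
            if v >= window.max():
                peaks.append((y, x, float(v)))
    return peaks

hm = np.zeros((8, 8), dtype=np.float32)
hm[2, 3] = 0.9   # one confident key point
hm[6, 6] = 0.05  # below threshold, suppressed
peaks = heatmap_peaks(hm)
```

In a full pipeline one peak list per key point type would then be matched to persons using the part affinity fields.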
Another idea is a model for classifying the key points into a specific body pose. The body pose classification model may take some or all of the key point coordinates as input features. The body pose classifier can be a tree-based ML model, for example XGBoost or Random Forest, that can be trained to classify different body poses, e.g., body normal, body leaning left or body leaning right. The architecture of the XGBoost tree model, as is known, is basically a series of if...then...else statements that are true for a particular body pose defined by the key points.
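The if...then...else structure of such a tree can be illustrated with a toy hand-written rule; the key point names, the nose-versus-shoulder-midpoint feature and the 0.15 threshold are illustrative assumptions rather than a trained model:

```python
def classify_body_pose(keypoints: dict) -> str:
    """Toy illustration of the if...then...else structure of a tree-based
    body pose classifier. Compares the nose x-coordinate against the
    shoulder midpoint in normalized image coordinates."""
    nose_x = keypoints["nose"][0]
    mid_shoulder_x = (keypoints["left_shoulder"][0]
                      + keypoints["right_shoulder"][0]) / 2.0
    offset = nose_x - mid_shoulder_x
    if offset < -0.15:
        return "body leaning left"
    elif offset > 0.15:
        return "body leaning right"
    else:
        return "body normal"

kp = {"nose": (0.50, 0.20),
      "left_shoulder": (0.40, 0.50),
      "right_shoulder": (0.60, 0.50)}
pose = classify_body_pose(kp)
```

A trained XGBoost model learns many such splits automatically instead of a single hand-picked threshold.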
The body pose classification model preferably takes as input data the (2D) key point coordinates. An example model architecture is the XGBoost tree model. The model is trained on the key point coordinates extracted from the COCO dataset. It is also possible to use a different dataset. Each collection of key point coordinates in a single image is labelled with a pose class. Suitable training parameters for the model include max_depth: 5, min_child_weight: 1, learning_rate: 0.25, subsample: 0.85, colsample_bytree: 0.45, gamma: 0.4, reg_alpha: 0.08, n_estimators: 300.
The model outputs a probability score for each possible body pose class. It is preferred that the body pose class with the highest probability score is selected and output as the output body pose class.
To detect the face/head pose, first the facial and head key point coordinates from the output of the key point model are reused to calculate the region where the drivers face is located. This region is cropped from the original image and then used as the input to a CNN classifier. The CNN classifier can be trained to detect different face/head poses, e.g., face looking up, face looking straight, face looking down or face looking sidewards.
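Deriving the face region from the reused facial key points can be sketched as follows; the 0.3 margin factor and the example coordinates are illustrative assumptions to ensure the whole head is included:

```python
import numpy as np

def crop_face_region(image: np.ndarray, face_keypoints, margin: float = 0.3):
    """Compute a bounding box around facial key point coordinates
    (e.g. nose, eyes, ears), enlarge it by a margin, and crop it
    from the original image."""
    xs = [x for x, y in face_keypoints]
    ys = [y for x, y in face_keypoints]
    x0, x1 = min(xs), max(xs)
    y0, y1 = min(ys), max(ys)
    dx = (x1 - x0) * margin
    dy = (y1 - y0) * margin
    h, w = image.shape[:2]
    x0 = max(0, int(x0 - dx)); x1 = min(w, int(x1 + dx))
    y0 = max(0, int(y0 - dy)); y1 = min(h, int(y1 + dy))
    return image[y0:y1, x0:x1]

img = np.zeros((480, 640, 3), dtype=np.uint8)
# eyes, nose and ears in pixel coordinates (illustrative values)
face_kps = [(300, 180), (340, 180), (320, 200), (280, 190), (360, 190)]
crop = crop_face_region(img, face_kps)
```

The resulting crop is what the CNN classifier receives, so no separate face detector is needed.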
The face classification model takes as input data the image pixel values of the face/head region that was cropped using the key points. The pixel values may be normalized by (x/127.5)-1.0, i.e., to values from -1 to 1. The face classification model may use MobileNet V2, for example, which is publicly available. The architecture of MobileNet V2, as is known from Sandler et al., "MobileNetV2: Inverted Residuals and Linear Bottlenecks", arXiv:1801.04381v4, has as its basic building block a bottleneck depth-separable convolution with residuals. The architecture contains an initial fully convolutional layer with 32 filters, followed by 19 residual bottleneck layers. Preferably, ReLU6 is used as the non-linearity because of its robustness. The kernel size is chosen to be 3x3. During training, dropout and batch normalization can be utilized.
The model is trained using face images that are (manually) cropped from the COCO dataset. Again, other training sets are possible. An example set of suitable training parameters includes optimizer: Adam, learning rate: 0.00005, mini batch size: 16, epochs: 25. The model outputs a probability score for each possible face/head pose class. The face/head pose class with the highest probability score may be selected to be output as the output face/head pose class.
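The final class selection described above amounts to a softmax over the classifier outputs followed by an argmax; the class names come from the text, while the example logit values are illustrative:

```python
import numpy as np

FACE_POSE_CLASSES = ["face looking up", "face looking straight",
                     "face looking down", "face looking sidewards"]

def select_pose_class(logits: np.ndarray) -> str:
    """Convert classifier logits into per-class probability scores via
    softmax and return the class with the highest score."""
    e = np.exp(logits - logits.max())  # shift for numerical stability
    probs = e / e.sum()
    return FACE_POSE_CLASSES[int(np.argmax(probs))]

chosen = select_pose_class(np.array([0.1, 2.3, 0.4, -1.0]))
```

The same selection rule applies to the body pose classifier, whose output is likewise a probability score per class.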
The body pose and the face/head pose that were separately determined are then combined to get a more comprehensive estimation of the driver state. For example, if the body is leaning sidewards for a short period of time, but the face is still looking straight, it could still be considered a normal driving position.
With this configuration the output of the key point model is reused to detect the face. Consequently, a separate face detector can be avoided, thereby saving time and computing resources. Furthermore, the classification involves the full face/head instead of only certain parts, such as the eyes. The accuracy of the driver state detection may be increased. In addition, the driver state may be detected more robustly.
A driver monitoring system that runs the system described above may warn the driver in case of not being in a normal driving condition or position. The system may also intervene in the control of the vehicle, e.g., triggering an emergency brake.
BRIEF DESCRIPTION OF THE DRAWINGS
Embodiments of the invention are described in more detail with reference to the accompanying schematic drawings.
The only Fig. depicts an embodiment of an in-cabin monitoring system of a vehicle.
DETAILED DESCRIPTION OF EMBODIMENT
Referring to the Fig., an in-cabin monitoring system 10 for a vehicle is depicted. The monitoring system 10 is generally configured to monitor the state of a vehicle occupant based on body pose, head/face pose and additional information, such as time duration etc. The monitoring system 10 comprises an imaging device 12. The imaging device 12 is configured to capture image data 14 of a vehicle occupant, e.g., the driver. The imaging device 12 may be an RGB or NIR camera. The imaging device 12 is typically arranged to capture an upper body portion including the head and face of the vehicle occupant.
The monitoring system 10 comprises a key point extraction means 16. The key point extraction means 16 receive the image data 14 captured by the imaging device 12. The key point extraction means 16 include a machine learning model that is trained to determine key point confidence maps and part affinity fields from the image data 14. The machine learning model is preferably configured as a convolutional neural network that is known per se.
Each key point is characterized by key point coordinates, that are indicative of the position of the key point within the image data 14, and a key point label, that is indicative of the body part that the key point represents, such as hands, elbows, shoulders and also facial/head key points for nose, eyes, ears and the like. The key point extraction means 16 is configured to output key point data 18 that includes the respective key point coordinates and the corresponding key point label for each key point.
The monitoring system 10 comprises an image data cropping means 20. The image data cropping means 20 receives the image data 14 captured by the imaging device 12 and the key point data 18 that was extracted by the key point extraction means 16. The image data cropping means 20 determines a face/head region in the image data 14 based on the key point data 18, wherein the face/head region contains the face of the vehicle occupant. The image data cropping means 20 crops the face/head region out of the captured image data 14 in order to obtain cropped image data 22.
The monitoring system 10 comprises a face/head pose classification means 24. The face/head pose classification means 24 receive the cropped image data 22. The face/head pose classification means 24 includes a machine learning model that is trained to classify the face/head pose of the vehicle occupant by determining a probability score for each member of a predetermined set of face/head pose classes. The face/head pose classification means 24 selects the face/head pose with the highest probability score as the output face/head pose class 26 that the vehicle occupant had at the time of capturing the image data 14.
The monitoring system 10 comprises a body pose classification means 28. The body pose classification means 28 processes only the key point data 18 that was determined by the key point extraction means 16, i.e., no image data is processed. The body pose classification means 28 includes a tree-based machine learning model that is configured to classify the body pose of the vehicle occupant at the time of capturing the image data 14 and to output it as the output body pose class 30.
The monitoring system 10 comprises a control means 32. The control means 32 receive the output face/head pose class 26 and the output body pose class 30. The control means 32 may determine, for how long a specific face/head pose class 26 and/or body pose class 30 is exhibited by the vehicle occupant. The control means 32 may include a database of combinations of pose classes 26, 30 and associated time durations that are considered an abnormal or hazardous state of the vehicle occupant. The control means 32 evaluates the pose classes 26, 30 that are received with the stored combinations and performs a predetermined action, when the control means 32 determines that an abnormal or hazardous state is present. The control means 32 may issue a control signal that causes the vehicle to issue a warning, e.g., for the driver or other vehicle occupants using the interior signaling devices, and/or other traffic participants, e.g., by using the vehicle exterior lighting. If the state is severe enough, e.g., the control means 32 determines that the driver is incapacitated, the control means 32 may issue a control signal that causes the vehicle to perform an (emergency) braking procedure and/or, if the equipment allows, guiding the vehicle towards a safe position, e.g., near the curb or hard shoulder.
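The duration-based evaluation performed by the control means 32 can be sketched as a small state tracker; the specific pose combinations, duration thresholds and signal names below are illustrative assumptions, not values from the patent:

```python
# Illustrative combinations of pose classes and duration thresholds
# (in seconds) that are treated as hazardous.
HAZARDOUS_AFTER = {
    ("face looking down", "body normal"): 2.0,
    ("face looking sidewards", "body leaning left"): 1.0,
}

class ControlMeans:
    """Sketch of the control means: track how long a pose-class
    combination is exhibited and emit a control signal once the
    associated duration threshold is exceeded."""

    def __init__(self):
        self._current = None   # currently observed (face, body) combination
        self._since = 0.0      # timestamp when that combination first appeared

    def update(self, face_pose: str, body_pose: str, now: float) -> str:
        combo = (face_pose, body_pose)
        if combo != self._current:
            self._current, self._since = combo, now  # combination changed
        duration = now - self._since
        limit = HAZARDOUS_AFTER.get(combo)
        if limit is not None and duration >= limit:
            return "warn_driver"  # could escalate to emergency braking
        return "no_action"

ctrl = ControlMeans()
first = ctrl.update("face looking down", "body normal", now=0.0)
second = ctrl.update("face looking down", "body normal", now=2.5)
```

A real system would escalate through several signal levels (interior warning, exterior signaling, safe stop, emergency braking) as described above.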
With this monitoring system 10, the state of the driver or other occupant of a vehicle can be determined based on his body and face/head pose. The body and head poses are extracted from an image that is processed by multiple machine learning models. A first model detects body key points of the driver inside the image. A second model classifies these key points into a specific body pose. A CNN may be used and trained to output key point confidence maps and part affinity fields. In a postprocessing step, the confidence maps and part affinity fields can be used to calculate the body key point coordinates, preferably relative to the input image size. Further, the output of the key point model is key point data that includes the coordinates of certain body key points, for instance hands, elbows, shoulders, and also facial/head key points for nose, eyes, ears and the like. The body pose classification model takes all key point coordinates as input features. This may be a tree-based ML model, for instance XGBoost or a Random Forest, that can be trained to classify different body poses, for instance body normal, body leaning left or body leaning right.
Further, to detect the face/head pose, the facial and head key point coordinates from the output of the key point model are reused to calculate the region of the vehicle occupant's face. The calculated area is then cropped from the original image and used as the input to a CNN classifier. The CNN classifier can be trained to detect different face/head poses, for instance face looking up, face looking straight, face looking down or face looking sidewards. The body pose and the face/head pose can be combined to get a more comprehensive estimation of the driver state. This is done within the control means 32. For instance, if the body is leaning sidewards for a short period of time, but the face is still looking straight, it could still be considered a normal driving position. Therefore, the present approach reuses the output of the key point model to detect the face instead of applying a separate face detector, which saves time and other computing resources.
In order to improve in-cabin monitoring systems for vehicles, the invention proposes a computer-implemented method for determining a face/head pose class of a vehicle occupant from unlabeled image data (14). Initially, image data (14) that at least partially include the vehicle occupant are captured. Key point data (18) are extracted from the image data (14), wherein the key point data (18) include at least one key point, that is indicative of a specific body part of the vehicle occupant. From the key point data (18), a face/head region within the image data (14) is determined, wherein the face/head region includes a face portion and/or head portion of the vehicle occupant. The original image data (14) are cropped in order to obtain cropped image data (22) that only include the face/head region. A face/head pose class is determined by classifying only the cropped image data (22) into one output face/head pose class (26) of a predetermined set of face/head pose classes.
REFERENCE SIGNS
10 monitoring system
12 imaging device
14 image data
16 key point extraction means
18 key point data
20 image data cropping means
22 cropped image data
24 face/head pose classification means
26 output face/head pose class
28 body pose classification means
30 output body pose class
32 control means
Claims (15)
- CLAIMS1. A computer-implemented method for determining a face/head pose class of a vehicle occupant from unlabeled image data (14) that at least partially includes the vehicle occupant, the method comprising: a) capturing image data (14) that at least partially include the vehicle occupant; b) extracting key point data (18) from the image data (14), wherein the key point data (18) include at least one key point, that is indicative of a specific body part of the vehicle occupant, and key point coordinates for each key point; c) determining from the key point data (18) obtained in step b) a face/head region within the image data (14), wherein the face/head region includes a face portion and/or head portion of the vehicle occupant, and cropping the face/head region from the image data (14) in order to obtain cropped image data (22) that only include the face/head region; d) determining a face/head pose class by classifying the cropped image data (22) into one output face/head pose class (26) of a predetermined set of face/head pose classes.
- 2. The method according to claim 1, characterized in that the vehicle occupant is a driver of the vehicle.
- 3. The method according to any of the preceding claims, characterized in that in step b) the key point data (18) is extracted by a machine learning model that includes a convolutional neural network.
- 4. The method according to any of the preceding claims, characterized in that in step b) each extracted key point is labelled with a body part label chosen from a group comprising or consisting of hands, elbows, shoulders, nose, eyes, ears, mouth, chin.
- 5. The method according to any of the preceding claims, characterized in that in step d) the face/head pose class is determined by a machine learning model that includes an artificial neural network that is trained to determine a probability score for each face/head pose class of the predetermined set, and the face/head pose class with the highest probability is selected as the output face/head pose class (26).
- 6. The method according to any of the preceding claims, characterized by the step: e) determining a body pose class by classifying the key point data (18) into one output body pose class (30) of a predetermined set of body pose classes.
- 7. The method according to claim 6, characterized in that in step e) the body pose class is determined by a tree-based machine learning model that is configured to determine the output body pose class (30) based only on the key point data (18).
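Claim 7's body pose classifier operates only on the key point data, using a tree-based model. A toy stand-in with hand-written split rules illustrates the decision-tree idea; a real system would learn the tree (or an ensemble of trees) from labelled key point data, and every label, class name, and threshold below is invented for illustration:

```python
def body_pose_class(key_points):
    """Toy stand-in for the claimed tree-based body pose classifier.

    key_points: dict mapping a body-part label -> (x, y) pixel coordinates.
    The split rules mimic a depth-2 decision tree; in practice the rules
    and thresholds would be learned from training data, not hand-coded.
    """
    ls = key_points.get("left_shoulder")
    rs = key_points.get("right_shoulder")
    lw = key_points.get("left_wrist")
    if ls is None or rs is None:
        return "unknown"            # not enough key points to classify
    tilt = ls[1] - rs[1]            # vertical offset between the shoulders
    if abs(tilt) > 30:              # first split: strong shoulder tilt
        return "leaning"
    if lw is not None and lw[1] < min(ls[1], rs[1]):
        return "hand_raised"        # second split: wrist above shoulder line
    return "upright"
```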
- 8. An in-cabin monitoring method for monitoring at least one vehicle occupant within a vehicle, the method comprising:
a) performing a method according to any of the preceding claims;
b) evaluating the output face/head pose class (26) and optionally the output body pose class (30), and generating a control signal for the vehicle based on the evaluation of the respective output pose class.
- 9. The method according to claim 8, characterized in that in step b) the evaluation includes a time measurement of how long a specific output face/head pose class (26) and optionally output body pose class (30) are displayed by the vehicle occupant, and in that the control signal is also generated based on the time measurement.
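The time measurement of claim 9 amounts to tracking how long the current output pose class has persisted and gating the control signal on a dwell threshold. A minimal sketch, with an assumed threshold and assumed alert-worthy class names:

```python
class PoseDwellTimer:
    """Track how long the same output pose class has been displayed and
    raise a control signal once a dwell threshold is exceeded (claim 9).
    The 2-second threshold and the alert class names are assumptions."""

    def __init__(self, alert_after_s=2.0, alert_classes=("down", "away")):
        self.alert_after_s = alert_after_s
        self.alert_classes = set(alert_classes)
        self._current = None   # pose class currently being displayed
        self._since = None     # timestamp when that class first appeared

    def update(self, pose_class, timestamp_s):
        """Feed one classification result; return (dwell_time, alert)."""
        if pose_class != self._current:
            # Class changed: restart the time measurement.
            self._current = pose_class
            self._since = timestamp_s
        dwell = timestamp_s - self._since
        alert = pose_class in self.alert_classes and dwell >= self.alert_after_s
        return dwell, alert
```

A control unit would call `update()` once per classified frame and translate `alert=True` into the vehicle control signal (e.g. a driver warning).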
- 10. A computer program comprising instructions which, when the program is executed by a computer, cause the computer to carry out the method of any of the preceding claims.
- 11. A computer-readable medium having stored thereon the computer program according to claim 10, or a data carrier signal carrying the computer program according to claim 10.
- 12. A classification device that is configured for determining an output face/head pose class (26) of a vehicle occupant from unlabeled image data (14) that at least partially include the vehicle occupant, the device comprising:
a) an imaging device (12) configured for capturing image data (14) that at least partially include the vehicle occupant;
b) key point extraction means (16) that are configured for extracting key point data (18) from the image data (14), wherein the key point data (18) include at least one key point that is indicative of a specific body part of the vehicle occupant, and key point coordinates for each key point;
c) image data cropping means (20) that are configured for determining from the key point data (18) a face/head region within the image data (14), wherein the face/head region includes a face portion and/or head portion of the vehicle occupant, and for cropping the face/head region from the image data (14) in order to obtain cropped image data (22) that only include the face/head region;
d) face/head pose classification means (24) configured for determining a face/head pose class by classifying the cropped image data (22) into one output face/head pose class (26) of a predetermined set of face/head pose classes.
- 13. The device according to claim 12, characterized by e) body pose classification means (28) configured for determining a body pose class by classifying the key point data (18) into one output body pose class (30) of a predetermined set of body pose classes.
- 14. An in-cabin vehicle monitoring system (10) configured for monitoring at least one vehicle occupant within a vehicle, the system comprising:
a) means for performing a method according to any of claims 1 to 7;
b) control means (32) configured for evaluating the output face/head pose class (26) and optionally the output body pose class (30), and for generating a control signal for the vehicle based on the evaluation of the respective output pose class.
- 15. The system according to claim 14, characterized in that the control means (32) are configured for performing a time measurement of how long a specific output face/head pose class (26) and optionally output body pose class (30) are displayed by the vehicle occupant, and for generating the control signal also based on the time measurement.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
GB2212345.9A GB2621863A (en) | 2022-08-25 | 2022-08-25 | Pose classification and in-cabin monitoring methods and associated systems |
PCT/EP2023/068484 WO2024041790A1 (en) | 2022-08-25 | 2023-07-05 | Pose classification and in-cabin monitoring methods and associated systems |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
GB2212345.9A GB2621863A (en) | 2022-08-25 | 2022-08-25 | Pose classification and in-cabin monitoring methods and associated systems |
Publications (2)
Publication Number | Publication Date |
---|---|
GB202212345D0 GB202212345D0 (en) | 2022-10-12 |
GB2621863A true GB2621863A (en) | 2024-02-28 |
Family
ID=83931667
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
GB2212345.9A Pending GB2621863A (en) | 2022-08-25 | 2022-08-25 | Pose classification and in-cabin monitoring methods and associated systems |
Country Status (2)
Country | Link |
---|---|
GB (1) | GB2621863A (en) |
WO (1) | WO2024041790A1 (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111259719A (en) * | 2019-10-28 | 2020-06-09 | 浙江零跑科技有限公司 | Cab scene analysis method based on multi-view infrared vision system |
US20200218883A1 (en) * | 2017-12-25 | 2020-07-09 | Beijing Sensetime Technology Development Co., Ltd. | Face pose analysis method, electronic device, and storage medium |
CN111616718A (en) * | 2020-07-30 | 2020-09-04 | 苏州清研微视电子科技有限公司 | Method and system for detecting fatigue state of driver based on attitude characteristics |
CN113128295A (en) * | 2019-12-31 | 2021-07-16 | 湖北亿咖通科技有限公司 | Method and device for identifying dangerous driving state of vehicle driver |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10089543B2 (en) | 2016-07-29 | 2018-10-02 | Honda Motor Co., Ltd. | System and method for detecting distraction and a downward vertical head pose in a vehicle |
US10713948B1 (en) * | 2019-01-31 | 2020-07-14 | StradVision, Inc. | Method and device for alerting abnormal driver situation detected by using humans' status recognition via V2V connection |
JP7415469B2 (en) * | 2019-11-15 | 2024-01-17 | 株式会社アイシン | Physique estimation device and posture estimation device |
- 2022-08-25: GB application GB2212345.9A, publication GB2621863A (en), status: active, Pending
- 2023-07-05: PCT application PCT/EP2023/068484, publication WO2024041790A1 (en), status: unknown
Also Published As
Publication number | Publication date |
---|---|
GB202212345D0 (en) | 2022-10-12 |
WO2024041790A1 (en) | 2024-02-29 |
Similar Documents
Publication | Title |
---|---|
CN108960065B (en) | Driving behavior detection method based on vision |
CN111439170B (en) | Child state detection method and device, electronic equipment and storage medium |
JP2021510225A | Behavior recognition method using video tube |
CN111434553B (en) | Brake system, method and device, and fatigue driving model training method and device |
Sathasivam et al. | Drowsiness detection system using eye aspect ratio technique |
Choi et al. | Driver drowsiness detection based on multimodal using fusion of visual-feature and bio-signal |
Garg | Drowsiness detection of a driver using conventional computer vision application |
Yan et al. | Recognizing driver inattention by convolutional neural networks |
CN114092922A | Driver emotion recognition and behavior intervention method based on specificity |
JP2020042785A | Method, apparatus, device and storage medium for identifying passenger state in unmanned vehicle |
CN115937830A | Special vehicle-oriented driver fatigue detection method |
Ribarić et al. | A neural-network-based system for monitoring driver fatigue |
Khan et al. | Real time eyes tracking and classification for driver fatigue detection |
CN116965781B (en) | Method and system for monitoring vital signs and driving behaviors of driver |
GB2621863A | Pose classification and in-cabin monitoring methods and associated systems |
Koesdwiady et al. | Driver inattention detection system: A PSO-based multiview classification approach |
Tarba et al. | The driver's attention level |
KR20210105141A | Method for monitering number of passengers in vehicle using camera |
Berri et al. | A 3D vision system for detecting use of mobile phones while driving |
Evstafev et al. | Controlling driver behaviour in ADAS with emotions recognition system |
Thanh et al. | A driver drowsiness and distraction warning system based on raspberry Pi 3 Kit |
Babu et al. | Comparative Analysis of Drowsiness Detection Using Deep Learning Techniques |
WO2022025088A1 | Vehicle safety support system |
Srivastava | Driver's drowsiness identification using eye aspect ratio with adaptive thresholding |
Vinodhini et al. | A behavioral approach to detect somnolence of CAB drivers using convolutional neural network |