CN116612510A - Biometric task network
- Publication number: CN116612510A (application number CN202310104742.8A)
- Authority: CN (China)
- Prior art keywords: neural network, living, output, loss function, task
- Prior art date
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06V40/168—Recognition of biometric, human-related or animal-related patterns in image or video data; human faces; feature extraction; face representation
- G06N3/08—Computing arrangements based on biological models; neural networks; learning methods
- G06V10/25—Arrangements for image or video recognition or understanding; image preprocessing; determination of region of interest [ROI] or a volume of interest [VOI]
- G06V10/56—Arrangements for image or video recognition or understanding; extraction of image or video features relating to colour
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning; using neural networks
- G06V40/23—Recognition of biometric, human-related or animal-related patterns in image or video data; movements or behaviour; recognition of whole body movements, e.g. for sport training
Abstract
The present disclosure provides a "biometric task network". A living prediction output from a living biometric analysis task determined by a deep neural network based on an image provided from an image sensor is provided, wherein the living biometric analysis is performed in the deep neural network, the deep neural network comprising a common feature extraction neural network and a plurality of task-specific neural networks, the plurality of task-specific neural networks comprising a face detection neural network, a body pose neural network, and a living neural network comprising a region of interest (ROI) detection neural network and a texture analysis neural network, to determine the living biometric analysis task by inputting the image to the common feature extraction neural network to determine latent variables. The latent variables may be input to the face detection neural network and the living neural network. An output from the face detection neural network may be input to the ROI detection neural network.
Description
Cross Reference to Related Applications
This patent application claims priority from U.S. provisional patent application No. 63/310,401, filed on 2.15 of 2022, which application is hereby incorporated by reference in its entirety.
Technical Field
The present disclosure relates to a biometric task network in a vehicle.
Background
The image may be acquired by a sensor and processed using a computer to determine data about objects in the environment surrounding the system. The operation of the sensing system may include acquiring accurate and timely data about objects in the system environment. The computer may acquire images from one or more image sensors, which may be processed to determine data about the object. The computer may use data extracted from the image of the object to operate systems, including vehicles, robots, security systems, and/or object tracking systems.
Disclosure of Invention
Biometric analysis may be implemented in a computer to determine data about objects (e.g., potential users) in or around a system or machine, such as a vehicle. Based on the data determined from the biometric analysis, the vehicle may be operated, for example. The biometric analysis herein means measuring or calculating data about the user based on physical characteristics of the user. For example, a computing device in a vehicle or a traffic infrastructure system may be programmed to obtain one or more images from one or more sensors included in the vehicle or the traffic infrastructure system and grant a user permission to operate the vehicle based on biometric data determined from the images. Such grant of permission is referred to herein as biometric identification. Biometric identification means determining the identity of a potential user. The determined user identity may be recorded to track which user is accessing the vehicle and/or compared to a list of authorized users to authenticate the user before permission is granted to the user to operate the vehicle or system. Biometric analysis includes determining one or more physical characteristics, such as user drowsiness, gaze direction, user posture, user living being, and the like. In addition to vehicles, the biometric analysis tasks may also be applied to other machines or systems. For example, computer systems, robotic systems, manufacturing systems, and security systems may require that the acquired images be used to identify potential users prior to granting access to the system or secure area.
Advantageously, the techniques described herein may enhance the ability of computing devices in traffic infrastructure systems and/or vehicles to perform biometric analysis based on the recognition that facial biometric algorithms (such as facial feature recognition) include redundant tasks across different applications, for example, as discussed further below. Furthermore, some facial biometric algorithms have sparse or limited training data sets. The techniques described herein include a multi-task network comprising a common feature extraction neural network and a plurality of biometric analysis task neural networks. The deep neural network is configured to include the common feature extraction neural network as a "backbone" and a plurality of biometric analysis task neural networks that receive as input a set of common latent variables generated by the common feature extraction neural network. The deep neural network includes a plurality of expert pooling neural networks that enhance training of the deep neural network by sharing results among a plurality of biometric analysis tasks.
Disclosed herein is a method comprising providing a living prediction output from a living biometric analysis task determined by a deep neural network based on an image provided from an image sensor, wherein the living biometric analysis is performed in the deep neural network, the deep neural network comprising a common feature extraction neural network and a plurality of task-specific neural networks, the plurality of task-specific neural networks comprising a face detection neural network, a body pose neural network, and a living neural network comprising a region of interest (ROI) detection neural network and a texture analysis neural network, to determine the living biometric analysis task by inputting the image to the common feature extraction neural network to determine latent variables. The latent variable may be input to the face detection neural network and the living neural network. An output from the face detection neural network may be input to the ROI detection neural network. The output from the ROI detection neural network may be input to the texture analysis neural network. The output from the texture analysis neural network is input to a living body expert pooled neural network to determine the living body prediction, and the living body prediction may be output. The deep neural network may be trained by combining output from the texture analysis neural network and output from the body pose neural network in the living expert pooled neural network to determine the living predictions. The device may be operated based on a living prediction output from the deep neural network. The apparatus may be operated based on the living body prediction, including determining user authentication. The common feature extraction neural network may include a plurality of convolutional layers.
The plurality of task-specific neural networks may include a plurality of fully connected layers. Outputs from the face detection neural network and the body pose neural network may be input to a SoftMax function. The output from the living expert pooled neural network may be input to a SoftMax function. The deep neural network may be trained by: determining a first loss function based on outputs from the face detection neural network and the body pose neural network; determining a second loss function based on output from the living expert pooled neural network; determining a joint loss function based on combining the first loss function and the second loss function; and back-propagating the joint loss function through the deep neural network to determine deep neural network weights. The deep neural network is trained by processing a training dataset comprising images of ground truth multiple times and determining weights based on minimizing the joint loss function. During training, one or more outputs from the plurality of task-specific neural networks may be set to zero. The weights included in the plurality of task-specific neural networks may be frozen during training. The deep neural network may be trained based on a loss function determined from sparse classification cross entropy statistics. The deep neural network may be trained based on a loss function determined from mean square error statistics.
A computer readable medium storing program instructions for performing some or all of the above method steps is also disclosed. Also disclosed is a computer programmed to perform some or all of the above method steps, the computer comprising a computer device programmed to determine a living prediction output from a living biometric analysis task based on an image provided from an image sensor, wherein the living biometric analysis is performed in a deep neural network comprising a common feature extraction neural network and a plurality of task-specific neural networks comprising a face detection neural network, a body pose neural network, and a living neural network comprising a region of interest (ROI) detection neural network and a texture analysis neural network, to determine the living biometric analysis task by inputting the image to the common feature extraction neural network to determine a latent variable. The latent variable may be input to the face detection neural network and the living neural network. An output from the face detection neural network may be input to the ROI detection neural network. The output from the ROI detection neural network may be input to the texture analysis neural network. The output from the texture analysis neural network is input to a living body expert pooled neural network to determine the living body prediction, and the living body prediction may be output. The deep neural network may be trained by combining output from the texture analysis neural network and output from the body pose neural network in the living expert pooled neural network to determine the living predictions. The device may be operated based on a living prediction output from the deep neural network. The apparatus may be operated based on the living body prediction, including determining user authentication. The common feature extraction neural network may include a plurality of convolutional layers.
The instructions may include further instructions, wherein the plurality of task-specific neural networks may include a plurality of fully connected layers. Outputs from the face detection neural network and the body pose neural network may be input to a SoftMax function. The output from the living expert pooled neural network may be input to a SoftMax function. The deep neural network may be trained by: determining a first loss function based on outputs from the face detection neural network and the body pose neural network; determining a second loss function based on output from the living expert pooled neural network; determining a joint loss function based on combining the first loss function and the second loss function; and back-propagating the joint loss function through the deep neural network to determine deep neural network weights. The deep neural network is trained by processing a training dataset comprising images of ground truth multiple times and determining weights based on minimizing the joint loss function. During training, one or more outputs from the plurality of task-specific neural networks may be set to zero. The weights included in the plurality of task-specific neural networks may be frozen during training. The deep neural network may be trained based on a loss function determined from sparse classification cross entropy statistics. The deep neural network may be trained based on a loss function determined from mean square error statistics.
Drawings
Fig. 1 is a block diagram of an exemplary communication infrastructure system.
Fig. 2 is a diagram of an exemplary biometric image.
FIG. 3 is a diagram of an exemplary biometric deep neural network including expert pooling.
Fig. 4 is a diagram of an exemplary multi-tasking biometric deep neural network including memory.
FIG. 5 is a flow chart of an exemplary process for training a deep neural network to perform a biometric analysis task.
FIG. 6 is a flow chart of an exemplary process for authenticating a user with the deep neural network.
Detailed Description
Fig. 1 is a diagram of a sensing system 100 that may include a traffic infrastructure node 105 that includes a server computer 120 and a stationary sensor 122. The sensing system 100 includes a vehicle 110 that is operable in an autonomous ("autonomous" itself means "fully autonomous" in the present disclosure) mode, a semi-autonomous mode, and an occupant driving (also referred to as non-autonomous) mode. The computing device 115 of one or more vehicles 110 may receive data regarding the operation of the vehicle 110 from the sensors 116. Computing device 115 may operate vehicle 110 in an autonomous mode, a semi-autonomous mode, or a non-autonomous mode.
The computing device 115 includes a processor and memory such as are known. Further, the memory includes one or more forms of computer-readable media and stores instructions executable by the processor to perform operations including as disclosed herein. For example, the computing device 115 may include programming to operate one or more of vehicle braking, propulsion (i.e., controlling acceleration of the vehicle 110 by controlling one or more of an internal combustion engine, an electric motor, a hybrid engine, etc.), steering, climate control, interior and/or exterior lights, etc., and determine whether and when the computing device 115 (rather than a human operator) is controlling such operations.
The computing device 115 may include, or be communicatively coupled to, more than one computing device (i.e., a controller included in the vehicle 110 for monitoring and/or controlling various vehicle components, etc. (i.e., the powertrain controller 112, the brake controller 113, the steering controller 114, etc.)), i.e., via a vehicle communication bus as described further below. The computing device 115 is typically arranged for communication over a vehicle communication network (i.e., including a bus in the vehicle 110, such as a Controller Area Network (CAN), etc.); the network of vehicles 110 may additionally or alternatively include, for example, known wired or wireless communication mechanisms, i.e., ethernet or other communication protocols.
The computing device 115 may transmit and/or receive messages to and/or from various devices in the vehicle (i.e., controllers, actuators, sensors (including sensor 116), etc.) via a vehicle network. Alternatively or additionally, where computing device 115 actually includes multiple devices, a vehicle communication network may be used to communicate between devices represented in this disclosure as computing device 115. Further, as mentioned below, various controllers or sensing elements (such as sensors 116) may provide data to the computing device 115 via a vehicle communication network.
In addition, the computing device 115 may be configured to communicate with a remote server computer 120 (i.e., a cloud server) via a network 130 through a vehicle-to-infrastructure (V-to-I) interface 111, which, as described below, includes hardware, firmware, and software that permit the computing device 115 to communicate with the remote server computer 120 via a network 130 such as the wireless internet (Wi-Fi) or a cellular network. Accordingly, the V-to-I interface 111 may include processors, memory, and transceivers configured to utilize various wired and/or wireless networking technologies (i.e., cellular, Bluetooth, and wired and/or wireless packet networks). The computing device 115 may be configured to communicate with other vehicles 110 over the V-to-I interface 111 using a vehicle-to-vehicle (V-to-V) network (i.e., according to Dedicated Short Range Communications (DSRC) and/or the like), i.e., formed on a mobile ad hoc network basis between neighboring vehicles 110 or formed over an infrastructure-based network. The computing device 115 also includes non-volatile memory such as is known. The computing device 115 may record data by storing the data in non-volatile memory for later retrieval and transmission to the server computer 120 or a user mobile device 160 via the vehicle communication network and the vehicle-to-infrastructure (V-to-I) interface 111.
As already mentioned, programming for operating one or more vehicle 110 components (i.e., braking, steering, propulsion, etc.) without human operator intervention is typically included in instructions stored in memory and executable by a processor of computing device 115. Using data received in computing device 115 (i.e., sensor data from sensors 116, data of server computer 120, etc.), computing device 115 may make various determinations and/or control various vehicle 110 components and/or operations to operate vehicle 110 without a driver. For example, the computing device 115 may include programming to adjust vehicle 110 operational behaviors (i.e., physical manifestations of vehicle 110 operation), such as speed, acceleration, deceleration, steering, etc., as well as strategic behaviors (i.e., generally controlling operational behaviors in a manner intended to achieve efficient traversal of a route), such as distance between vehicles and/or amount of time between vehicles, lane changes, minimum clearance between vehicles, minimum left turn crossing path, arrival time to a particular location, and minimum time from arrival to intersection (no signal light) crossing.
The term controller as used herein includes computing devices that are typically programmed to monitor and/or control specific vehicle subsystems. Examples include a powertrain controller 112, a brake controller 113, and a steering controller 114. The controller may be, for example, a known Electronic Control Unit (ECU), possibly including additional programming as described herein. The controller is communicatively connected to the computing device 115 and receives instructions from the computing device to actuate the subsystems according to the instructions. For example, brake controller 113 may receive instructions from computing device 115 to operate brakes of vehicle 110.
The one or more controllers 112, 113, 114 for the vehicle 110 may include known Electronic Control Units (ECUs) or the like, including, as non-limiting examples, one or more powertrain controllers 112, one or more brake controllers 113, and one or more steering controllers 114. Each of the controllers 112, 113, 114 may include a respective processor and memory and one or more actuators. The controllers 112, 113, 114 may be programmed and connected to a vehicle 110 communication bus, such as a Controller Area Network (CAN) bus or a Local Interconnect Network (LIN) bus, to receive instructions from the computing device 115 and to control actuators based on the instructions.
The sensors 116 may include a variety of devices known to provide data via a vehicle communication bus. For example, a radar fixed to a front bumper (not shown) of the vehicle 110 may provide a distance from the vehicle 110 to a next vehicle in front of the vehicle 110, or a Global Positioning System (GPS) sensor provided in the vehicle 110 may provide geographic coordinates of the vehicle 110. For example, distances provided by radar and/or other sensors 116 and/or geographic coordinates provided by GPS sensors may be used by computing device 115 to autonomously or semi-autonomously operate vehicle 110.
The vehicle 110 is typically a ground-based vehicle 110 (i.e., passenger vehicle, pickup truck, etc.) capable of autonomous and/or semi-autonomous operation and having three or more wheels. The vehicle 110 includes one or more sensors 116, a V-to-I interface 111, a computing device 115, and one or more controllers 112, 113, 114. The sensors 116 may collect data related to the vehicle 110 and the operating environment of the vehicle 110. By way of example and not limitation, the sensor 116 may include a height gauge, a camera, a LIDAR, a radar, an ultrasonic sensor, an infrared sensor, a pressure sensor, an accelerometer, a gyroscope, a temperature sensor, a pressure sensor, a hall sensor, an optical sensor, a voltage sensor, a current sensor, a mechanical sensor (such as a switch), and the like. The sensor 116 may be used to sense the operating environment of the vehicle 110, i.e., the sensor 116 may detect phenomena such as weather conditions (rain, outside ambient temperature, etc.), road grade, road location (i.e., using road edges, lane markings, etc.), or the location of a target object (such as a nearby vehicle 110). The sensors 116 may also be used to collect data, including dynamic vehicle 110 data related to the operation of the vehicle 110, such as speed, yaw rate, steering angle, engine speed, brake pressure, oil pressure, power level applied to the controllers 112, 113, 114 in the vehicle 110, connectivity between components, and accurate and timely execution of components of the vehicle 110.
The vehicle may be equipped to operate in both an autonomous mode and an occupant driving mode. Semi-autonomous mode or fully autonomous mode means an operating mode in which the vehicle may be driven partially or fully by a computing device that is part of a system having sensors and a controller. The vehicle may be occupied or unoccupied, but in either case, the vehicle may be driven partially or fully without occupant assistance. For purposes of this disclosure, autonomous mode is defined as a mode in which each of vehicle propulsion (i.e., via a powertrain including an internal combustion engine and/or an electric motor), braking, and steering is controlled by one or more vehicle computers; in semi-autonomous mode, the vehicle computer controls one or more of vehicle propulsion, braking, and steering. In the non-autonomous mode, none of these are controlled by the computer.
The traffic infrastructure node 105 may include a physical structure such as a tower or other support structure (i.e., pole, box mountable to bridge supports, cell phone tower, roadway sign supports, etc.), on which the infrastructure sensor 122 and server computer 120 may be mounted, stored, and/or housed and powered, etc. For ease of illustration, one traffic infrastructure node 105 is shown in fig. 1, but the system 100 may and likely will include tens, hundreds, or thousands of traffic infrastructure nodes 105. The traffic infrastructure nodes 105 are typically stationary, i.e., fixed to a particular geographic location and cannot be moved from that location. Infrastructure sensors 122 may include one or more sensors, such as those described above for vehicle 110 sensors 116, i.e., lidar, radar, cameras, ultrasonic sensors, and the like. Infrastructure sensors 122 are fixed or stationary. That is, each sensor 122 is mounted to an infrastructure node so as to have a field of view that is substantially non-moving and unchanged.
The server computer 120 generally has features in common with the V-to-I interface 111 and the computing device 115 of the vehicle 110, and thus will not be further described to avoid redundancy. Although not shown for ease of illustration, the traffic infrastructure node 105 also includes a power source, such as a battery, solar cell, and/or connection to a power grid. The server computer 120 of the traffic infrastructure node 105 and/or the computing device 115 of the vehicle 110 may receive the data of the sensors 116, 122 to monitor one or more objects. In the context of the present disclosure, an "object" is a physical (i.e., material) structure detected by the vehicle sensors 116 and/or the infrastructure sensors 122. The object may be a biological object, such as a person. The server computer 120 and/or the computing device 115 may perform a biometric analysis on the object data acquired by the sensors 116/122.
Fig. 2 is a diagram of an image acquisition system 200 included in a vehicle 110 or a traffic infrastructure node 105. The image 200 may be acquired by a camera 204, which may be a sensor 116, 122, having a field of view 208. For example, when a user approaches the vehicle 110, the image 200 may be acquired. The computing device 115 or the server computer 120 in the vehicle 110 may perform a biometric analysis task that authenticates the user and grants the user permission to operate the vehicle 110, i.e., unlock the doors to allow the user to enter the vehicle 110. In addition to biometric identification, a spoof-detection biometric analysis task (such as a live detection) may also be used to determine whether the image data presented for user identification is a real image of a real user, i.e., not a photograph of the user or not a mask of the user.
Image 200 includes user 202. The field of view 208 of the camera 204 is configured to include a portion of the user's body in addition to the user's face. In image 200, user 202 is lifting a photograph of a person's face in an attempt to fool the biometric system into authorizing them to operate the device. Image 200 includes a plurality of points indicated by star 210 that indicate the gesture features determined by the gesture detection biometric analysis task. The posture features may be determined by any suitable body posture biometric analysis task. Body posture features include the position of skeletal joints (including wrist, elbow, shoulder, etc.). When combined with data regarding the location of the detected user's face, the location of the star 210 in the image 200 may indicate that the user is holding something in front of them, in this example a photograph.
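The sketch below illustrates, purely as a hypothetical example and not the disclosure's method, how body pose features could flag this kind of spoof attempt by checking whether detected wrist keypoints fall near the detected face box. The keypoint names, the box format (x1, y1, x2, y2), and the margin are assumptions for illustration.

```python
# Illustrative sketch only: flag a possible "held photograph" when wrist
# keypoints are detected close to the detected face box. Keypoint names,
# box format (x1, y1, x2, y2), and the margin are assumed for illustration.
def wrists_near_face(keypoints, face_box, margin=0.25):
    x1, y1, x2, y2 = face_box
    w, h = x2 - x1, y2 - y1
    # Expand the face box by a margin so hands just outside it still count.
    x1e, y1e = x1 - margin * w, y1 - margin * h
    x2e, y2e = x2 + margin * w, y2 + margin * h
    wrists = [keypoints[name] for name in ("left_wrist", "right_wrist")
              if name in keypoints]
    return any(x1e <= x <= x2e and y1e <= y <= y2e for x, y in wrists)

# Example: a wrist centered just below the face box suggests a lifted photograph.
print(wrists_near_face({"left_wrist": (120, 210)}, face_box=(80, 60, 180, 200)))
```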
Other biometric analysis tasks based on images of the user are known and include drowsiness detection, head pose detection, gaze detection, and emotion detection. Drowsiness detection can generally determine the alertness state of a user by analyzing eyelid position and blink rate. Head pose detection may also determine the alertness and attentiveness of a user to vehicle operation, typically by analyzing the position and orientation of the user's face to detect nodding and head-bowing gestures. Gaze detection may determine the direction in which the user's eyes are looking to determine whether the user is focusing on vehicle operation. Emotion detection may determine an emotional state of the user, e.g., whether or not the user is irritated, to detect distraction that may affect operation of the vehicle.
Fig. 3 is a diagram of a living biometric analysis task system implemented as a Deep Neural Network (DNN) 300. For example, DNN 300 may be implemented in the server computer 120 or the computing device 115. DNN 300 is configured to determine a living body by combining body pose detection with facial feature detection. A common feature extraction or "backbone" CNN 310 extracts common features from the image data and outputs latent variables. The latent variables in DNN 300 represent common facial and bodily features that are output by CNN 310 in response to an input image 302 that includes a human. CNN 310 includes a plurality of convolutional layers 304, 306, 308 that process the image 302 to determine the locations of faces and body parts and to determine features indicative of the position, orientation, and size of body poses and facial features such as hands, arms, head, eyes, nose, and mouth. CNN 310 may also determine facial features such as skin tone, texture, and the presence or absence of facial hair, as well as the presence or absence of objects such as eyeglasses and piercings.
The body and facial features are referred to as common features because they are output as latent variables common to multiple biometric analysis task neural networks, including the body pose neural network 314, the face detection neural network 312, and the living neural network 320. The body pose neural network 314, the face detection neural network 312, and the living neural network 320 include a plurality of fully connected layers that perform biometric analysis tasks such as body pose detection, face detection, and living body determination. The output from the living neural network 320 is passed to a region of interest (ROI) detection neural network 322 and then to the texture analysis neural network 324.
The biometric analysis task neural networks, including the body pose neural network 314, the face detection neural network 312, and the living neural network 320, are trained to output predictions about biometric analysis tasks based on the latent variables output from the common feature extraction CNN 310. The predictions output from the biometric analysis tasks include a living prediction about the user. The living prediction is the probability that the image 302 includes a live user rather than a photograph or mask of the user. The body pose prediction includes an estimate of the positions of the user's arms and hands. The face prediction includes the position and pose of the user's face.
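The data flow described above for fig. 3 can be summarized with a minimal PyTorch sketch. The layer counts, channel widths, and output sizes below are illustrative assumptions rather than the configuration disclosed for DNN 300.

```python
# Minimal PyTorch sketch of the multi-task layout described above (fig. 3).
# Layer counts, channel widths, and head sizes are illustrative assumptions,
# not the configuration disclosed in the patent.
import torch
import torch.nn as nn

class CommonFeatureCNN(nn.Module):
    """Backbone CNN 310: shared convolutional layers producing latent variables."""
    def __init__(self, latent_dim=256):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, latent_dim),
        )

    def forward(self, image):
        return self.features(image)        # latent variables shared by all task heads

def mlp(in_dim, out_dim):
    """Task-specific head: a small stack of fully connected layers."""
    return nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, out_dim))

class BiometricTaskDNN(nn.Module):
    """DNN 300: backbone plus face detection, body pose, and living branches."""
    def __init__(self, latent_dim=256):
        super().__init__()
        self.backbone = CommonFeatureCNN(latent_dim)
        self.face_head = mlp(latent_dim, 4)     # face detection head (illustrative size)
        self.pose_head = mlp(latent_dim, 2)     # body pose head (illustrative size)
        self.living_head = mlp(latent_dim, 64)  # living neural network 320
        self.roi_head = mlp(64 + 4, 64)         # ROI detection 322 (uses face output)
        self.texture_head = mlp(64, 2)          # texture analysis 324
        self.expert_pool = mlp(2 + 2, 2)        # living expert pooling 326

    def forward(self, image):
        z = self.backbone(image)
        face = self.face_head(z)
        pose = self.pose_head(z)
        living = self.living_head(z)
        roi = self.roi_head(torch.cat([living, face], dim=1))
        texture = self.texture_head(roi)
        # Expert pooling fuses the texture-based prediction with the pose prediction.
        liveness = self.expert_pool(torch.cat([texture, pose.softmax(dim=1)], dim=1))
        return face, pose, liveness
```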
DNN 300 may include SoftMax functions 316, 318, 328 on the outputs of the body pose neural network 314, the face detection neural network 312, and the living expert pooling neural network 326, respectively. A SoftMax function converts a vector of K real values into a vector of K values between 0 and 1 that sum to 1. The outputs from the SoftMax functions 316, 318, 328 may be output as biometric analysis task results or used to calculate a loss function. Processing the outputs of the body pose neural network 314, the face detection neural network 312, and the living expert pooling neural network 326 with the SoftMax functions 316, 318, 328 allows the outputs to be combined into a joint loss function for training DNN 300. The loss functions compare the outputs of the SoftMax functions 316, 318, 328 from the body pose neural network 314, the face detection neural network 312, and the living expert pooling neural network 326 with ground truth values to determine how close each biometric analysis task neural network is to the correct result. The SoftMax functions 316, 318, 328 limit the outputs from the body pose neural network 314, the face detection neural network 312, and the living expert pooling neural network 326 to values between 0 and 1. The SoftMax functions 316, 318, 328 may thereby prevent one or more of the outputs from dominating the computation of the joint loss function due to differences in the magnitudes of the outputs. The joint loss function is determined by combining separate loss functions for each of the body pose neural network 314, the face detection neural network 312, and the living expert pooling neural network 326, typically by summing the separate loss functions together.
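A minimal sketch of one way such a joint loss could be assembled is shown below. The use of cross entropy (which applies the SoftMax internally), the classification-style face output, and the default weights are assumptions; mean squared error could be substituted for regression-style outputs, which the disclosure also mentions.

```python
# Sketch of a joint loss built from the three task outputs. Cross entropy
# (which applies the SoftMax internally) keeps each term in a comparable
# range; the weights and the per-task loss choices are assumptions.
import torch.nn.functional as F

def joint_loss(face_logits, pose_logits, liveness_logits,
               face_gt, pose_gt, liveness_gt, weights=(1.0, 1.0, 1.0)):
    l_face = F.cross_entropy(face_logits, face_gt)        # classification-style face output assumed
    l_pose = F.cross_entropy(pose_logits, pose_gt)        # mean squared error could replace these
    l_live = F.cross_entropy(liveness_logits, liveness_gt)  # for regression-style outputs
    w_face, w_pose, w_live = weights
    # Summing the weighted per-task losses yields the joint loss that is
    # back-propagated through the whole network during training.
    return w_face * l_face + w_pose * l_pose + w_live * l_live
```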
Training DNN 300 may include determining a training dataset of images 302 and ground truth for each image 302. Determining the ground truth may include manually examining the images 302 in the training dataset and estimating the correct results expected from the SoftMax functions 316, 318, 328 connected to the body pose neural network 314, the face detection neural network 312, and the living expert pooling neural network 326. During training, an image 302 may be input to DNN 300 multiple times. Each time the image 302 is input, the weights or parameters controlling the operation of the convolutional layers 304, 306, 308 of CNN 310 and the fully connected layers of the body pose neural network 314, face detection neural network 312, ROI detection neural network 322, texture analysis neural network 324, and living expert pooling neural network 326 may be changed, and a joint loss function based on the outputs and the ground truth for each output may be determined. The joint loss function may be back-propagated through DNN 300 to select the weights for each layer that result in the lowest joint loss, which is generally considered to represent the most correct result. Back-propagation is a process for applying the joint loss function to the weights of the DNN 300 layers, starting at the layer closest to the output and proceeding back to the layer closest to the input. By processing the multiple input images 302 and their ground truth in the training dataset multiple times, a set of weights for the layers of DNN 300 that converges to correct results over the entire training dataset may be determined. Selecting a set of optimal weights in this way is referred to as training DNN 300.
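A minimal training-loop sketch follows, under the assumptions of the BiometricTaskDNN and joint_loss sketched above and a hypothetical dataloader yielding images with ground truth for all three tasks; the optimizer, learning rate, and epoch count are also assumptions.

```python
# Minimal training-loop sketch using the model and joint_loss sketched above.
# The dataloader, optimizer, learning rate, and epoch count are assumptions.
import torch

def train(model, dataloader, epochs=10, lr=1e-4):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):                    # process the training set multiple times
        for image, face_gt, pose_gt, live_gt in dataloader:
            face, pose, live = model(image)
            loss = joint_loss(face, pose, live, face_gt, pose_gt, live_gt)
            optimizer.zero_grad()
            loss.backward()                    # back-propagate the joint loss
            optimizer.step()                   # move weights toward the lowest joint loss
    return model
```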
Training DNN 300 using the joint loss function may advantageously compensate for differences in the training dataset for each biometric analysis task. The training quality of a biometric analysis task may depend on the number of available images with the appropriate ground truth for that task. For example, face detection and body pose detection may benefit from the existence of large commercial datasets that include ground truth. Other biometric analysis tasks, such as determining a living body, may require the user to acquire image data and manually estimate ground truth. Advantageously, training DNN 300 using the joint loss function determined as discussed herein may allow training data to be shared between a biometric analysis task with a large training dataset and a biometric analysis task with a smaller training dataset. For example, the individual loss functions may be weighted when combined to form the joint loss function to give more weight to the loss function that corresponds to a larger training dataset.
In examples where one or more of the biometric analysis tasks has only a small amount of ground truth data in the training dataset, training of DNN 300 may be enhanced by branch training isolation. For example, a typical training dataset may include thousands of body pose images with corresponding ground truth, while including only a few hundred images with living ground truth. Branch training isolation sets the output of a biometric analysis task neural network to a null value when a specific image in the training dataset has no ground truth data for that task. Setting the output from the biometric analysis task neural network to a null value also sets the loss function determined for that biometric analysis task neural network to zero. Branch training isolation also freezes the weights included in that biometric analysis task neural network for the image. This allows the biometric analysis task neural network to be available for joint training with the rest of DNN 300 without penalizing a biometric analysis task that has a sparse training dataset. For example, drowsiness detection typically has fewer ground truth images than recognition tasks.
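A hedged sketch of how branch training isolation might be realized is shown below, assuming a per-batch flag indicating whether ground truth is present for a task; the helper names are hypothetical.

```python
# Hedged sketch of branch training isolation. When a training image lacks
# ground truth for a task, the task's loss term is zeroed and the branch's
# weights are frozen so back-propagation leaves that branch unchanged.
import torch

def set_branch_frozen(branch, frozen=True):
    """Freeze or unfreeze the weights of one task-specific branch."""
    for p in branch.parameters():
        p.requires_grad = not frozen

def masked_loss(task_loss, has_ground_truth):
    """Zero the loss contribution of a task that has no ground truth label."""
    return task_loss if has_ground_truth else torch.zeros_like(task_loss)
```

For a batch of body pose images carrying no living labels, for example, calling set_branch_frozen(model.expert_pool) and masked_loss(l_live, False) would keep the sparsely labelled living branch from being penalized by the joint loss.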
DNN 300 combines the output from the face detection SoftMax function 316 with the output from the living neural network 320 in the region of interest (ROI) detection neural network 322. The ROI detection neural network 322 determines which portions of the latent variables output by CNN 310 should be processed to determine the facial features most likely to indicate a living body. For example, skin tones determined in areas of the user's face not covered by facial hair may be output to the texture analysis neural network 324. The texture analysis neural network 324 may determine a living body by processing the facial regions to determine whether they contain the feature textures (e.g., eyes, mouth, and skin) that the texture analysis neural network 324 has been trained to expect.
The living prediction output from the texture analysis neural network 324 is passed to the living expert pooling neural network 326, where it is combined with the body pose prediction determined by the body pose SoftMax function 318. The living expert pooling neural network 326 determines the living prediction by taking the living prediction output by the texture analysis neural network 324 as input and combining it with the body pose prediction output by the body pose SoftMax function 318. The living expert pooling neural network 326 outputs the living prediction to a SoftMax function 328. Living expert pooling allows data to be shared between biometric analysis tasks (such as body pose and face detection) that may have a large training dataset including ground truth and biometric analysis tasks (such as living body determination) that may have a much smaller training dataset with limited ground truth data. Training of DNN 300 may be enhanced by using the living expert pooling neural network 326 to determine a joint loss function that includes ground truth data from only a portion of the biometric analysis tasks included in the training dataset. Despite the lack of ground truth data for one or more of the biometric analysis tasks included in DNN 300, DNN 300 may still be trained using the joint loss function.
The techniques discussed herein use a dynamic loss scheme to reduce destructive training interference. Losses are typically normalized to the range of 0 to 1, based on the complexity of the loss function, using a SoftMax function. The individual biometric analysis task loss functions may be further weighted based on training importance. Early in the training process, tasks such as face detection and body pose detection are prioritized. Once the validation accuracy of face detection and body pose detection improves, as evidenced by the joint loss function beginning to converge to a minimum, the individual loss functions may be re-weighted toward more difficult tasks such as living body determination. After training, DNN 300 may be transmitted to a computing device 115 in a vehicle 110 or a server computer 120 included in the traffic infrastructure node 105 to perform biometric analysis tasks on image data acquired by sensors 116, 122 included in the vehicle 110 or traffic infrastructure node 105. Executing the trained DNN 300 to determine predictions about input data is referred to as inference.
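The dynamic loss scheme could be realized as a simple weight schedule passed to the joint loss; the ramp shape and epoch cutoffs below are assumptions, not values from the disclosure.

```python
# Sketch of a dynamic loss-weighting schedule: face and body pose detection
# are prioritized early, and the living-task weight is raised once the joint
# loss starts to converge. The ramp shape and cutoffs are assumptions.
def loss_weights(epoch, ramp_start=5, ramp_length=10):
    if epoch < ramp_start:
        return (1.0, 1.0, 0.1)                 # (face, pose, living)
    live_w = min(1.0, 0.1 + (epoch - ramp_start) / ramp_length)
    return (1.0, 1.0, live_w)

# Passed as joint_loss(..., weights=loss_weights(epoch)) inside the training loop.
```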
Fig. 4 is a diagram of a living biometric analysis task system implemented as a Deep Neural Network (DNN) 400. For example, DNN 400 may be implemented in the server computer 120 or the computing device 115. DNN 400 includes the same components as DNN 300, except for the memories 402, 404 added to the body pose neural network 314 data path and the living neural network 320 data path. The memories 402, 404 may store output data from a plurality of input images 302. For example, multiple frames of video data from a short video sequence (about one second of video data) may be input to DNN 400 as separate images 302. The memories 402, 404 may store the output results as a 3D stack of temporal data, with the third dimension indicating time. The temporal data may be input to the living expert pooling neural network 326 to determine the living body.
The temporal data may enhance the living body determination used in anti-spoofing systems. Multiple-frame analysis may be used to determine a motion metric that indicates a change in body pose, such as a person lifting a spoof photograph, or that a spoofed input image lacks the texture changes present in images of a live person. Analysis of the motion metric may enhance living body determination by the living expert pooling neural network 326. Training DNN 400 to determine motion metrics requires sequence data, i.e., video sequences annotated with ground truth data, which may not be available for all tasks. To accommodate the lack of sequence data, the memories 402, 404 are designed to work with both individual images 302 and sequences of images 302. In examples where sequence data is not available, the memories 402, 404 pass the data through without modification. In examples where sequence data is available, the sequence data is stored as temporal data and output when the sequence is complete. Processing of the temporal data includes queuing the output data for multiple frames of input images 302, calculating variations in the output data across the multiple frames, and 3D convolution. For example, the living expert pooling neural network 326 may detect that a photograph lacks the normal facial motion, i.e., blinking and mouth movement, included in a video sequence of a human face.
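The memory behavior described above could be sketched as a small buffer that passes single-frame outputs through unchanged and stacks per-frame outputs along a time dimension when a sequence is available; the buffer length and interface are assumptions.

```python
# Sketch of the memory behaviour described above: single-frame outputs pass
# through unchanged, while outputs from a video sequence are queued and
# stacked along a time dimension for the expert pooling step.
from collections import deque
import torch

class TemporalMemory:
    def __init__(self, sequence_length=30):      # ~one second of video (assumed)
        self.frames = deque(maxlen=sequence_length)

    def __call__(self, output, is_sequence=False):
        if not is_sequence:
            return output                         # pass-through for single images
        self.frames.append(output)
        if len(self.frames) < self.frames.maxlen:
            return None                           # sequence not yet complete
        # Stack per-frame outputs into a tensor with a time dimension.
        return torch.stack(list(self.frames), dim=1)
```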
Examples of DNNs 300, 400 may be enhanced by training DNNs 300, 400 as shown in figs. 3 and 4 and then removing the body pose neural network 314 from DNNs 300, 400 at inference time. The output from the body pose neural network 314 still helps train the living expert pooling neural network 326, but the output from the body pose neural network 314 is not included in the living expert pooling neural network 326 at inference time. Determining a living body without input from the body pose neural network 314 to the living expert pooling neural network 326 via the SoftMax function 316 may still provide accurate results at inference time because the living expert pooling neural network 326 was trained using the body pose neural network 314 output. Removing the body pose neural network 314 at inference time may reduce the computational resources required to execute DNNs 300, 400 while maintaining the accuracy of the living prediction output.
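One hedged way to realize inference without the body pose branch, assuming the BiometricTaskDNN sketch given for fig. 3, is to substitute a neutral constant for the pose input to the expert pooling step; this substitution, the neutral value, and the class ordering are assumptions rather than the disclosed mechanism.

```python
# Hedged sketch: run DNN 300 at inference without executing the body pose
# branch, substituting a neutral pose prediction for the expert pooling
# input. This interpretation and the neutral value are assumptions.
import torch

@torch.no_grad()
def liveness_inference(model, image):
    z = model.backbone(image)
    face = model.face_head(z)
    living = model.living_head(z)
    roi = model.roi_head(torch.cat([living, face], dim=1))
    texture = model.texture_head(roi)
    neutral_pose = torch.full((image.shape[0], 2), 0.5, device=image.device)
    liveness = model.expert_pool(torch.cat([texture, neutral_pose], dim=1))
    return liveness.softmax(dim=1)[:, 1]      # class index 1 assumed to mean "live"
```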
Fig. 5 is a flow chart of a process 500 for training DNN 300, which includes the common feature extraction CNN 310 and a plurality of biometric analysis task neural networks including the body pose neural network 314, the face detection neural network 312, and the living expert pooling neural network 326 described with respect to figs. 1-4. The process 500 may be implemented by a processor of the computing device 115 or the server computer 120 executing commands that take image data from the sensors 116, 122 as input and output biometric analysis task predictions. DNN 300 is typically trained on a server computer 120 at a traffic infrastructure node 105 at training time and transmitted to a computing device 115 in a vehicle 110 for operation at inference time. Process 500 includes a plurality of blocks that may be performed in the order shown. Alternatively or additionally, process 500 may include fewer blocks or may include blocks performed in a different order.
The process 500 begins at block 502, where an image 302 is acquired from the training dataset. The image 302 includes ground truth for one or more biometric analysis tasks, as discussed above with respect to fig. 3. In examples where ground truth is not available for one or more biometric analysis tasks, the outputs from the body pose neural network 314, the face detection neural network 312, the ROI detection neural network 322, the texture analysis neural network 324, and the living expert pooling neural network 326 may be set to null values, and the weights for the body pose neural network 314, the face detection neural network 312, the ROI detection neural network 322, the texture analysis neural network 324, and the living expert pooling neural network 326 may be frozen. Freezing a neural network prevents the weights that program the neural network from being updated when the joint loss function determined at block 510 is back-propagated.
At block 504, the image 302 is input to the common feature extraction CNN 310 to determine body and facial features to be output as latent variables, as discussed above with respect to fig. 3.
At block 506, the latent variables are input to the body pose neural network 314, the face detection neural network 312, the ROI detection neural network 322, the texture analysis neural network 324, and the living expert pooling neural network 326, which process the latent variables to determine predictions about the input image 302, as discussed above with respect to fig. 4. At inference time, the predictions may be output to the computing device 115 for operating the vehicle 110 or another device.
At block 508, predictions output from the body pose neural network 314 and the face detection neural network 312 are input to the SoftMax functions 318, 316, respectively, at training time to adjust the output predictions to values between 0 and 1. Adjusting the output predictions allows them to be combined into a joint loss function without one or more output predictions numerically dominating the calculation.
At block 510, the output from the face detection SoftMax function 316 is input to the ROI detection neural network 322 along with the output from the living neural network 320. The output from the ROI detection neural network 322 is passed to the texture analysis neural network 324. The output from the texture analysis neural network 324 is combined with the output from the body pose SoftMax function 318 at the in vivo expert pooled neural network 326. The in-vivo prediction output from the in-vivo expert pooled neural network 326 is input to a SoftMax function 328 to adjust the output prediction to be in the range of 0 to 1.
At block 512, the outputs from SoftMax functions 316, 318, 328 are input to separate loss functions, which are then combined into a joint loss function. As discussed above with respect to fig. 3, individual loss functions may be weighted before they are added to the joint loss function, depending on which biometric analysis task they indicate and at which point in the training process it is.
At block 514, the joint loss function may be counter-propagated through the layers of DNN 300 to determine the optimal weights for the layers of DNN 300. The optimal weights are the weights that result in the output matching best to the ground truth included in the training dataset. Training DNN 300 includes inputting input image 302 multiple times while changing the weights of programming the layers of DNN 300, as discussed above with respect to fig. 3. Training DNN 300 includes selecting weights for layers of DNN 300 that provide a lowest joint loss function for a maximum number of input images 302 in the training dataset.
At block 516, the trained DNN 300 may be output to the computing device 115 or another device in the vehicle 110 for use in determining a living being of the input image 302. After block 516, process 500 ends.
Fig. 6 is a flowchart of a process 600 for performing a biometric analysis task with DNN 300, which includes the common feature extraction CNN 310 and the three biometric analysis task neural networks (including the body pose neural network 314, the face detection neural network 312, and the living expert pooling neural network 326) described with respect to figs. 1-5. The process 600 may be implemented by a processor of the computing device 115 or the server computer 120 executing commands that take image data from the sensors 116, 122 as input and output biometric analysis task predictions. Process 600 includes a plurality of blocks that may be performed in the order shown. Alternatively or additionally, process 600 may include fewer blocks or may include blocks performed in a different order.
After training DNN 300 as discussed above with respect to figs. 3 and 5, the trained DNN 300 may be transmitted to a computing device 115 in a device such as a vehicle 110 for inference. As discussed above with respect to figs. 3 and 5, in this example DNN 300 is used for living body determination. DNN 300 may also be used for living body determination in devices other than the vehicle 110. Living body determination may be combined with biometric identification, where the living body determination is used to confirm that biometric identification is being performed on an image 302 of a living person rather than on, for example, a photograph of a face.
The process 600 begins at block 602, where the computing device 115 acquires an image 302 using a sensor 116, such as the camera 204 included in the vehicle 110. The field of view 208 of the camera 204 is configured to provide an image 302 that includes a portion of the user's body in addition to the user's face. The process 600 may also be implemented in a security system, a robotic guidance system, or a handheld device such as a cellular telephone that seeks to determine the living body status of a potential user prior to granting access to the device.
At block 604, the computing device 115 inputs the image 302 to the common feature extraction CNN310 to determine body and facial features to be output as latent variables.
At block 606, latent variables are input to the body pose neural network 314, the face detection neural network 312, and the living neural network 320, and combined using the ROI detection neural network 322, the texture analysis neural network 324, and the living expert pooling neural network 326 to determine predictions about probabilities of acquiring the user's input image 302 from a living person.
At block 608, predictions regarding biometric identification and the living body are output from DNN 300 to the computing device 115 in the vehicle 110.
At block 610, the computing device 115 may combine the living prediction output from DNN 300 with an identity prediction determined by a biometric identification system. The biometric identification system may output a prediction indicating the probability that the image is of a user whose images were used to train the biometric identification system. The biometric identification probability and the living probability may be multiplied to determine an overall identification score that reflects the probability that the user in the image 302 is an approved user and that the image 302 includes a living person.
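A minimal sketch of this combination, together with the threshold comparison described in the next block, is shown below; the example probabilities and the threshold value are assumptions.

```python
# Sketch of combining the identification probability with the living
# probability and comparing the overall identification score to a threshold.
# The example probabilities and the threshold value are assumptions.
def authenticate(identity_prob, liveness_prob, threshold=0.5):
    overall_score = identity_prob * liveness_prob
    return overall_score > threshold

# Example: a confident identification of a live user passes the test.
print(authenticate(identity_prob=0.93, liveness_prob=0.88))   # 0.8184 > 0.5 -> True
```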
At block 612, computing device 115 tests the authentication score from block 610. If the overall identification score is greater than the user-specified threshold, the user is authenticated and process 600 moves to block 614. The threshold may be determined empirically by testing a plurality of spoofed and authentic input images 302 to determine a distribution of spoofed and non-spoofed overall identification scores and a threshold separating the two distributions. If the authentication score is below the user-specified threshold, the user is not authenticated and the process 600 ends without authenticating the user.
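A hedged sketch of determining the threshold empirically from spoofed and authentic score distributions follows, assuming a simple accuracy sweep; the candidate grid and selection criterion are assumptions, not the disclosure's procedure.

```python
# Hedged sketch of choosing the threshold empirically: sweep candidate values
# and keep the one that best separates genuine from spoofed overall scores.
# The candidate grid and the accuracy criterion are assumptions.
def choose_threshold(genuine_scores, spoof_scores, steps=100):
    best_t, best_acc = 0.0, 0.0
    for i in range(steps + 1):
        t = i / steps
        correct = sum(s > t for s in genuine_scores) + sum(s <= t for s in spoof_scores)
        acc = correct / (len(genuine_scores) + len(spoof_scores))
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

print(choose_threshold([0.81, 0.77, 0.90], [0.12, 0.30, 0.45]))
```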
At block 614, the user has been authenticated and is granted permission to operate the vehicle 110. This may include unlocking the vehicle doors to allow entry into the vehicle 110 and enabling vehicle 110 controls so the user can operate the vehicle 110, thereby granting the user permission to operate various vehicle components (such as climate controls, infotainment, etc.). After block 614, process 600 ends.
Computing devices such as those discussed herein generally each include commands executable by one or more computing devices such as those identified above, and for carrying out blocks or steps of the processes described above. For example, the process blocks discussed above may be embodied as computer-executable commands.
The computer-executable commands may be compiled or interpreted from computer programs created using a variety of programming languages and/or technologies, including, without limitation and either alone or in combination, Java™, C, C++, Python, Julia, Scala, Visual Basic, JavaScript, Perl, HTML, etc. In general, a processor (e.g., a microprocessor) receives commands, e.g., from a memory, a computer-readable medium, etc., and executes these commands, thereby performing one or more processes, including one or more of the processes described herein. Such commands and other data may be stored in files and transmitted using a variety of computer-readable media. A file in a computing device is generally a collection of data stored on a computer-readable medium, such as a storage medium, random access memory, etc.
Computer-readable media (also referred to as processor-readable media) include any non-transitory (i.e., tangible) media that participate in providing data (i.e., instructions) that may be read by a computer (i.e., by a processor of a computer). Such a medium may take many forms, including, but not limited to, non-volatile media and volatile media. Instructions may be transmitted by one or more transmission media, including fiber optics, wires, and wireless communication, including the internal components that make up a system bus coupled to a processor of a computer. Common forms of computer-readable media include, for example, RAM, PROM, EPROM, FLASH-EEPROM, any other memory chip or cartridge, or any other medium from which a computer can read.
Unless explicitly indicated to the contrary herein, all terms used in the claims are intended to be given their ordinary and customary meaning as understood by those skilled in the art. In particular, the use of singular articles such as "a," "an," "the," and the like is to be construed to recite one or more of the indicated elements unless a claim recites an explicit limitation to the contrary.
The term "exemplary" is used herein in the sense of indicating examples, i.e., references to "exemplary widgets" should be read as referring to only examples of widgets.
The adverb "about" of a modifier or result means that the shape, structure, measurement, value, determination, calculation, etc., may deviate from the exactly described geometry, distance, measurement, value, determination, calculation, etc., due to imperfections in materials, machining, manufacturing, sensor measurement, calculation, processing time, communication time, etc.
In the drawings, like reference numerals refer to like elements. Furthermore, some or all of these elements may be changed. With respect to the media, processes, systems, methods, etc., described herein, it should be understood that while the steps or blocks of such processes, etc., have been described as occurring according to a particular ordered sequence, such processes may be practiced by performing the described steps in an order other than that described herein. It should also be understood that certain steps may be performed concurrently, other steps may be added, or certain steps described herein may be omitted. In other words, the description of the processes herein is provided for the purpose of illustrating certain embodiments and should in no way be construed as limiting the claimed invention.
According to the present invention, there is provided a system having: a computer comprising a processor and a memory, the memory comprising instructions executable by the processor to: provide a living body prediction from a living body biometric analysis task determined by a deep neural network based on an image provided from an image sensor, wherein the living body biometric analysis is performed in the deep neural network, the deep neural network comprising a common feature extraction neural network and a plurality of task-specific neural networks, the plurality of task-specific neural networks comprising a face detection neural network, a body pose neural network, and a living body neural network comprising a region of interest (ROI) detection neural network and a texture analysis neural network, to perform the living body biometric analysis task by: inputting the image to the common feature extraction neural network to determine latent variables; inputting the latent variables to the face detection neural network and the living body neural network; inputting an output from the face detection neural network to the ROI detection neural network; inputting an output from the ROI detection neural network to the texture analysis neural network; inputting an output from the texture analysis neural network to a living expert pooled neural network to determine the living body prediction; and outputting the living body prediction.
According to one embodiment, the instructions comprise further instructions for: the deep neural network is trained by combining, in the living expert pooled neural network, output from the texture analysis neural network and output from the body pose neural network to determine the living predictions.
According to one embodiment, the invention also features an apparatus wherein the instructions include instructions for operating the apparatus based on a living prediction output from the deep neural network.
According to one embodiment, operating the device based on the living body prediction includes determining user authentication.
According to one embodiment, the common feature extraction neural network comprises a plurality of convolutional layers.
According to one embodiment, the plurality of task-specific neural networks includes a plurality of fully connected layers.
According to one embodiment, the outputs from the face detection neural network and the body pose neural network are input to a SoftMax function.
According to one embodiment, the output from the living expert pooled neural network is input to a SoftMax function.
According to one embodiment, the instructions comprise further instructions for training the deep neural network by: determining a first loss function based on outputs from the face detection neural network and the body pose neural network; determining a second loss function based on output from the living expert pooled neural network; determining a joint loss function based on combining the first loss function and the second loss function; and back-propagating the joint loss function through the deep neural network to determine deep neural network weights.
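A minimal sketch of this joint loss computation is shown below, assuming mean squared error for the face detection and body pose outputs and cross entropy for the living expert pooling output (consistent with the loss statistics named in claims 13 and 14), and an unweighted sum as the combination; the function and variable names and the equal weighting are assumptions introduced for this sketch:

```python
import torch
import torch.nn as nn

mse = nn.MSELoss()
ce = nn.CrossEntropyLoss()  # expects raw logits and integer class labels

def joint_loss(face_out, pose_out, liveness_logits,
               face_gt, pose_gt, liveness_gt):
    # First loss function: face detection and body pose outputs vs. ground truth.
    loss1 = mse(face_out, face_gt) + mse(pose_out, pose_gt)
    # Second loss function: living expert pooling output vs. ground truth label.
    loss2 = ce(liveness_logits, liveness_gt)
    # Joint loss function: a simple (assumed) unweighted sum of the two.
    return loss1 + loss2

# Backpropagating the joint loss updates the weights of the common feature
# extraction network and the task-specific networks, e.g.:
#   loss = joint_loss(...); loss.backward(); optimizer.step()
```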
According to one embodiment, the deep neural network is trained by processing a training dataset comprising images and ground truth multiple times and determining weights based on minimizing the joint loss function.
According to one embodiment, during training, one or more outputs from the plurality of task-specific neural networks are set to zero.
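One way such zeroing could be sketched, assuming a simple masking helper whose name and flags are hypothetical:

```python
import torch

def mask_task_outputs(face_out, pose_out, use_face=True, use_pose=True):
    # Replacing a task output with a constant zero tensor cuts the gradient
    # path through that task head, so its loss term no longer influences the
    # weight update for that training step (e.g., when ground truth for that
    # task is unavailable for a given training image).
    if not use_face:
        face_out = torch.zeros_like(face_out)
    if not use_pose:
        pose_out = torch.zeros_like(pose_out)
    return face_out, pose_out
```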
According to the invention, a method comprises: providing a living body prediction output from a living body biometric analysis task determined by a deep neural network based on an image provided from an image sensor, wherein the living body biometric analysis is performed in the deep neural network, the deep neural network including a common feature extraction neural network and a plurality of task-specific neural networks including a face detection neural network, a body pose neural network, and a living body neural network including a region of interest (ROI) detection neural network and a texture analysis neural network, to perform the living body biometric analysis task by: inputting the image to the common feature extraction neural network to determine latent variables; inputting the latent variables to the face detection neural network and the living body neural network; inputting an output from the face detection neural network to the ROI detection neural network; inputting an output from the ROI detection neural network to the texture analysis neural network; inputting an output from the texture analysis neural network to a living expert pooled neural network to determine the living body prediction; and outputting the living body prediction.
In one aspect of the invention, the method includes training the deep neural network by combining output from the texture analysis neural network and output from the body pose neural network in the living expert pooled neural network to determine the living predictions.
In one aspect of the invention, the method includes an apparatus, wherein the apparatus is operated based on a living prediction output from the deep neural network.
In one aspect of the invention, operating the device based on the living body prediction includes determining user authentication.
In one aspect of the invention, the common feature extraction neural network includes a plurality of convolutional layers.
In one aspect of the invention, the plurality of task-specific neural networks includes a plurality of fully connected layers.
In one aspect of the invention, the outputs from the face detection neural network and the body pose neural network are input to a SoftMax function.
In one aspect of the invention, the output from the living expert pooled neural network is input to a SoftMax function.
In one aspect of the invention, the method includes training the deep neural network by: determining a first loss function based on outputs from the face detection neural network and the body pose neural network; determining a second loss function based on output from the living expert pooled neural network; determining a joint loss function based on combining the first loss function and the second loss function; and back-propagating the joint loss function through the deep neural network to determine deep neural network weights.
Claims (15)
1. A method, comprising:
providing a living body prediction output from a living body biometric analysis task determined by the deep neural network based on the image provided from the image sensor;
wherein the living body biometric analysis is performed in a deep neural network comprising a common feature extraction neural network and a plurality of task-specific neural networks including a face detection neural network, a body pose neural network, and a living body neural network comprising a region of interest (ROI) detection neural network and a texture analysis neural network, to perform the living body biometric analysis task by:
inputting the image to the common feature extraction neural network to determine latent variables;
inputting the latent variable to the face detection neural network and the living neural network;
inputting an output from the face detection neural network to the ROI detection neural network;
inputting an output from the ROI detection neural network to the texture analysis neural network;
inputting an output from the texture analysis neural network to a living expert pooled neural network to determine the living prediction; and
outputting the living body prediction.
2. The method of claim 1, further comprising training the deep neural network by combining output from the texture analysis neural network and output from the body pose neural network in the living expert pooled neural network to determine the living predictions.
3. The method of claim 1, further comprising a device, wherein the device is operated based on a living prediction output from the deep neural network.
4. The method of claim 3, wherein operating the device based on the living body prediction comprises determining user authentication.
5. The method of claim 1, wherein the common feature extraction neural network comprises a plurality of convolutional layers.
6. The method of claim 1, wherein the plurality of task-specific neural networks comprises a plurality of fully connected layers.
7. The method of claim 1, wherein the outputs from the face detection neural network and the body pose neural network are input to a SoftMax function.
8. The method of claim 1, wherein an output from the living expert pooled neural network is input to a SoftMax function.
9. The method of claim 1, further comprising training the deep neural network by:
determining a first loss function based on the outputs from the face detection neural network and the body pose neural network;
determining a second loss function based on output from the living expert pooled neural network;
determining a joint loss function based on combining the first loss function and the second loss function; and
the joint loss function is counter-propagated through the deep neural network to determine deep neural network weights.
10. The method of claim 9, wherein the deep neural network is trained by processing a training dataset comprising images and ground truth multiple times and determining weights based on minimizing the joint loss function.
11. The method of claim 1, wherein during training, one or more outputs from the plurality of task-specific neural networks are set to zero.
12. The method of claim 1, wherein weights included in the plurality of task-specific neural networks are frozen during training.
13. The method of claim 1, wherein the deep neural network is trained based on a loss function determined from sparse classification cross entropy statistics.
14. The method of claim 1, wherein the deep neural network is trained based on a loss function determined from mean square error statistics.
15. A system comprising a computer programmed to perform the method of any one of claims 1 to 14.
Applications Claiming Priority (3)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US63/310,401 | 2022-02-15 | | |
| US17/730,324 | 2022-04-27 | | |
| US17/730,324 (US11776323B2) | 2022-02-15 | 2022-04-27 | Biometric task network |
Publications (1)

| Publication Number | Publication Date |
|---|---|
| CN116612510A (en) | 2023-08-18 |
Family ID: 87675251
Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202310104742.8A (CN116612510A, pending) | Biological feature task network | | 2023-02-13 |

Country Status (1)

| Country | Link |
|---|---|
| CN (1) | CN116612510A (en) |
Legal Events

| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |