CN117835900A - Imaging photoplethysmography (IPPG) system and method for remote measurement of vital signs - Google Patents

Info

Publication number
CN117835900A
Authority
CN
China
Prior art keywords
time series
ippg
person
layers
skin
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202280056911.9A
Other languages
Chinese (zh)
Inventor
T. Marks
H. Mansour
S. Lohit
A. Comas Massague
Xiaoming Liu
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Mitsubishi Electric Corp
Original Assignee
Mitsubishi Electric Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US 17/486,981 (US 2023/0063221 A1)
Application filed by Mitsubishi Electric Corp
Priority claimed from PCT/JP2022/030510 (WO 2023/026861 A1)
Publication of CN117835900A
Legal status: Pending

Landscapes

  • Measuring And Recording Apparatus For Diagnosis (AREA)

Abstract

An imaging photoplethysmography (iPPG) system is provided. The iPPG system receives a sequence of images of different regions of a person's skin, wherein each region comprises pixels of different intensities indicative of a color change of the skin. The iPPG system transforms the image sequence into a multi-dimensional time series signal, each dimension corresponding to a different one of the skin regions. The iPPG system then processes the multi-dimensional time series signal with a time-series U-net, wherein a pass-through layer comprises a recurrent neural network, to generate a PPG waveform. The vital signs of the person are estimated based on the PPG waveform, and the iPPG system presents the estimated vital signs of the person.

Description

Imaging photoplethysmography (IPPG) system and method for remote measurement of vital signs
Technical Field
The present disclosure relates generally to remote monitoring of vital signs of a person, and more particularly to imaging photoplethysmography (iPPG) systems and methods for remote measurement of vital signs.
Background
Vital signs of a person, such as heart rate (HR), heart rate variability (HRV), respiratory rate (RR), or blood oxygen saturation, serve as indicators of the current state of the person and as potential predictors of serious medical events. For this reason, vital signs are widely monitored in inpatient and outpatient care settings, at home, and in other health, leisure, and fitness settings. One method of measuring vital signs is plethysmography. Plethysmography measures volume changes of an organ or body part of a person. Plethysmography has a variety of implementations, such as photoplethysmography (PPG).
PPG is an optical measurement technique that evaluates time-varying changes in the light reflectance or transmittance of a region or volume of interest, which can be used to detect blood volume changes in the microvascular bed of tissue. PPG is based on the principle that blood absorbs and reflects light differently from surrounding tissue, so the changes in blood volume accompanying each heartbeat affect the transmission or reflection of light. PPG measurements are commonly made non-invasively at the skin surface. The PPG waveform comprises a pulsatile physiological waveform, caused by cardiac-synchronous changes in blood volume with each heartbeat, superimposed on a slowly varying baseline with various low-frequency components attributed to other factors such as respiration, sympathetic nervous system activity, and thermoregulation.
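To make this decomposition concrete, the intensity measured at a single skin region can be written, in a simplified illustrative model (this equation is an expository assumption, not part of the original disclosure), as:

```latex
% I_0   : mean reflected intensity of the region
% a p(t): small cardiac-synchronous pulsatile component (the PPG signal of interest)
% b(t)  : slowly varying baseline (respiration, sympathetic activity, thermoregulation)
% n(t)  : measurement noise
I(t) = I_0 + a\,p(t) + b(t) + n(t)
```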
Conventional pulse oximeters for measuring a person's heart rate and (arterial) blood oxygen saturation are attached to the person's skin, for example at the fingertip, earlobe, or forehead. They are therefore referred to as "contact" PPG devices. A typical pulse oximeter may include a combination of green, blue, red, and infrared LEDs as light sources and a photodiode for detecting light that has been transmitted through the patient's tissue. Conventionally available pulse oximeters rapidly switch between measurements at different wavelengths, thereby measuring the transmittance of the same region or volume of tissue at the different wavelengths; this is called time division multiplexing. The transmittance over time at each wavelength yields a PPG signal for that wavelength. While contact PPG is regarded as a basically non-invasive technique, contact PPG measurements are often experienced as unpleasant, because the pulse oximeter is attached directly to the person and any cables limit the freedom of movement.
Recently, non-contact remote PPG (RPPG) for unobtrusive measurement has been introduced. RPPG uses a light source, or in general a radiation source, that is located remotely from the person of interest. Similarly, a detector (e.g., a camera or photodetector) may be located remotely from the person of interest. RPPG is also commonly referred to as imaging PPG (iPPG) because it uses an imaging sensor such as a camera. (Hereinafter, the terms "remote PPG (RPPG)" and "imaging PPG (iPPG)" may be used interchangeably.) Remote photoplethysmography systems and devices are considered unobtrusive in the sense that they do not require direct contact with the person, and in this sense they are well suited for medical applications as well as non-medical everyday applications.
One advantage of camera-based vital sign monitoring over on-body sensors is ease of use: because it is sufficient to aim the camera at the person, there is no need to attach a sensor to the person. Another advantage of camera-based vital sign monitoring compared to on-body sensors is that cameras have a higher spatial resolution than contact sensors, which mostly comprise single-element detectors.
One of the challenges of RPPG technology is the ability to provide accurate measurements in a volatile environment where unique noise sources are present. For example, in a variable environment such as an in-vehicle environment, the illumination on the driver changes dramatically and abruptly during driving (e.g., when passing through the shadows of buildings, trees, etc.), making it difficult to distinguish the iPPG signal from other changes. In addition, the driver's head and face move significantly due to various factors such as vehicle motion and the driver looking around inside and outside the vehicle (e.g., at oncoming vehicles, or at the rear-view and side mirrors).
Several approaches have been developed to enable robust camera-based vital sign measurements. One of these methods uses narrowband active Near Infrared (NIR) illumination, where the NIR illumination greatly reduces the adverse effects of illumination variations. For example, during driving, this approach can reduce the adverse effects of lighting changes (such as abrupt changes between sunlight and shadow, or from passing street lights and the headlamps of other vehicles) without impairing the driver's ability to see at night. However, NIR frequencies present new challenges for iPPG, including low signal-to-noise ratio (SNR). Reasons for this include the reduced sensitivity of camera sensors in the near-infrared part of the spectrum and the smaller amplitude of the intensity variations associated with blood flow. Therefore, there is a need for an RPPG system that can accurately estimate PPG signals at NIR frequencies.
Disclosure of Invention
It is therefore an object of some embodiments to estimate vital signs of a person with high accuracy. To this end, some embodiments utilize imaging photoplethysmography (iPPG). It is also an object of some embodiments to use a narrow-band Near Infrared (NIR) system and to determine a wavelength range that reduces illumination variation. Additionally or alternatively, some embodiments aim to use NIR monochromatic video (or image sequences) to obtain multidimensional time series data associated with different areas of a person's skin, and to accurately estimate a person's vital signs by processing the multidimensional time series data using a Deep Neural Network (DNN).
Some embodiments are based on the following recognition: the vital signs of a person can be estimated from a NIR monochromatic video or a NIR image sequence. To this end, the iPPG system obtains a sequence of NIR images of the face of a person of interest (also referred to as a "person") and segments each image into multiple spatial regions. Each spatial region includes a small portion of a person's face. The iPPG system analyzes changes in skin color or intensity in each of a plurality of spatial regions to estimate vital signs of a person.
To this end, the iPPG system generates a multi-dimensional time series signal, wherein the dimension of the multi-dimensional signal at each instant corresponds to the number of spatial regions and each point in time corresponds to one image in the image series. The multi-dimensional time series signal is then provided to a Deep Neural Network (DNN) based module to estimate the vital signs of the person. The DNN-based module applies a time-series U-network architecture to the multi-dimensional time-series data, wherein the through-connections of the U-network architecture are modified to incorporate temporal recursion for NIR imaging PPG.
Some embodiments are based on the following recognition: using Recurrent Neural Networks (RNNs) to sequentially process the multi-dimensional time series signals in the pass-through layers of a U-net neural network can enable more accurate estimation of the vital signs of a person.
Some embodiments are based on the following recognition: the sensitivity of the PPG signal to noise in the measurement of the intensity of the human skin (e.g. pixel intensity in the NIR image) is caused at least in part by the independent estimation of the photoplethysmography (PPG) signal from the intensity of the human skin measured at different spatial locations (or spatial regions). Some embodiments are based on the following recognition: at different locations, e.g. at different areas of the human skin, the measurement intensities may experience different measurement noise. When the PPG signal is estimated independently from the intensity at each location (e.g. the PPG signal estimated from the intensity at one skin region is estimated independently from the intensities or estimated signals from other skin regions), the independence of the different estimates may result in the estimator not being able to identify such noise.
Some embodiments are based on the following recognition: the intensities measured at different spatial areas of the human skin may be affected by different and sometimes even uncorrelated noise. Noise includes one or more of lighting changes, movement of a person, and the like. In contrast, heartbeats are a common source of intensity variations that exist in different areas of the skin. Thus, when the independent estimation is replaced by a joint estimation of the PPG signal measured from intensities at different areas of the human skin, the impact of noise on the estimated quality of vital signs can be reduced. In this way, some embodiments are able to extract PPG signals that are common to many skin regions (including regions that may also contain considerable noise), while ignoring noise signals that are not shared between many skin regions.
Some embodiments are based on the following recognition: it may be beneficial to jointly estimate the PPG signals of different skin areas, since jointly estimating the PPG signals of different skin areas reduces the noise affecting the estimation of vital signs. Some embodiments are based on the following recognition: two types of noise act on the intensity of the skin, namely external noise and internal noise. External noise affects the intensity of the skin due to external factors such as changes in illumination, movement of the person, and the resolution of the sensor measuring the intensity. Internal noise affects the intensity of the skin due to internal factors such as the different effects of cardiovascular blood flow on the appearance of different areas of a person's skin. For example, a heartbeat may affect the measured intensity of a person's forehead and cheeks more strongly than it affects the intensity of the nose.
Some embodiments are based on the following recognition: both types of noise can be resolved in the frequency domain of the intensity measurement. In particular, the external noise is typically non-periodic or has a periodic frequency that is different from the frequency of the signal of interest (e.g., the pulsatile signal), so the external noise can be detected in the frequency domain. On the other hand, internal noise, while causing intensity variations or time shifts of intensity variations in different areas of the skin, preserves the periodicity of the common source of intensity variations in the frequency domain.
Some embodiments aim to provide accurate estimates of vital signs even in unstable environments with severe lighting variations. For example, in an unstable environment such as an in-vehicle environment, some embodiments provide an RPPG system adapted to estimate the vital signs of a driver or passenger of a vehicle. However, during driving, the illumination of a person's face may change drastically. To address these challenges, one embodiment additionally or alternatively uses active in-vehicle illumination in a narrow spectral band in which the spectral energy of sunlight, street lamps, and the headlights and taillights of other vehicles is minimal. For example, sunlight reaching the earth's surface has less energy near the 940 nm NIR wavelength than at other wavelengths, due to the presence of water in the atmosphere. Street lamps and car lights typically output light in the visible spectrum, with very little power at infrared frequencies. To this end, one embodiment uses an active narrowband illumination source at or near 940 nm and a camera filter at the same frequency, which ensures that illumination variations due to ambient illumination are filtered out. Furthermore, since the narrow band is beyond the visible range, the light source is not perceived by humans, so people are not distracted by its presence. Furthermore, the narrower the bandwidth of the light source used in active illumination, the narrower the bandpass filter on the camera can be, which further suppresses intensity variations due to ambient illumination.
Thus, one embodiment uses a narrowband near-infrared (NIR) light source that illuminates the person's skin in a narrow band of wavelengths that includes 940 nm, and an NIR camera having a narrowband filter overlapping the wavelengths of the narrowband light source to measure the intensity of different areas of the skin within the narrow band.
One embodiment discloses an imaging photoplethysmography (iPPG) system for estimating vital signs of a person from images of the person's skin, comprising: at least one processor; and a memory having instructions stored thereon that, when executed by the at least one processor, cause the iPPG system to: receive a sequence of images of different areas of the person's skin, each area comprising pixels of different intensities indicative of a color change of the skin; transform the image sequence into a multi-dimensional time series signal, each dimension corresponding to a different region of the skin; and process the multi-dimensional time series signal with a time-series U-net of neurons to generate a PPG waveform, wherein the U-shape of the time-series U-net of neurons includes a contracting path formed by a series of contracting layers followed by an expanding path formed by a series of expanding layers, wherein at least some of the contracting layers downsample their inputs and at least some of the expanding layers upsample their inputs to form pairs of contracting and expanding layers of corresponding resolutions, and wherein at least some of the corresponding contracting and expanding layers are connected by a pass-through layer. Further, at least one of the pass-through layers includes a recurrent neural network that sequentially processes its inputs. The at least one processor is further configured to: estimate the vital signs of the person based on the PPG waveform; and present the estimated vital signs of the person.
Another embodiment discloses a method for estimating vital signs of a person, the method comprising: receiving a sequence of images of different areas of the person's skin, each area comprising pixels of different intensities indicative of a color change of the skin; transforming the image sequence into a multi-dimensional time series signal, each dimension corresponding to a different region of the skin; and processing the multi-dimensional time series signal with a time-series U-net of neurons to generate a PPG waveform, wherein the U-shape of the time-series U-net of neurons includes a contracting path formed by a series of contracting layers followed by an expanding path formed by a series of expanding layers, wherein at least some of the contracting layers downsample their inputs and at least some of the expanding layers upsample their inputs, forming pairs of contracting and expanding layers of corresponding resolution, wherein at least some of the corresponding contracting and expanding layers are connected by a pass-through layer, and wherein each of the pass-through layers includes a recurrent neural network that sequentially processes its inputs. The method further comprises: estimating the vital signs of the person based on the PPG waveform; and presenting the estimated vital signs of the person.
Drawings
[ FIG. 1A ]
Fig. 1A shows a block diagram illustrating an imaging photoplethysmography (iPPG) system for estimating vital signs of a person from Near Infrared (NIR) video according to an example embodiment.
[ FIG. 1B ]
Fig. 1B illustrates a functional diagram of an iPPG system according to an example embodiment.
[ FIG. 1C ]
Fig. 1C illustrates steps of a method performed by an iPPG system using NIR video according to an example embodiment.
[ FIG. 1D ]
Fig. 1D shows a block diagram illustrating an imaging photoplethysmography (iPPG) system for estimating vital signs of a person from color videos according to an example embodiment.
[ FIG. 1E ]
Fig. 1E illustrates a functional diagram of an iPPG system that extracts information from a single color channel of a video according to an example embodiment.
[ FIG. 1F ]
Fig. 1F illustrates a functional diagram of an iPPG system in which the multi-dimensional time series from each color channel for each region are stacked along a single channel dimension, according to an example embodiment.
[ FIG. 1G ]
Fig. 1G illustrates a functional diagram of an iPPG system combining multi-dimensional time series of multiple color channels into a single multi-dimensional time series according to an example embodiment.
[ FIG. 1H ]
Fig. 1H illustrates a functional diagram of an iPPG system in which the multi-dimensional time series from each color channel for each region are stacked along two different channel dimensions, according to an example embodiment.
[ FIG. 1I ]
Fig. 1I illustrates steps of a method performed by an iPPG system using color video according to an example embodiment.
[ FIG. 2A ]
Fig. 2A illustrates a temporal convolution of an input channel operating with a kernel of size 3 with a stride of 1, according to an example embodiment.
[ FIG. 2B ]
Fig. 2B illustrates a temporal convolution of an input channel operating with a kernel of size 3 with a stride of 2, according to an example embodiment.
[ FIG. 2C ]
Fig. 2C illustrates a temporal convolution of an input channel operating with a kernel of size 5 with a stride of 1, according to an example embodiment.
[ FIG. 3]
Fig. 3 illustrates a time convolution with multi-channel input according to an example embodiment.
[ FIG. 4]
Fig. 4 illustrates sequential processing performed by a Recurrent Neural Network (RNN) according to an example embodiment.
[ FIG. 5]
Fig. 5 shows a graph of a comparison of PPG signal spectra obtained using Near Infrared (NIR) and the visible part of the spectrum (RGB), according to an example embodiment.
[ FIG. 6A ]
Fig. 6A illustrates the effect of data enhancement on heart rate estimation using PTE6 (percentage of time with error less than 6 bpm) metrics, according to an example embodiment.
[ FIG. 6B ]
Fig. 6B illustrates the effect of data enhancement on heart rate estimation using Root Mean Square Error (RMSE) metrics, according to example embodiments.
[ FIG. 7]
Fig. 7 shows, for a test subject, a comparison of a PPG signal estimated by the time-series U-net with recurrence for imaging PPG (TURNIP) trained using a temporal loss (TL) versus a PPG signal estimated by TURNIP trained using a spectral loss (SL), each compared with the corresponding ground-truth PPG signal, according to an example embodiment.
[ FIG. 8]
Fig. 8 illustrates a block diagram of an iPPG system according to an example embodiment.
[ FIG. 9]
Fig. 9 illustrates a patient monitoring system using an iPPG system according to an example embodiment.
[ FIG. 10]
Fig. 10 illustrates a driver assistance system using an iPPG system according to an example embodiment.
Detailed Description
In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure. It will be apparent, however, to one skilled in the art that the present disclosure may be practiced without these specific details. In other instances, the apparatus and methods have been shown only in block diagram form in order not to obscure the present disclosure.
As used in this specification and claims, the terms "for example," "for instance," and "such as," and the verbs "comprising," "having," "including," and their other verb forms, when used in conjunction with a listing of one or more components or other items, are each to be construed as open-ended, meaning that the listing is not to be considered as excluding other, additional components or items. The term "based on" means based at least in part on. Also, it is to be understood that the phraseology and terminology employed herein is for the purpose of description and should not be regarded as limiting. Any headings used in this specification are for convenience only and do not have a legal or limiting effect.
Fig. 1A shows a block diagram illustrating an imaging photoplethysmography (iPPG) system 100 for estimating vital signs of a person according to an example embodiment. The iPPG system 100 corresponds to a modular framework in which a time series extraction module 101 and a PPG estimator module 109 can be used to generate PPG waveforms (also referred to as "PPG signals") from input images of different areas of human skin. The PPG waveform may also be used to accurately estimate one or more vital signs of a person. In some implementations, one or both of the time series extraction module 101 and the PPG estimator module 109 may be implemented using a neural network.
In some implementations, the iPPG system 100 can comprise a Near Infrared (NIR) light source configured to illuminate human skin, and a camera configured to capture monochromatic video 105 (also referred to as NIR video 105). The NIR video 105 captures at least one body part of one or more persons (such as the person's face). For ease of illustration, assume that the NIR video 105 captures a person's face. The NIR video 105 includes a plurality of frames. Thus, each frame in the NIR video 105 includes an image 107 of the face of the person. In operation, the iPPG system 100 obtains input such as NIR video 105. In some implementations, the image 107 in each frame of the NIR video 105 is segmented into a plurality of spatial regions 103, where the plurality of spatial regions 103 are jointly analyzed to accurately determine the PPG waveform.
Fig. 1D shows a block diagram of an alternative embodiment, in which an iPPG system 100 can comprise a color camera that captures color video such as RGB video 106 (which is so called because it contains red (R), green (G), and blue (B) channels). The RGB video 106 captures at least one body part of one or more persons (such as the face of a person).
For ease of illustration, assume that the RGB video 106 captures a person's face. The RGB video 106 includes a plurality of frames. Thus, each frame in the RGB video 106 includes an image 107 of the person's face. In this embodiment (unlike the embodiment shown in fig. 1C), the image 107 is an RGB image. In operation, the iPPG system 100 obtains inputs such as RGB video 106. In some implementations, the RGB image 108 in each frame of RGB video is divided into a red (R) channel, a green (G) channel, and a blue (B) channel. Each channel is segmented into a plurality of spatial regions 103, wherein the plurality of spatial regions 103 are jointly analyzed to accurately determine the PPG waveform. In some preferred embodiments, the pixel locations corresponding to each spatial region are uniform across the respective color channels.
The segmentation (partitioning) of each image 107 is based on the following insight: specific regions of the body part under consideration contain the strongest PPG signal. For example, the specific regions of the face (also referred to as "regions of interest (ROIs)", or simply "regions") containing the strongest PPG signal include regions located near the forehead, cheeks, and chin (as shown in fig. 1A). Image segmentation may be performed using at least one segmentation technique, such as segmentation based on estimated facial landmark locations, semantic segmentation, face parsing, threshold-based segmentation, edge-based segmentation, region-based segmentation, watershed segmentation, cluster-based segmentation algorithms, or neural networks for segmentation.
Segmentation of each image 107 results in an image sequence comprising different spatial regions of the plurality of spatial regions 103, wherein each spatial region covers a different part of the person's skin. For example, in the NIR video 105 and the RGB video 106 of a person's face, the image 107 in each frame of the video corresponds to the person's face, and the plurality of spatial regions 103 in the image sequence formed by segmenting the image 107 may correspond to areas of the person's skin. Furthermore, each spatial region of the plurality of spatial regions 103 is used for determining the PPG signal. Because certain parts of the face may be occluded, for example by hair (such as bangs on the forehead), facial hair, objects (such as sunglasses), or another body part (such as the hands), or because the head pose or camera pose makes a portion of the face invisible in the image, some regions may contain no skin or only partial skin, which may destroy or degrade the signal quality from these regions.
Some embodiments are based on the following recognition: in the measurement of the intensity of the skin of a person (e.g. the intensity of pixels in an image), the sensitivity of the PPG signal to noise is caused at least in part by the PPG signal being estimated independently from the intensity of the skin of the person measured at different spatial locations (or spatial regions). Some embodiments are also based on the following recognition: at different locations, e.g. at different areas of the human skin, the measurement intensities may experience different measurement noise. When the PPG signal is estimated independently from the intensity at each spatial region (e.g. the PPG signal estimated from the intensity at one skin region is estimated independently from the intensities or estimated signals from other skin regions), the independence of the different estimates may result in the estimator not being able to identify such noise that affects the accuracy of determining the PPG signal.
Noise may be due to one or more of lighting variations, human movement, and the like. Some embodiments are also based on the following recognition: heartbeats are a common source of intensity variations that exist in different areas of the skin. Thus, when the independent estimate is replaced with a joint estimate of the PPG signal measured from intensities at different areas of the human skin, the impact of noise on the quality of the vital sign estimate can be reduced.
Thus, the iPPG system 100 jointly analyzes multiple spatial regions 103 to estimate vital signs to reduce the impact of noise, wherein the vital signs are one or a combination of a pulse rate of a person and heart rate variability of the person (also referred to as "heart beat signals"). In some embodiments, the vital sign of the person is a one-dimensional signal at each instant in the time series.
Some embodiments are based on the following recognition: vital signs can be accurately estimated by employing temporal analysis. Thus, the iPPG system 100 is configured to extract at least one multi-dimensional time series signal from a sequence of images corresponding to different areas of human skin, wherein the time series signal is used to determine the PPG signal to accurately estimate vital signs.
To this end, the iPPG system 100 uses a time sequence extraction module 101.
Time series extraction module:
in some implementations, the time series extraction module 101 is configured to receive an image sequence of a plurality of frames of the NIR video 105 and extract a multi-dimensional time series signal from the image sequence. In some implementations, the time series extraction module 101 is further configured to segment the image 107 of each frame of the NIR monochrome video 105 into a plurality of spatial regions 103 and generate a multi-dimensional time series corresponding to the plurality of spatial regions 103.
In other embodiments, the time series extraction module 101 is configured to receive an image series of a plurality of frames of the RGB video 106 and extract a multi-dimensional time series signal from the image series. In some implementations, the time series extraction module 101 is further configured to divide the image 107 of the frame from the RGB video 106 into a red (R) channel, a green (G) channel, and a blue (B) channel. In some embodiments, the time series extraction module 101 is further configured to divide each of the R, G, B channels of the image into a plurality of spatial regions 103 and generate a multi-dimensional time series corresponding to the plurality of spatial regions 103.
The images 107 in the image sequence may contain different areas of human skin, wherein each area comprises pixels of different intensities indicative of a change in skin color. Fig. 1A shows a skin region (facial region) located on the face, but it should be understood that the various embodiments are not limited to the use of the face. In some embodiments, image sequences corresponding to other areas of exposed skin (such as a person's neck or wrist) may be obtained and processed by the time series extraction module 101.
In some implementations, each dimension of the multi-dimensional time series signal obtained from the NIR monochromatic video 105 corresponds to a different spatial region among a plurality of spatial regions of the skin of the person in the image 107.
In some implementations, each dimension of the multi-dimensional time series signal obtained from the RGB video 106 corresponds to a different spatial region and a different color channel among a plurality of spatial regions of the skin of the person in the image 107.
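The following sketch illustrates, with assumed array shapes and NumPy as a stand-in implementation, the two options described above for combining per-region, per-color-channel time series (stacking along a single channel dimension versus keeping regions and color channels as two separate dimensions); none of the variable names come from the patent.

```python
import numpy as np

# Assumed shapes: T video frames, R skin regions, 3 color channels (R, G, B).
T, R = 300, 48
rgb_series = np.random.rand(T, R, 3)   # mean intensity per frame, region, channel

# Option 1: stack the per-color-channel time series along a single channel
# dimension, giving one (3*R)-dimensional time series (cf. figs. 1F/1G).
single_dim = rgb_series.reshape(T, R * 3)         # shape (T, 144)

# Option 2: keep regions and color channels as two separate channel dimensions
# (cf. fig. 1H), for a network whose input accepts a 2D channel layout.
two_dims = rgb_series                              # shape (T, R, 3)
```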
Furthermore, in some embodiments, each dimension is a signal from a region of interest (ROI) of multiple spatial regions of human skin that are explicitly tracked (alternatively, explicitly detected in each frame). Tracking (alternatively, detecting) reduces the amount of motion-related noise. However, due to factors such as landmark localization errors, illumination variations, 3D head rotations, and deformations such as facial expressions, the multi-dimensional time series still contains significant noise.
To recover the signal of interest (PPG signal) from the noisy multi-dimensional time series signal, the multi-dimensional time series signal is given to a PPG estimator module 109.
The PPG estimator module:
the PPG estimator module 109 is configured to recover and output 111 a PPG signal from the noisy multi-dimensional time series signal. Furthermore, based on the PPG signal, vital signs of the person are determined.
Given the quasi-periodic nature of the time series signal received by the PPG estimator module 109, the architecture of the PPG estimator module 109 is designed to extract temporal features at different temporal resolutions. To this end, the PPG estimator module 109 is implemented using a neural network such as a Recurrent Neural Network (RNN), a Deep Neural Network (DNN), or the like.
In some embodiments, the present disclosure proposes a time-series U-net with recurrence for imaging PPG (TURNIP) architecture for the PPG estimator module 109. Fig. 1B illustrates the TURNIP architecture, which is based on a U-net architecture coupled to an RNN architecture.
Some embodiments are based on the following recognition: U-nets are convolutional network architectures that have been used in image processing applications such as image segmentation. The U-net architecture is a "U"-shaped architecture that includes a contracting path on the left side of the U and an expanding path on the right side of the U. The U-net architecture can be broadly divided into an encoder network corresponding to the contracting path and a decoder network corresponding to the expanding path, with the encoder network followed by the decoder network.
The encoder network forms the first half of the U-net architecture. In the image processing applications that typically use a U-net architecture, the encoder consists of a series of spatial convolution layers and may include max-pooling downsampling layers to encode the input image into feature representations at multiple different levels.
The decoder network forms the second half of the U-net architecture and includes a series of convolutional layers and upsampling layers. The goal of the decoder network is to semantically project the (lower-resolution) features learned by the encoder network back into the original (higher-resolution) space. In the image processing applications that typically use a U-net, the convolution layers use spatial convolution, and the input and output spaces are image pixel spaces.
Some embodiments are based on the following recognition: the input of the PPG estimator module 109 (also referred to as the "PPG estimator network") is a multi-dimensional time series, while the desired output is a one-dimensional time series of vital signs. Thus, in some preferred embodiments, the convolution layers of the encoder and decoder subnetworks of the time series U-net 109a use temporal convolution.
Some embodiments are based on the following further insight: Recurrent Neural Networks (RNNs) are a class of Artificial Neural Networks (ANNs) in which connections between nodes form a directed graph along a time sequence. The directed graph allows the RNN to exhibit temporally dynamic behavior. Unlike feed-forward neural networks, RNNs can use their internal state (memory) to process input sequences of variable length. Thus, an RNN is able to remember important features of past inputs, which allows the RNN to determine temporal patterns more accurately. In this way, RNNs can form a deeper understanding of a sequence and its context. Thus, RNNs are well suited to sequence data, such as time series.
In some embodiments of the proposed TURNIP architecture of the iPPG system 100, the U-net architecture is applied to time series data. In some embodiments, the pass-through connections incorporate 1×1 convolutions. Unlike previous U-nets, in TURNIP the pass-through connections are modified to incorporate temporal recurrence by using RNNs. Thus, the PPG estimator module 109 includes a time-series U-net neural network (also referred to as "U-net") 109a coupled to a Recurrent Neural Network (RNN) 109b. The U-net 109a and RNN 109b are coupled to process the multi-dimensional time series data to accurately determine the PPG waveform, where the PPG waveform is used to estimate the vital signs of the person. Further details regarding the operation of the proposed iPPG system 100 using the TURNIP architecture are described below with reference to figs. 1B-1J.
Fig. 1B illustrates a functional diagram of an iPPG system 100 according to an example embodiment. FIG. 1B is described in connection with FIG. 1A. The iPPG system 100 initially receives one or more videos of a body part (e.g., face) of a person. The one or more videos may be Near Infrared (NIR) videos. In some embodiments, the iPPG system 100 comprises a NIR illumination source and a camera, wherein the NIR illumination is configured to illuminate a body part of a person with NIR light such that the camera can record one or more NIR videos of a specific body part of the person. One or more NIR videos are used to determine PPG waveforms using a turip architecture.
To this end, for each NIR video 105 of the one or more videos, the iPPG system 100 obtains an image (e.g., image 107) from each of a sequence of image frames of the NIR video 105. Each image is segmented or partitioned into a plurality of spatial regions (e.g., spatial regions 103), resulting in a sequence of images whose spatial regions correspond to different regions of the body part. Segmentation of the image 107 is performed such that each spatial region includes a specific region of the body part that may strongly exhibit the PPG signal. Thus, each spatial region of the plurality of spatial regions 103 is a region of interest (ROI) for determining the PPG signal. Further, for each spatial region, a time series signal is derived using the time series extraction module 101.
In an example embodiment, for each NIR video 105, the temporal sequence extraction module 101 extracts a 48-dimensional temporal sequence corresponding to pixel intensities over time of 48 face Regions (ROIs), where the face regions correspond to the plurality of spatial regions 103. In some implementations, the multi-dimensional time series signal may have more or less than 48 dimensions corresponding to more or less than 48 facial regions.
In some implementations, to extract ROIs in the image that are associated with a particular body part of a person, a plurality of landmark locations corresponding to the particular body part of the person are located in each image frame 107 of the video. Thus, the plurality of landmark positions may vary depending on the body part determined for the PPG signal. In an example embodiment, when a person's face is used to determine the PPG signal, 68 landmark locations (i.e., 68 facial landmarks) corresponding to the person's face are located in each image frame 107 of the video.
Some embodiments are based on the following recognition: due to imperfect or inconsistent landmark localization, jitter of the estimated landmark locations in successive frames causes the boundaries of the regions to jitter from one frame to the next, which adds noise to the extracted time series. To mitigate this noise, the landmark locations are temporally smoothed before the ROIs (e.g., the 48 face regions) are extracted.
Thus, in some embodiments, the plurality of landmark locations are smoothed across time using a smoothing technique, such as a moving average, before the ROIs are extracted from the plurality of landmark locations. Specifically, a temporal kernel of predetermined length is applied to the plurality of landmark locations over time to determine the location of each landmark in each video frame image 107 as a weighted average of the estimated locations of the landmark in previous and subsequent frames within a time window corresponding to the kernel length.
For example, in one embodiment, a moving average of kernels of length 11 frames is used to smooth 68 landmark positions. The smoothed landmark locations in each frame of the NIR video 105 (i.e., in each image 107) are then used to extract 48 ROIs in the frame around the forehead, cheeks and chin. Then, for the frame, the average intensity of the pixels in each of the 48 spatial regions is calculated. In this way, intensity values for each region (or ROI) of the plurality of spatial regions 103 are extracted from each image, wherein the intensity values from the plurality of spatial regions 103 of the frame sequence 107 (e.g., a sequence of 314 frames) form a multi-dimensional time sequence.
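A minimal sketch of such temporal landmark smoothing is shown below; the array shapes, the uniform (box) weighting, and the function name are illustrative assumptions, with only the 11-frame kernel length taken from the example above.

```python
import numpy as np

def smooth_landmarks(landmarks, kernel_len=11):
    """Moving-average smoothing of landmark trajectories over time.

    landmarks: array of shape (num_frames, num_landmarks, 2) holding (x, y)
        positions (e.g., 68 facial landmarks per frame).
    kernel_len: temporal window length in frames (e.g., 11).
    """
    num_frames = landmarks.shape[0]
    half = kernel_len // 2
    smoothed = np.empty_like(landmarks, dtype=float)
    for t in range(num_frames):
        lo, hi = max(0, t - half), min(num_frames, t + half + 1)
        # Each landmark location becomes the average of its estimated locations
        # in the surrounding frames (window truncated at the sequence ends).
        smoothed[t] = landmarks[lo:hi].mean(axis=0)
    return smoothed
```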
The time series extraction module 101 is configured to convert a sequence of images 107 corresponding to the plurality of spatial regions 103 into a multi-dimensional time series signal. Some embodiments are based on the following recognition: spatial averaging reduces the effects of noise sources such as quantization noise of cameras that capture video (NIR video 105 or RGB video 106) and minor distortions due to human head and face movements. To this end, the pixel intensities of pixels from each of a plurality of spatial regions (also referred to as "different spatial regions") 103 at a time instant are averaged to produce a value for each dimension of the multi-dimensional time series signal at that time instant.
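The conversion of an image sequence into a multi-dimensional time series by spatial averaging can be sketched as follows; the mask-based region representation and all shapes are assumptions for illustration (in practice the regions would be recomputed per frame from the smoothed landmarks).

```python
import numpy as np

def extract_time_series(frames, region_masks):
    """Spatially average each skin region in each frame.

    frames: array of shape (num_frames, H, W) of (NIR) pixel intensities.
    region_masks: boolean array of shape (num_regions, H, W), one mask per region.
    Returns an array of shape (num_frames, num_regions): one dimension of the
    multi-dimensional time series per skin region.
    """
    num_frames, num_regions = frames.shape[0], region_masks.shape[0]
    series = np.zeros((num_frames, num_regions))
    for t in range(num_frames):
        for r in range(num_regions):
            # Mean intensity of the pixels inside region r at time t.
            series[t, r] = frames[t][region_masks[r]].mean()
    return series
```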
In some implementations, the time series extraction module 101 is further configured to time window (or partition) the multi-dimensional time series signal. Thus, there may be a plurality of sections of the multi-dimensional time series signal, wherein at least some portions of each section of the plurality of sections overlap with subsequent sections of the plurality of sections, thereby forming an overlapping section sequence. Furthermore, the multi-dimensional time series corresponding to each segment is normalized before submitting the multi-dimensional time series signal to PPG estimator module 109, wherein PPG estimator module 109 may use time series U-net 109a to process each segment in the overlapping sequence from the multi-dimensional time series signal.
Each windowed sequence has a particular duration and a particular frame stride during inference (e.g., a 10-second duration, i.e., 300 frames at 30 fps, with a 10-frame stride during inference), where the stride indicates the number of frames (e.g., 10 frames) by which subsequent windowed sequences (e.g., 10-second windowed sequences) are time-shifted.
In the example case where the vital sign of the person to be estimated is a heartbeat signal, the heartbeat signal is locally periodic, with the period of the heartbeat signal being varied over time. In this case, some embodiments are based on the following recognition: the 10 second window is a good compromise duration to extract the current heart rate.
Some embodiments are based on the following recognition: longer strides are effective for training using a larger data set. Thus, the stride (in frames) used for windowing during training may be longer (e.g., 60 frames) than the stride (e.g., 10 frames) used for windowing during inference. The stride length in frames may also vary depending on the vital sign of the person to be estimated.
In some implementations, a preamble of a particular duration (e.g., 0.5 seconds) is added to each window. For example, adding multiple additional frames (e.g., 14) immediately before the window starts results in a multi-dimensional time series of longer duration (e.g., 314 frames).
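A sketch of the windowing step, using the example numbers above (10-second windows at 30 fps, a 14-frame preamble, and a configurable stride), is given below; the per-window zero-mean/unit-variance normalization is one plausible choice, since the text only states that each segment is normalized.

```python
import numpy as np

def window_time_series(series, fps=30, win_sec=10, stride=10, preamble=14):
    """Split a multi-dimensional time series into overlapping, normalized segments.

    series: array of shape (num_frames, num_regions).
    Each segment has win_sec*fps frames (e.g., 300) plus `preamble` extra frames
    prepended (e.g., 14, about 0.5 s), i.e., 314 frames; `stride` is the shift in
    frames between consecutive windows (e.g., 10 at inference, 60 during training).
    """
    win_len = win_sec * fps
    segments = []
    for start in range(preamble, series.shape[0] - win_len + 1, stride):
        seg = series[start - preamble : start + win_len]            # (314, regions)
        seg = (seg - seg.mean(axis=0)) / (seg.std(axis=0) + 1e-8)   # per-channel normalization
        segments.append(seg)
    return np.stack(segments) if segments else np.empty((0, preamble + win_len, series.shape[1]))
```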
In some implementations, when the input is NIR video 105, the multi-dimensional time series (e.g., a 48-dimensional time series) is fed as channels into the PPG estimator module 109. The PPG estimator module 109 includes a series of layers associated with the time series U-net 109a and the RNN 109b forming the TURNIP architecture. The channels corresponding to the multi-dimensional time series signal are combined together during the forward pass through the series of layers. In the PPG estimator module 109, the time series U-net 109a together with the RNN 109b maps the multi-dimensional time series signal to the desired PPG signal. For each windowed sequence of the multi-dimensional time series signal (e.g., each 10-second window), the TURNIP architecture extracts convolutional features at particular temporal resolutions (e.g., three temporal resolutions). The particular temporal resolutions may be predefined.
Furthermore, in some embodiments, the TURNIP architecture downsamples the input time series by a first factor and then by an additional second factor. The first and second factors for downsampling the input time series may be predefined (e.g., the first factor may be 3 and the second factor may be 2). The PPG estimator module 109 then estimates the desired PPG signal in a deterministic manner.
TURNIP architecture:
the turneip architecture is a neural network (e.g., DNN) based architecture that trains on at least one dataset to accurately determine the PPG signal based on multi-dimensional time series data. The time series U-web 109a includes a contracted path formed from a series of contracted layers followed by an expanded path formed from a series of expanded layers. The series of shrink layers is a combination of a convolutional layer, a max-pooling layer, and a discard layer. Similarly, a series of expansion layers is a combination of a convolutional layer, an upsampling layer, and a discard layer. At least some of the shrink layers downsamples their input multi-dimensional time series signals and at least some of the expansion layers upsample their inputs to form pairs of shrink and expansion layers of respective resolutions. Furthermore, at least some of the shrink and expansion layers are connected by a pass-through layer. The plurality of shrink layers form a coding sub-network that can be regarded as coding its input data into a sequence with a lower temporal resolution. In another aspect, the plurality of expansion layers form a decoding sub-network that may be considered to decode input data encoded by the encoding network. Furthermore, at least at some resolutions, the coding sub-network and the decoding sub-network are connected by a through connection. In parallel with the 1 x 1 convolution pass-through connection, a specific recursive pass-through connection is also included. The RNN109b is used to implement a specific recursive pass-through connection. The RNN109b processes its inputs sequentially, and the RNN109b is contained in each pass-through layer.
In a preferred embodiment, RNN 109b is implemented using a Gated Recurrent Unit (GRU) 113 architecture to provide the temporal recurrence. In other embodiments, the RNN 109b may be implemented using a different RNN architecture, such as a Long Short-Term Memory (LSTM) architecture. Some embodiments are based on the following recognition: the GRU is an evolution of the standard RNN. The GRU uses gates to control the flow of information and, unlike the LSTM, the GRU has no separate cell state C_t; the GRU has only a hidden state H_t. At each timestep t, the GRU takes the input X_t and the hidden state H_{t-1} from the previous timestep t-1. It then outputs a new hidden state H_t, which is passed to the GRU at the next timestep. There are mainly two gates in the GRU: the first gate is a reset gate and the other is an update gate. Some embodiments are also based on the following recognition: compared with other types of RNNs (such as Long Short-Term Memory (LSTM) networks), the GRU trains faster due to its simpler architecture.
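For reference, the standard GRU update equations (in generic notation; they are not reproduced verbatim in the original text) are:

```latex
\begin{aligned}
r_t &= \sigma\!\left(W_r X_t + U_r H_{t-1} + b_r\right) && \text{(reset gate)}\\
z_t &= \sigma\!\left(W_z X_t + U_z H_{t-1} + b_z\right) && \text{(update gate)}\\
\tilde{H}_t &= \tanh\!\left(W_h X_t + U_h\,(r_t \odot H_{t-1}) + b_h\right) && \text{(candidate state)}\\
H_t &= (1 - z_t) \odot H_{t-1} + z_t \odot \tilde{H}_t && \text{(new hidden state)}
\end{aligned}
```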
Shrink path:
in the time series U-net 109a, the shrink path is formed from a series of shrink layers, where each shrink layer includes a combination of one or more of a convolutional layer, a single downsampling convolutional layer, and a dropout layer. The dropout layer is a regularization layer that reduces overfitting of the layer with which it is used (e.g., a convolutional layer) and improves the generalization ability of that layer. The dropout layer drops the outputs of the layer with which it is used (e.g., a convolutional layer) with a certain probability p, also called the dropout rate. The dropout rate may be calculated in real time or predefined based on a training data set used to train the TURNIP architecture. In an example embodiment, the dropout rate (or p) for each dropout layer is equal to 0.3.
Alternatively, in some other embodiments, the shrink path of the time series U-net 109a may not include a dropout layer. In such embodiments, the shrink path is formed from a series of shrink layers, where each shrink layer includes one or more of a convolutional layer and a single downsampling convolutional layer.
Furthermore, in some embodiments of the TURNIP architecture, the series of shrink layers is formed from 5 shrink layers. In other embodiments, there may be more than 5 shrink layers, and in further embodiments, there may be fewer than 5 shrink layers. Of the 5 shrink layers, the first shrink layer 116a comprises two convolution layers. The first shrink layer 116a processes its input, where the input is the multi-dimensional time series signal provided as a plurality of channels, and the multi-channel output generated by the first shrink layer 116a is submitted to one of the layers in the expansion path (e.g., the fourth expansion layer 118d). Note that while we refer to all layers in the shrink path as "shrink layers" and all layers in the expansion path as "expansion layers," in some embodiments, not every shrink layer actually shrinks the length of its input sequence. For example, in one embodiment shown in fig. 1B, the sequence output from the first shrink layer 116a has substantially the same length as the sequence input to the first shrink layer 116a. This is because the convolution performed in the first shrink layer has stride=1. Similarly, not every "expansion layer" actually expands the length of its input sequence. For example, the input and output of the fourth expansion layer have substantially the same length.
Further, each of the second shrink layer 116b, the third shrink layer 116c, and the fourth shrink layer 116d includes a convolutional layer (sometimes referred to as a "single downsampling layer," although as mentioned above, not every downsampling layer actually reduces its input length), followed by a dropout layer having a particular dropout rate (e.g., p=0.3). In one embodiment, as shown in fig. 1B, the second shrink layer 116b (whose convolution has stride=3) and the fourth shrink layer 116d (whose convolution has stride=2) each downsample their inputs by a factor equal to their stride, while the third shrink layer 116c and the fifth shrink layer 116e do not downsample their inputs. In this embodiment, the downsampling is achieved via the stride of the convolution of each downsampling layer, but in alternative embodiments, the downsampling may be achieved using other means such as max pooling or average pooling. The second shrink layer 116b receives the input channels corresponding to the multi-dimensional time series signal extracted by the time series extraction module 101 and submits its output to the third shrink layer 116c and to the corresponding pass-through layer 113a. Further, each of the third shrink layer and the fourth shrink layer receives its input from the previous shrink layer and submits its output to the next shrink layer and to a corresponding pass-through layer.
The fifth shrink layer, which is the last shrink layer in the series of five shrink layers, comprises two convolutional layers followed by a dropout layer having a particular dropout rate. The fifth shrink layer receives input from the fourth shrink layer and submits its output to one of the expansion layers in the expansion path (e.g., the first expansion layer 118a).
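The contracting (shrink) path described above can be sketched in PyTorch as follows. The channel counts, kernel sizes, and downsampling strides (3 and 2 in the second and fourth layers) follow the example embodiment in the text; the activation functions, padding, and the second convolution of the first and fifth layers are assumptions, so this is an illustrative sketch rather than the patented implementation.

```python
import torch.nn as nn

class ShrinkPath(nn.Module):
    """Illustrative contracting ("shrink") path of the time-series U-net."""

    def __init__(self, chan_in=48, p_drop=0.3):
        super().__init__()
        # First shrink layer (116a): two convolutions, stride 1; its output is the
        # skip connection sent directly to the fourth expansion layer (118d).
        self.layer1 = nn.Sequential(
            nn.Conv1d(chan_in, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
        )
        # Second shrink layer (116b): downsamples by 3 via the convolution stride.
        self.layer2 = nn.Sequential(
            nn.Conv1d(chan_in, 64, kernel_size=9, stride=3, padding=4),
            nn.ReLU(), nn.Dropout(p_drop),
        )
        # Third shrink layer (116c): no downsampling.
        self.layer3 = nn.Sequential(
            nn.Conv1d(64, 128, kernel_size=7, padding=3), nn.ReLU(), nn.Dropout(p_drop),
        )
        # Fourth shrink layer (116d): downsamples by 2.
        self.layer4 = nn.Sequential(
            nn.Conv1d(128, 256, kernel_size=7, stride=2, padding=3),
            nn.ReLU(), nn.Dropout(p_drop),
        )
        # Fifth shrink layer (116e): two convolutions followed by dropout.
        self.layer5 = nn.Sequential(
            nn.Conv1d(256, 512, kernel_size=7, padding=3), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=7, padding=3), nn.ReLU(),
            nn.Dropout(p_drop),
        )

    def forward(self, x):                # x: (batch, chan_in, time)
        skip_a = self.layer1(x)          # -> fourth expansion layer (118d)
        h2 = self.layer2(x)              # -> first pass-through layer (113a)
        h3 = self.layer3(h2)             # -> second pass-through layer (113b)
        h4 = self.layer4(h3)             # -> third pass-through layer (113c)
        h5 = self.layer5(h4)             # -> first expansion layer (118a)
        return skip_a, h2, h3, h4, h5
```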
Expansion path:
in some embodiments, the expansion path includes a series of five expansion layers. In one such embodiment, as shown in fig. 1B, among the series of 5 expansion layers, the first expansion layer 118a is configured to perform upsampling, concatenation with the output of its corresponding pass-through layer 113c, and convolution of its input time series. Similarly, the third expansion layer 118c performs upsampling, concatenation with the output of its corresponding pass-through layer 113a, and convolution of its input time series. Each of the second and fourth expansion layers 118b, 118d is configured to perform concatenation with the output of its corresponding pass-through layer and convolution of its input time series. The fourth expansion layer additionally includes a dropout layer having a particular dropout rate (e.g., p=0.3). The fifth expansion layer consists of a convolutional layer followed by a dropout layer with a specified dropout rate. To upsample the input data at the first and third expansion layers 118a, 118c, each of these two expansion layers applies an upsampling operation to its input to produce upsampled data. Furthermore, the upsampled data is used for the concatenation and the temporal convolution in each of these expansion layers.
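One expansion layer of the kind just described (optional upsampling, concatenation with the corresponding pass-through output, then temporal convolution, with dropout where specified) might be sketched in PyTorch as follows; the channel counts, the interpolation mode, and the activation are assumptions not taken from the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExpansionLayer(nn.Module):
    """Illustrative expansion layer: upsample (optionally), concatenate the
    pass-through output, then apply a temporal convolution (plus optional dropout)."""

    def __init__(self, chan_in, chan_skip, chan_out, kernel_size=7,
                 upsample_factor=1, p_drop=0.0):
        super().__init__()
        self.upsample_factor = upsample_factor
        self.conv = nn.Conv1d(chan_in + chan_skip, chan_out,
                              kernel_size, padding=kernel_size // 2)
        self.drop = nn.Dropout(p_drop) if p_drop > 0 else nn.Identity()

    def forward(self, x, skip):          # x, skip: (batch, channels, time)
        if self.upsample_factor > 1:
            # Undo the downsampling of the paired shrink layer.
            x = F.interpolate(x, scale_factor=self.upsample_factor,
                              mode="linear", align_corners=False)
        if x.shape[-1] != skip.shape[-1]:
            # Lengths can differ by a frame or two after strided convolutions.
            x = F.interpolate(x, size=skip.shape[-1],
                              mode="linear", align_corners=False)
        x = torch.cat([x, skip], dim=1)  # concatenate along the channel dimension
        return self.drop(torch.relu(self.conv(x)))
```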
Still referring to fig. 1B, the output of the time series extraction module 101, a multi-dimensional time series, is provided as channels to the PPG estimator module 109. Each shrink layer processes a number (chan_in) of input channels into a number (chan_out) of output channels using a kernel of a particular size (e.g., a kernel of size k=3) and a particular stride (e.g., stride s=1). In some example embodiments, the first shrink layer 116a may have chan_in=48 input channels and chan_out=64 output channels. The output of the first shrink layer 116a is submitted to the fourth expansion layer 118d.
Similarly, for the second shrink layer 116b, the third shrink layer 116c, the fourth shrink layer 116d, and the fifth shrink layer 116e, input channels, output channels, cores, and strides are specified.
In one embodiment shown in fig. 1B, for example, the convolution performed by the second shrink layer 116B has 48 input channels and 64 output channels, with a kernel size k=9 and a stride s=3. The output of the second shrink layer 116b is fed to the third shrink layer 116c and to the first through layer 113a.
Each pass-through layer, such as the first pass-through layer 113a, consists of a 1×1 convolutional layer 117 and an RNN, such as the GRU 113, whose outputs are concatenated 115 and then passed to the corresponding layer of the expansion path.
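A minimal PyTorch sketch of such a pass-through layer, with the 1×1 convolution and the GRU in parallel and their outputs concatenated along the channel dimension, is shown below; the channel sizes and the single-layer GRU are assumptions for illustration.

```python
import torch
import torch.nn as nn

class PassThroughLayer(nn.Module):
    """Illustrative TURNIP-style pass-through layer: 1x1 convolution in parallel
    with a GRU, outputs concatenated before being sent to the expansion path."""

    def __init__(self, chan_in, chan_conv, chan_gru):
        super().__init__()
        self.conv1x1 = nn.Conv1d(chan_in, chan_conv, kernel_size=1)
        self.gru = nn.GRU(input_size=chan_in, hidden_size=chan_gru, batch_first=True)

    def forward(self, x):                         # x: (batch, chan_in, time)
        conv_out = self.conv1x1(x)                # (batch, chan_conv, time)
        # The GRU processes the window sequentially; since no hidden state is
        # carried over between calls, it is effectively re-initialized (to zeros)
        # for each new window.
        gru_out, _ = self.gru(x.transpose(1, 2))  # (batch, time, chan_gru)
        gru_out = gru_out.transpose(1, 2)         # (batch, chan_gru, time)
        return torch.cat([conv_out, gru_out], dim=1)
```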
The third shrink layer 116c has a convolution with 64 input channels and 128 output channels, with kernel size k=7 and stride s=1. The output of the third shrink layer 116c is provided to the fourth shrink layer 116d of the shrink path and to the second pass-through layer 113b, the output of the second pass-through layer 113b being transferred to the corresponding layer 118b of the expansion path. The fourth shrink layer 116d has 128 input channels and 256 output channels and uses a convolution with a kernel size of 7 and a stride of 2; the output of the fourth shrink layer 116d is provided to the fifth shrink layer 116e of the shrink path and to the third pass-through layer 113c, which transfers its output to the corresponding expansion layer 118a. In the final stage of the shrink path, the fifth shrink layer 116e has 256 input channels and 512 output channels and a convolution with a kernel size of 7 and a stride of 1. Further, the output of the fifth shrink layer 116e is provided to the first expansion layer 118a of the expansion path.
The first expansion layer 118a takes two inputs, with the first input taken from the fifth contraction layer 116e and the second input taken from the output of the third pass-through layer 113 c. The first expansion layer 118a processes its input and passes its output to the second expansion layer 118b. The second expansion layer 118b also takes two inputs, where the first input corresponds to the output of the first expansion layer 118a and the second input corresponds to the output of the second pass-through layer 113 b.
Similarly, the first input of the third expansion layer 118c corresponds to the output of the second expansion layer 118b, and the second input of the third expansion layer 118c corresponds to the output of the first through layer 113 a. Further, the output of the third expansion layer 118c is provided to the fourth expansion layer 118d.
The fourth expansion layer 118d takes a first input from the third expansion layer 118c and a second input from the first shrink layer 116a. The output of the fourth expansion layer 118d is provided to a fifth expansion layer 118e that performs channel reduction (e.g., from 64 channels to 1 channel), followed by a dropout layer.
In some embodiments, the output of the fifth expansion layer 118e is the final output of the PPG estimator module 109. This output (e.g., a one-dimensional time series of the estimated PPG waveform) is used to obtain the output 111 of the iPPG system 100.
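By way of a non-limiting illustration, the data flow described above can be sketched in code. The following Python (PyTorch) sketch shows a simplified two-scale time series U-net with a single GRU pass-through layer; the class names are hypothetical, the channel counts and kernel sizes only loosely follow the example numbers given above, and the full five-level architecture of the PPG estimator module 109 (including the dropout layers and the additional pass-through layers) is omitted for brevity.

import torch
import torch.nn as nn

class GRUPassThrough(nn.Module):
    # Pass-through layer: a 1x1 convolution in parallel with a GRU,
    # whose outputs are concatenated along the channel dimension.
    def __init__(self, channels):
        super().__init__()
        self.conv1x1 = nn.Conv1d(channels, channels, kernel_size=1)
        self.gru = nn.GRU(input_size=channels, hidden_size=channels, batch_first=True)

    def forward(self, x):                      # x: (batch, channels, time)
        skip = self.conv1x1(x)                 # parallel (non-sequential) branch
        seq, _ = self.gru(x.transpose(1, 2))   # sequential branch over time steps
        return torch.cat([skip, seq.transpose(1, 2)], dim=1)   # (batch, 2*channels, time)

class TinyTimeSeriesUNet(nn.Module):
    # Schematic two-scale time series U-net, not the full five-level architecture.
    def __init__(self, in_channels=48):
        super().__init__()
        self.enc1 = nn.Conv1d(in_channels, 64, kernel_size=3, padding=1)    # no downsampling
        self.enc2 = nn.Conv1d(64, 128, kernel_size=9, stride=3, padding=4)  # downsample by 3
        self.skip = GRUPassThrough(64)
        self.up = nn.Upsample(scale_factor=3, mode='linear', align_corners=False)
        self.dec1 = nn.Conv1d(128, 64, kernel_size=7, padding=3)
        self.dec2 = nn.Conv1d(64 + 2 * 64, 1, kernel_size=3, padding=1)     # reduce to 1 channel

    def forward(self, x):                      # x: (batch, 48 ROI channels, time)
        e1 = torch.relu(self.enc1(x))
        e2 = torch.relu(self.enc2(e1))
        d1 = torch.relu(self.dec1(self.up(e2)))[..., :e1.shape[-1]]   # crop rounding excess
        fused = torch.cat([d1, self.skip(e1)], dim=1)                 # fuse with pass-through
        return self.dec2(fused)                # (batch, 1, time): estimated PPG waveform

x = torch.randn(2, 48, 300)                    # e.g., 48 ROI time series, 300 frames
print(TinyTimeSeriesUNet()(x).shape)           # torch.Size([2, 1, 300])

In this sketch, the concatenation of the GRU output with the 1×1 convolution output mirrors the pass-through layers 113a-113c, and the final convolution loosely plays the role of the fifth expansion layer 118e by reducing the feature channels to the single-channel PPG waveform.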
At each time scale, the convolution layers of the time series U-net 109a process all samples of the time series window (e.g., a 10 second window) in parallel. In contrast, whereas the computation of each output time step of a convolution can be performed in parallel with the corresponding computations for the other output time steps, the proposed RNN layer (e.g., the GRU layer 113) processes the time samples sequentially. This temporal recurrence has the effect of expanding the temporal receptive field at each layer of the expansion path of the time series U-net 109a.
For example, in the embodiment shown in FIG. 1B, after GRU 113 has run through all time steps in the 10 second window, the resulting hidden state sequence is concatenated 115 with the output of the more standard through-layer (1X 1 convolution) 117. For each 10 second window fed to the GRU 113, the hidden state of the GRU 113 is reinitialized.
Further details regarding the steps performed by the iPPG system 100 to determine a PPG signal are described below with reference to fig. 1C.
Fig. 1C illustrates steps of a method 119 performed by the iPPG system 100 according to an example embodiment. At step 119a, a NIR monochrome video (e.g., NIR video 105) of a person is received. The NIR video 105 may include a person's face or any other body part of the person whose skin is exposed to a camera that records video. The iPPG system 100 can comprise a NIR light source configured to illuminate human skin for recording NIR video 105. Furthermore, the iPPG system 100 can be configured to measure intensities indicative of color changes of skin at different times, wherein each time corresponds to a video frame, i.e. an image in a sequence of images.
For this purpose, the image corresponding to each frame of the input NIR video is partitioned into different areas, wherein the different areas correspond to different parts of the human skin in the image. Landmark detection may be used to identify different areas of a person's skin. For example, if the body part of the person is the face of the person, facial landmark detection may be used to obtain different areas of the face.
At step 119b, a sequence of images including different areas of human skin is received by the time series extraction module 101 of the iPPG system 100.
In step 119c, the image sequence is transformed into a multi-dimensional time series signal by the time series extraction module 101. To this end, the pixel intensities of pixels from each of a plurality of spatial regions 103 (also referred to as "different spatial regions") at a time instant (e.g., in one video frame image 107) are averaged to produce a value for each dimension of the multi-dimensional time series signal at the time instant.
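By way of a non-limiting illustration, the averaging performed in step 119c may be sketched as follows, under the assumption that one boolean mask per skin region is already available (e.g., from landmark detection); the function name and the toy region layout are hypothetical.

import numpy as np

def extract_time_series(frames, roi_masks):
    # frames:    (T, H, W) single-channel video, e.g., NIR frames
    # roi_masks: (R, H, W) boolean masks, one per skin region
    # returns:   (R, T) multi-dimensional time series, one dimension per region
    num_regions, num_frames = roi_masks.shape[0], frames.shape[0]
    series = np.zeros((num_regions, num_frames))
    for r in range(num_regions):
        # mean pixel intensity inside region r at every time instant
        series[r] = frames[:, roi_masks[r]].mean(axis=1)
    return series

frames = np.random.rand(300, 64, 64)                 # toy 10 s of 30 fps video
masks = np.zeros((48, 64, 64), dtype=bool)           # toy 48 rectangular regions
for i in range(48):
    masks[i, (i % 8) * 8:(i % 8) * 8 + 8, (i // 8) * 8:(i // 8) * 8 + 8] = True
print(extract_time_series(frames, masks).shape)      # (48, 300)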
In step 119d, the multi-dimensional time series signal is processed by the time series U-net 109a coupled to the recurrent neural network 109b in the pass-through layers, together forming the TURNIP architecture. The multi-dimensional time series signal is processed by the different layers of the TURNIP architecture to generate the PPG waveform, which in some embodiments is represented as a one-dimensional (1D) time series.
In step 119e, vital signs of the person, such as heart beat or pulse rate, are estimated based on the PPG waveform. In some embodiments, the output 111 of the iPPG system 100 comprises vital signs.
In this way, the PPG estimator module 109 estimates the PPG signal from the multi-dimensional time series signal extracted from the NIR video 105. To this end, the multi-dimensional time series signal is temporally convolved at each layer of the TURNIP architecture. More details regarding the temporal convolution are provided below with reference to fig. 2A-2C. Furthermore, in some embodiments, the estimated vital sign signal is presented on an output device, such as a display device. In some embodiments, the estimated vital signs may further be used to control the operation of one or more external devices associated with the person whose vital signs are estimated.
Extracting a time sequence from a multi-channel video:
in some implementations, such as those shown in fig. 1A and 1C, the iPPG system 100 or the method 119 begins with a single channel video (such as the single channel NIR video 105) as input. While these figures and the corresponding descriptions above apply to single channel NIR video, it should be understood that the same concepts may be similarly applied to other single channel video, such as video collected using a monochrome gray scale camera sensor or a thermal infrared camera sensor.
However, in other implementations, the iPPG system or method starts with multi-channel video. Discussion of multi-channel images in this document mainly discusses RGB video (i.e., video with red, green, and blue channels) as an example of multi-channel video. However, it should be understood that the same concepts may be similarly applied to other multi-channel video inputs, such as multi-channel NIR video, RGB-NIR four-channel video, multi-spectral video, and color video stored using different color space representations other than RGB (such as YUV video), or different arrangements of RGB color channels (such as BGR).
For multi-channel video (such as RGB video), there are a variety of methods by which the time series extraction module extracts the time series from the multi-channel video, and different embodiments extract the time series from the multi-channel video using different methods. Fig. 1E-1H illustrate some of these methods each used in different embodiments of the invention.
Fig. 1E shows an example implementation where the input is RGB video 106. In this embodiment, all color channels except one are ignored, and the time series extraction module 101 extracts the multi-dimensional time series from only a single channel (e.g., the green (G) channel) using methods similar to those described herein for extracting multi-dimensional time series from single channel video, such as NIR video. The green channel is used because, of the three color channels (red, green, and blue), the green channel intensity has been shown to be the one most affected by the blood volume changes detected by iPPG. As in the single-channel case, the output of the time series extraction module 101 is fed into the PPG estimator 109. Each dimension of the multi-dimensional time series is fed into the PPG estimator 109 by treating it as an input channel. A disadvantage of this approach is that it ignores all information in the other two color channels. For example, it has been demonstrated that using three color channels instead of one can help to distinguish intensity variations due to pulsatile blood volume variations (which have a greater impact on the green channel than on the other two color channels) from intensity variations due to interfering factors such as body motion and global illumination variations (which may affect all three color channels more equally).
Fig. 1F shows an example embodiment in which a multi-dimensional time series (e.g., a 48-dimensional time series corresponding to 48 ROIs) is extracted from each of the R, G, and B channels using methods similar to those described herein for extracting multi-dimensional time series from single channel video, such as NIR video. This results in a multi-dimensional time series (e.g., a 48-channel time series) extracted from each of the red channel ("R chan"), the green channel ("G chan"), and the blue channel. These three multi-channel time series are concatenated along the channel dimension to form a single multi-dimensional time series (e.g., with 3·48=144 channels) that is fed into the PPG estimator 109. Each dimension of the multi-dimensional time series is fed into the PPG estimator 109 by treating it as an input channel. One drawback of this approach is that the concatenation obscures the correspondence between time series obtained from the same ROI in different color channels.
Fig. 1G shows another example embodiment in which a multi-dimensional time series (e.g., a 48-dimensional time series corresponding to 48 ROIs) is extracted from each of the R, G, and B channels using methods similar to those described herein for extracting multi-dimensional time series from single channel video, such as NIR video. This again results in a multi-dimensional time series (e.g., a 48-channel time series) extracted from each of the red channel ("R chan"), the green channel ("G chan"), and the blue channel. In this case, the multi-dimensional time series from the color channels R, G, and B are linearly combined to form a single multi-dimensional time series fed into the PPG estimator 109, whose dimensions are the same as the dimensions of the multi-dimensional time series of each channel (e.g., 48 channels x 314 time steps). In some embodiments, the coefficients of the linear combination are learned jointly with the parameters of the neural network. In other embodiments, the coefficients may be selected a priori, e.g., based on a standard color space transformation from RGB to gray scale. Each dimension of the multi-dimensional time series is fed into the PPG estimator 109 by treating it as an input channel. One disadvantage of this approach is that it can learn only a single linear combination to combine the three color channels into one. The same linear combination must be used for all regions, and the combination is data independent (e.g., the same linear combination must be used for all subjects, all skin tones, and all lighting conditions).
Fig. 1H shows an alternative embodiment in which a multi-dimensional time series (e.g., a 48-dimensional time series corresponding to 48 ROIs) is extracted from each of the R, G, and B channels using methods similar to those described herein for extracting multi-dimensional time series from single channel video, such as NIR video. This again results in a multi-dimensional time series (e.g., a 48-channel time series) extracted from each of the red channel ("R chan"), the green channel ("G chan"), and the blue channel. In this case, the multi-dimensional time series from the color channels R, G, and B are shaped into a three-dimensional (3D) array, also referred to as a 3D tensor. The three dimensions of the array correspond to time (e.g., 314 time steps), facial regions (e.g., 48 region channels), and color channels (e.g., 3 color channels). The array forms the input to the PPG estimator 109. The convolution kernels of the first shrink layer and the second shrink layer are constructed such that the color dimension folds into a single dimension at the output of each layer. This method can overcome the drawbacks of the methods described in fig. 1E to 1G.
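By way of a non-limiting illustration, the following Python sketch contrasts the linear combination of fig. 1G with the three-dimensional arrangement of fig. 1H; the class name is hypothetical, and initializing the combination weights with a standard RGB-to-gray transform is an assumption (in the embodiment of fig. 1G the coefficients may instead be learned jointly with the network, as described above).

import torch
import torch.nn as nn

class ColorChannelCombiner(nn.Module):
    # Fig. 1G option: a single learnable linear combination of the per-color
    # time series, shared by all regions, subjects, and lighting conditions.
    def __init__(self, n_colors=3):
        super().__init__()
        self.weights = nn.Parameter(torch.tensor([0.299, 0.587, 0.114][:n_colors]))

    def forward(self, x):                       # x: (batch, colors, regions, time)
        w = self.weights.view(1, -1, 1, 1)
        return (w * x).sum(dim=1)               # -> (batch, regions, time)

x = torch.randn(2, 3, 48, 314)                  # R, G, B series for 48 ROIs, 314 steps
combined = ColorChannelCombiner()(x)            # fig. 1G: (2, 48, 314)
tensor3d = x.permute(0, 2, 1, 3)                # fig. 1H: keep (regions, colors, time)
print(combined.shape, tensor3d.shape)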
Fig. 1I illustrates steps of a method 120 performed by the iPPG system 100 according to an example embodiment, wherein a multi-channel video, such as an RGB video, is received (step 120a). In step 120a, an RGB video (e.g., RGB video 106) of a person is received. The RGB video 106 may include a person's face or any other body part of the person whose skin is exposed to a camera that records the video. Furthermore, the iPPG system 100 can be configured to measure intensities indicative of skin color changes at different moments in time, wherein each moment in time corresponds to a video frame, i.e. an image in a sequence of images.
For this purpose, the image corresponding to each frame of the input RGB video is partitioned into different areas, wherein the different areas correspond to different parts of the human skin in the image. Landmark detection may be used to identify different areas of a person's skin. For example, if the body part of the person is the face of the person, facial landmark detection may be used to obtain different areas of the face.
At step 120b, a sequence of images including different areas of human skin is received by the time series extraction module 101 of the iPPG system 100.
In step 120c, the image sequence is transformed into a multi-dimensional time series signal by the time series extraction module 101. To this end, the pixel intensities of the respective color channels of the pixels from each of a plurality of spatial regions 103 (also referred to as "different spatial regions") at a certain instant (e.g., in one video frame image 107) are averaged to produce a value for each dimension of the multi-dimensional time series signal of that color channel at that instant. For example, a single multi-dimensional time series is extracted from the color channel multi-dimensional time series using one of the methods described in fig. 1E to 1H.
In step 120d, the multi-dimensional time series signal is processed by the time series U-net 109a coupled to the recurrent neural network 109b in the pass-through layers, together forming the TURNIP architecture. The multi-dimensional time series signal is processed by the different layers of the TURNIP architecture to generate the PPG waveform, which in some embodiments is represented as a one-dimensional (1D) time series.
In step 120e, vital signs of the person, such as heart beat or pulse rate, are estimated based on the PPG waveform. In some embodiments, the output 111 of the iPPG system 100 comprises vital signs.
In this way, the PPG estimator module 109 estimates the PPG signal from the multi-dimensional time series signal extracted from the RGB video 106. To this end, the multi-dimensional time series signal is temporally convolved at each layer of the TURNIP architecture. More details regarding the temporal convolution are provided below with reference to fig. 2A-2C. Furthermore, in some embodiments, the estimated vital sign signal is presented on an output device, such as a display device. In some embodiments, the estimated vital signs may further be used to control the operation of one or more external devices associated with the person whose vital signs are estimated.
Fig. 2A illustrates a temporal convolution of an input channel 201 operated by a kernel of size 3 with a stride of 1, according to an example embodiment. Fig. 2B illustrates a temporal convolution of input channel 201 operated by a kernel of size 3 with a stride of 2, according to an example embodiment. Fig. 2C illustrates a temporal convolution of input channel 201 operated by a kernel of size 5 with a stride of 1, according to an example embodiment.
In fig. 2A, it is assumed that a time series 201 in a single input channel (chan_in=1) is provided to one of the convolution layers of the time series U-net 109a (e.g., the convolution layer in the first shrink layer), where the length of the input channel 201 is 10. The input channel 201 corresponds to one dimension of the multi-dimensional time series fed by the time series extraction module 101 to the PPG estimator module 109 (e.g., the input channel 201 is a one-dimensional time series). Furthermore, the length of the corresponding output channel 203 changes based on the stride value used to operate on the input channel.
Each block plotted in the graph of input channel x(t) 201 is set to represent the value of that channel at one time step. Further, each coefficient of the kernel is designated by k(τ). Assume that the size of the kernel used for convolution by the convolution layer with the input channel 201 is 3. Since the kernel size is 3, the kernel includes 3 coefficients corresponding to τ = -1, 0, and 1, respectively. Further, assume that the kernel traverses (or shifts) over the input channel 201 with a stride value s=1 (the stride value may also be referred to as a "stride length"). Further, the convolved output is obtained in the output channel y(t) 203. Thus, the temporal convolution is calculated as follows:

y(t) = ∑_τ x(t+τ) k(τ),   (1)

where τ = -1, 0, and 1. Therefore, the kernel coefficients (also referred to as "learnable filters") are k(-1), k(0), k(1).
Similarly, in fig. 2B and 2C, the time convolution is calculated using equation (1). In fig. 2B, the kernel size is 3, which is the same as the kernel size used in fig. 2A. However, the stride length increases to 2. Thus, the length of the output time series (in the channel y (t)) decreases. In this way, the convolution in fig. 2B downsamples the input by a factor of 2.
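By way of a non-limiting illustration, equation (1) with a centered kernel and a configurable stride may be sketched as follows; only positions where the kernel fits entirely inside the input are evaluated (so the output is shorter than the input), and the kernel values shown are arbitrary placeholders rather than learned coefficients.

import numpy as np

def temporal_conv(x, k, stride=1):
    # y(t) = sum over tau of x(t + tau) * k(tau), for a centered kernel k
    half = len(k) // 2
    taus = np.arange(-half, half + 1)
    out = [sum(x[t + tau] * k[tau + half] for tau in taus)
           for t in range(half, len(x) - half, stride)]
    return np.array(out)

x = np.arange(10, dtype=float)          # input channel of length 10, as in fig. 2A
k3 = np.array([0.25, 0.5, 0.25])        # kernel of size 3: k(-1), k(0), k(1)
print(temporal_conv(x, k3, stride=1))   # 8 valid outputs (fig. 2A)
print(temporal_conv(x, k3, stride=2))   # stride 2 downsamples the output (fig. 2B)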
Fig. 3 illustrates a time convolution with multi-channel input according to an example embodiment. The time convolution with multi-channel input is based on a time convolution with single channel input as shown in fig. 2A-2C. The PPG estimator module 109 uses a time convolution with a multi-channel input corresponding to the multi-dimensional time series signal output by the time series extraction module 101 or output by a previous layer of the PPG estimator network 109.
In fig. 3, three input channels are considered for ease of illustration. However, the number of input channels for convolution in the PPG estimator module 109 is the dimension of the multi-dimensional time series input to the convolution layer. For example, if the multi-dimensional time series signal has 48 dimensions corresponding to 48 face ROIs, the number of convolved channels input into the first two shrink layers is also equal to 48.
Thus, the three input channels are channel 1 of the input feature map (also referred to as a "first channel") 301, channel 2 of the input feature map (also referred to as a "second channel") 303, and channel 3 of the input feature map (also referred to as a "third channel") 305. Let the first channel 301 be denoted x(t), the second channel 303 be denoted y(t), and the third channel 305 be denoted z(t), and let the output channel 307 generated by the temporal convolution of the multiple channels (301-305) be denoted o(t). Furthermore, a kernel of size 3 is used, shifted with a stride value of 4 frames over each of the three input channels (301-305). For each input channel, the temporal convolution of the multiple input channels (301-305) is calculated based on equation (1). The temporal convolution is performed using as many sets of filters as there are channels in the output feature map. In some implementations, a learnable bias is also added to the output of each filter. In some embodiments, at least one of the temporal convolutions is followed by a nonlinear activation function, such as a rectified linear unit (ReLU) or sigmoid activation function.
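By way of a non-limiting illustration, such a multi-channel temporal convolution corresponds to a standard one-dimensional convolution layer; the sketch below assumes 48 input channels and 64 output channels with a kernel of size 3 and stride 1, followed by a ReLU, and uses padding so that the output length matches the input length (a choice not specified above).

import torch
import torch.nn as nn

conv = nn.Conv1d(in_channels=48, out_channels=64, kernel_size=3, stride=1,
                 padding=1)              # learnable bias enabled by default
act = nn.ReLU()

x = torch.randn(1, 48, 314)              # one window: 48 channels x 314 time steps
y = act(conv(x))                          # one filter set per output channel
print(y.shape)                            # torch.Size([1, 64, 314])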
Further, the output of the temporal convolution is passed to the RNN 109b via the pass-through layer (fig. 1B), where it is processed sequentially by the RNN 109b.
Fig. 4 illustrates the sequential processing performed by the RNN 109b (e.g., by the GRU 113 in fig. 1B) according to an example embodiment. The RNN 109b is configured to sequentially process data from the input multi-dimensional time series 401, whose dimensions (time x input channels) represent the number of time steps in the input time series and the number of channels in the input time series, respectively. To this end, the input time series 401 is reshaped into a plurality of shorter time windows 405, each having the same number of channels as the input time series 401. The windows 405 are then sequentially passed to the RNN 109b. In a preferred embodiment, the RNN 109b is implemented as a GRU (such as the GRU 113). Alternatively, in some embodiments, the RNN 109b may be implemented using a long short-term memory (LSTM) neural network.
After the RNN has sequentially processed all of the shorter time windows 405 of the input time series 401, the sequential outputs 407 of the RNN 109b are re-assembled into a longer time window to form the output time series 403 of the RNN, whose dimensions (time x input channels) represent the number of time steps in the output time series (which in some embodiments is the same as the number of time steps in the input time series) and the number of channels in the output time series, respectively. In some implementations, the outputs 407 may be re-assembled into the output time series in the reverse order of the arrangement shown in fig. 4.
Once the entire input time series 401 has been sequentially passed through the RNN and re-assembled into the output time series 403, it is ready to be concatenated (e.g., concatenation 115 in fig. 1B) with the time series output obtained by processing the same input time series using the more standard U-net pass-through branch (e.g., the 1×1 convolution 117 in fig. 1B), which is performed using parallel (i.e., not inherently sequential) computations.
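By way of a non-limiting illustration, the windowed sequential processing of fig. 4 may be sketched as follows; the window length of 32 steps is an arbitrary placeholder, and carrying the GRU hidden state across the shorter windows of one input series (while re-initializing it for each new 10 second input, as described above) is an assumption about one possible implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

def gru_over_windows(x, gru, window=32):
    # x: (channels, time) input to the pass-through layer's recurrent branch
    channels, steps = x.shape
    pad = (-steps) % window                                   # pad to a whole number of windows
    chunks = F.pad(x, (0, pad)).unfold(1, window, window)     # (channels, n_windows, window)
    outputs, hidden = [], None                                # hidden state carried across windows
    for i in range(chunks.shape[1]):
        seq = chunks[:, i, :].transpose(0, 1).unsqueeze(0)    # (1, window, channels)
        out, hidden = gru(seq, hidden)                        # sequential over time steps
        outputs.append(out)
    y = torch.cat(outputs, dim=1).squeeze(0).transpose(0, 1)  # re-assemble (channels, padded time)
    return y[:, :steps]                                       # drop the padding

gru = nn.GRU(input_size=64, hidden_size=64, batch_first=True)
x = torch.randn(64, 314)
print(gru_over_windows(x, gru).shape)     # torch.Size([64, 314])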
At each time scale, the convolution layers of the time series U-net 109a process all samples of the time series window (e.g., a 10 second window) in parallel. In contrast, whereas the computation of each output time step of a convolution can be performed in parallel with the corresponding computations for the other output time steps, the proposed RNN layer (e.g., the GRU layer 113) processes the time samples sequentially. This temporal recurrence has the effect of expanding the temporal receptive field at each layer of the expansion path of the time series U-net 109a.
In this way, the sequential temporal processing of RNN 109b is coupled with the temporal parallel processing of time-series U-net 109a, enabling PPG estimator module 109 to more accurately estimate the PPG signal from the multi-dimensional time-series signal.
Some embodiments are based on the following recognition: in a narrow band including near infrared frequencies of 940nm, the signal observed by the NIR camera is significantly weaker than the signal observed by a color intensity camera (such as an RGB camera). However, the iPPG system 100 is configured to handle such weak intensity signals by using a band pass filter. The band pass filter is configured to reduce noise in the measurement of the pixel intensity for each of the different spatial regions. Further details regarding the processing of the NIR signal into the estimated iPPG signal are described below with reference to fig. 5.
Fig. 5 shows a graph for comparison of a PPG signal spectrum obtained using NIR with a visible part of the spectrum (RGB), according to an example embodiment. As can be seen from fig. 5, the iPPG signal 501 in the NIR (labeled "NIR iPPG signal" in the legend) is about 10 times weaker than the iPPG signal 503 in the RGB (labeled "RGB iPPG signal"). Thus, in some embodiments, the iPPG system 100 comprises: a Near Infrared (NIR) light source illuminating human skin, wherein the NIR light source provides illumination in a first frequency band; and a camera including a processor to measure the intensity of each different region in a second frequency band overlapping the first frequency band such that the measured intensity of a skin region is calculated from the pixel intensities of the image of the skin region.
In some embodiments, the first frequency band and the second frequency band comprise near infrared frequencies of 940 nm. The iPPG system 100 can comprise a filter to denoise the measurement of the intensity of each different region. For this purpose, techniques such as Robust Principal Component Analysis (RPCA) may be used. In an embodiment, the second frequency band has a passband with a width less than 20nm, e.g., the bandpass filter has a narrow passband with a Full Width Half Maximum (FWHM) less than 20nm. In other words, the width of the overlap between the first frequency band and the second frequency band is less than 20nm.
Some embodiments are based on the following recognition: optical filters such as bandpass filters and long-pass filters (i.e., filters that block light with wavelengths shorter than a cutoff wavelength while transmitting light with wavelengths longer than the cutoff wavelength) may be highly sensitive to the angle of incidence of light passing through the filter. For example, an optical filter may be designed to transmit and block specified frequency ranges when light enters the optical filter parallel to the axis of symmetry of the optical filter (substantially perpendicular to the surface of the optical filter), i.e., at an angle of incidence of 0°. When the angle of incidence deviates from 0°, many optical filters exhibit a "blue shift" in which the passband and/or cutoff frequency of the filter is effectively shifted toward shorter wavelengths. To address the blue shift phenomenon, some embodiments use a center frequency of the overlap between the first and second frequency bands that has a wavelength greater than 940nm (e.g., the center frequency of the bandpass optical filter or the cutoff frequency of the long-pass optical filter is shifted to a wavelength longer than 940nm).
Since light from different parts of the skin may be incident on the optical filter at different angles of incidence, the optical filter allows light from different parts of the skin to be transmitted differently. In response, some embodiments use a bandpass filter with a wider passband (e.g., a bandpass optical filter with a passband wider than 20 nm), and thus the overlap width between the first and second frequency bands is greater than 20nm.
In some embodiments, the iPPG system 100 uses a narrow band comprising near infrared frequencies of 940nm to reduce noise due to illumination variations. As a result, the iPPG system 100 provides accurate estimation of human vital signs.
Some embodiments are based on the following recognition: the illumination intensity may not be uniform over the entire body part (e.g., the face of a person) due to factors such as variations in the 3D direction of the surface normal over the face, shadows projected onto the face, and the different distances of different parts of the face from the NIR light source. To make the illumination more uniform across the face, some embodiments use multiple NIR light sources (e.g., two NIR light sources placed on either side of the face and approximately equidistant from the head). In addition, horizontal and vertical diffusers are placed over the NIR light sources to widen the beam reaching the face, minimizing the difference in illumination intensity between the center of the face and the perimeter of the face.
Some embodiments aim to take well-exposed images of skin areas in order to measure strong iPPG signals. However, the intensity of illumination is inversely proportional to the square of the distance from the light source to the face. If the person is too close to the light source, the image becomes saturated and may not contain iPPG signals. If the person is far from the light source, the image may darken and have a weak iPPG signal. Some embodiments may select the most advantageous position of the light source and its brightness setting to avoid taking saturated images while recording well-exposed images at the range of possible distances between the person's skin area and the camera.
In some implementations (such as the implementation shown in fig. 1B), the type of U-net architecture used in the time series U-net 109a is sometimes referred to as a "V-net", because the contracting path of the U-net uses strided convolutions instead of max-pooling operations to reduce the size of the feature maps in the shrink layers. In another embodiment, the time series U-net 109a may be replaced by an architecture based on any other U-net, such as one that uses max pooling in the shrink layers. In other example embodiments, the RNN 109b may be implemented using at least one of a GRU architecture or a long short-term memory (LSTM) architecture.
Furthermore, in order to enable the PPG estimator module 109 to accurately estimate the PPG signal, the PPG estimator module 109 is trained. Details regarding the training of the PPG estimator module 109 are described below.
Training of TURNIP (PPG estimator module):
To train TURNIP, one or more training loss functions may be used. The training loss functions are used to determine optimal values of the network weights such that the similarity between the true value and the estimated value is maximized. For example, let y denote the true PPG signal and ŷ denote the estimated PPG signal in the time domain. In some embodiments, the learning goal of training TURNIP is to find the optimal network weights θ* that maximize the Pearson correlation coefficient between the true PPG signal and the estimated PPG signal. Thus, for any two vectors x and z of length T, the training loss function G(x, z) is defined as the Pearson correlation coefficient

G(x, z) = ∑_t (x_t - μ_x)(z_t - μ_z) / ( √(∑_t (x_t - μ_x)^2) · √(∑_t (z_t - μ_z)^2) ),   (2)

where μ_x and μ_z are the sample means of x and z, respectively. The one or more loss functions may include one or both of a Time Loss (TL) and a Spectral Loss (SL).
To minimize the TL, the network (i.e., TURNIP) parameters are sought such that

θ* = arg max_θ G(y, ŷ(θ)),   (3)

which is equivalent to minimizing the negative Pearson correlation -G(y, ŷ(θ)) between the true and estimated PPG waveforms.
to minimize SL, in some embodiments, the input of the loss function is first transformed to the frequency domain (e.g., using a Fast Fourier Transform (FFT)), and any frequency components that lie outside the desired frequency range are suppressed. For example, for heart rate, frequency components lying outside the [0.6,2.5] hz band are suppressed, as they are outside the typical range of human heart rate. In this case, the network parameters are calculated to solve:
wherein y=fft (Y)And |·| represents the complex modulo operator.
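By way of a non-limiting illustration, the TL and SL objectives may be sketched as differentiable loss functions; the 30 fps frame rate, the [0.6, 2.5] Hz heart-rate band, and the small constant added for numerical stability follow or extend the description above, while the toy signals are placeholders.

import math
import torch

def pearson(x, z):
    # Pearson correlation G(x, z) between two length-T vectors, as in equation (2)
    xc, zc = x - x.mean(), z - z.mean()
    return (xc * zc).sum() / (xc.norm() * zc.norm() + 1e-8)

def time_loss(y_true, y_est):
    # TL: negative Pearson correlation between waveforms (minimized during training)
    return -pearson(y_true, y_est)

def spectral_loss(y_true, y_est, fps=30.0, band=(0.6, 2.5)):
    # SL: negative Pearson correlation between magnitude spectra restricted to the band
    freqs = torch.fft.rfftfreq(y_true.numel(), d=1.0 / fps)
    keep = (freqs >= band[0]) & (freqs <= band[1])
    return -pearson(torch.fft.rfft(y_true).abs()[keep],
                    torch.fft.rfft(y_est).abs()[keep])

y = torch.sin(torch.linspace(0, 20 * math.pi, 300))       # toy true PPG: ~1 Hz at 30 fps
y_hat = y + 0.1 * torch.randn_like(y)                     # toy estimate
print(time_loss(y, y_hat).item(), spectral_loss(y, y_hat).item())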
Training data set:
In an example embodiment, TURNIP is trained on the MERL-Rice Near-Infrared Pulse (MR-NIRP) automotive dataset. The dataset contains facial videos recorded with an NIR camera equipped with a 940 ± 5nm band pass filter. The frames are recorded at 30 frames per second (fps) with a resolution of 640 x 640 and a fixed exposure. The ground-truth PPG waveform is obtained using a finger pulse oximeter (e.g., CMS50D+) recording at 60fps, which is then downsampled to 30fps and synchronized with the video recording. The dataset features 18 subjects and is divided into two main scenarios, labeled driving (city driving) and garage (parked with the engine running). Furthermore, only the "minimal head movement" condition is evaluated for each scenario. The dataset includes female and male subjects, with or without facial hair. Videos were recorded under different weather conditions, at night and during the day. All recordings in the garage setting are 2 minutes (3,600 frames) long, while the driving recordings range from 2 minutes to 5 minutes (3,600-9,000 frames).
Furthermore, the training dataset consists of subjects with heart rates ranging from 40 to 110 beats per minute (bpm). However, the heart rate distribution of the subjects is not uniform: for most subjects, heart rates range approximately from 50bpm to 70bpm, and the dataset has a small number of outliers. Thus, data augmentation techniques are used to address both: (i) the relatively small number of subjects; and (ii) the gaps in the subject heart rate distribution. During training, in addition to the 48-dimensional time series output by the time series extraction module 101, versions resampled at linear resampling rates 1+r and 1-r are generated for each 10 second window, where the value of r ∈ [0.2, 0.6] is randomly selected for each 10 second window.
Thus, data augmentation is particularly useful for those subjects whose heart rates lie outside the distribution, since it is desirable to train TURNIP using as many examples as possible for a given frequency range.
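By way of a non-limiting illustration, the resampling augmentation may be sketched as follows; linear interpolation is an assumption, and because the resampled series is longer or shorter than the original window, fixed-length training windows would then be cut from it.

import numpy as np

def resample_series(series, rate):
    # Linearly resample a (channels, steps) time series by a factor 'rate';
    # played back at the original frame rate, rate > 1 lowers the apparent
    # heart rate and rate < 1 raises it.
    n_channels, n_steps = series.shape
    n_new = int(round(n_steps * rate))
    t_old = np.arange(n_steps)
    t_new = np.linspace(0, n_steps - 1, n_new)
    return np.stack([np.interp(t_new, t_old, series[c]) for c in range(n_channels)])

window = np.random.rand(48, 300)              # one 10 second, 48-dimensional window
r = np.random.uniform(0.2, 0.6)               # per-window augmentation strength
print(resample_series(window, 1 + r).shape, resample_series(window, 1 - r).shape)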
In an example embodiment, TURNIP is trained for 10 epochs and then tested using the trained model (also referred to as "inference"). In another embodiment, TURNIP may be trained for fewer than 10 epochs. In an example embodiment, an Adam optimizer is selected, the batch size is 96, and the learning rate is 1.5×10⁻⁴. The learning rate decreases by a factor of 0.05 at each epoch. In addition, a leave-one-subject-out cross-validation training-testing protocol is used. At test time (i.e., inference time), the time series of the test subject is windowed using the time series extraction module 101, and the heart rate is estimated sequentially with 10 sample steps between the windows. In an example embodiment, one heart rate estimate is output every 10 frames.
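By way of a non-limiting illustration, the training configuration quoted above might be set up as follows; interpreting the per-epoch decrease "by a factor of 0.05" as multiplying the learning rate by 0.95 each epoch is an assumption, and the stand-in model is a placeholder for the full TURNIP network.

import torch

model = torch.nn.Conv1d(48, 1, kernel_size=3, padding=1)      # stand-in for TURNIP
optimizer = torch.optim.Adam(model.parameters(), lr=1.5e-4)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)

for epoch in range(10):                                        # trained for 10 epochs
    # ... iterate over batches of size 96, compute the TL loss, and step the optimizer ...
    scheduler.step()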
Furthermore, two metrics are used to evaluate the performance of the system. The first metric, the percentage of time with error less than 6bpm (PTE6), represents the percentage of Heart Rate (HR) estimates that deviate from the true value by less than 6bpm. The error threshold is set to 6bpm because this is the expected frequency resolution for a 10 second window. The second metric is the Root Mean Square Error (RMSE) between the true and estimated HR. The second metric is measured in bpm for each 10 second window and averaged over the test sequence.
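By way of a non-limiting illustration, the two evaluation metrics may be computed as follows; the per-window heart rates in the example are placeholders.

import numpy as np

def pte6(hr_true, hr_est, threshold=6.0):
    # percentage of estimates whose absolute error is below the threshold (in bpm)
    hr_true, hr_est = np.asarray(hr_true), np.asarray(hr_est)
    return 100.0 * np.mean(np.abs(hr_true - hr_est) < threshold)

def rmse(hr_true, hr_est):
    # root mean square error between true and estimated heart rates (in bpm)
    hr_true, hr_est = np.asarray(hr_true), np.asarray(hr_est)
    return float(np.sqrt(np.mean((hr_true - hr_est) ** 2)))

hr_true = [62, 63, 65, 64, 70]            # toy per-window ground-truth heart rates
hr_est = [60, 64, 72, 63, 69]             # toy per-window estimates
print(pte6(hr_true, hr_est), rmse(hr_true, hr_est))   # 80.0  ~3.35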
Without data augmentation, the standard deviation of PTE6 for the iPPG system 100 was quite high, indicating high variability between subjects. In addition, the impact of data augmentation on the test subjects was also analyzed.
Fig. 6A illustrates the effect of data augmentation on the percentage of time that the error is less than 6bpm (PTE6 metric) according to an example embodiment. Fig. 6B illustrates the effect of data augmentation on the Root Mean Square Error (RMSE) metric, according to an example embodiment. The portions covered by the rectangles in fig. 6A and 6B represent the poor performance of the iPPG system 100 without data augmentation for two subjects with heart rates outside of the distribution. In the dataset, subjects 10 and 12 have the most extreme resting heart rates, one approximately 40bpm and the other approximately 100bpm. Thus, when either of these subjects is tested, the training set does not contain subjects with similar heart rates. Without data augmentation, TURNIP fails entirely for these subjects. With data augmentation, it is much more accurate.
Furthermore, the effect of the GRU units in the pass-through connections was analyzed. The GRUs process the feature maps sequentially at a plurality of time resolutions. Thus, they extract features that exceed the local receptive field of the convolution kernels used by the convolution layers of TURNIP. Adding the GRUs improves the performance of the iPPG system 100. Furthermore, the two training loss functions TL and SL used for training are compared.
Fig. 7 shows a comparison of the PPG signals estimated by TURNIP trained using TL and by TURNIP trained using SL for a test subject according to an example embodiment. Fig. 7 compares SL and TL for an estimated PPG signal within 10 seconds for the test subject. From fig. 7, it is apparent that the performance of TURNIP trained with SL in PPG signal estimation is lower than that of TURNIP trained with TL. As shown in fig. 7, TURNIP trained with TL generates a better estimate of the true PPG signal. Although the signal recovered with SL has a similar frequency, it typically does not match the peaks, and the signal amplitude or shape is distorted. That is, the spectrum and heart rate of the recovered signal are similar in both cases, but the temporal variations are different. Thus, in a preferred embodiment, TURNIP may be trained using the TL training loss function.
Exemplary embodiments:
fig. 8 illustrates a block diagram of an iPPG system 800 according to an example embodiment. The system 800 includes a processor 801 configured to execute stored instructions, and a memory 803 storing instructions executable by the processor 801. Processor 801 may be a single core processor, a multi-core processor, a computing cluster, or any number of other configurations. Memory 803 may include Random Access Memory (RAM), read Only Memory (ROM), flash memory, or any other suitable memory system. The processor 801 is connected to one or more input and output devices by a bus 805.
The instructions stored in the memory 803 correspond to an iPPG method for estimating vital signs of a person based on a set of iPPG signal waveforms measured from different regions of the person's skin. iPPG system 800 can further comprise a storage 807 configured to store various modules, such as time series extraction module 101 and PPG estimator module 109, wherein PPG estimator module 109 comprises a time series U-net 109a and RNN 109b. The above modules stored in the storage 807 are implemented by the processor 801 to perform vital sign estimation. The vital signs correspond to a pulse rate of the person or a heart rate variation of the person. The storage 807 may be implemented using a hard disk drive, optical disk drive, thumb drive, drive array, or any combination thereof.
The time series extraction module 101 obtains images from the frames of one or more videos 809 fed to the iPPG system 800, wherein the one or more videos 809 comprise videos of a body part of a person whose vital sign is to be estimated. The one or more videos may be recorded by one or more cameras. The time series extraction module 101 may segment the image from each frame into a plurality of spatial regions corresponding to ROIs of the body part that are strong indicators of the PPG signal, wherein the segmentation of the image into the plurality of spatial regions forms an image sequence of the body part. Each image includes a different region of skin of the body part in the image. The image sequence may be transformed into a multi-dimensional time series signal. The multi-dimensional time series signal is provided to the PPG estimator module 109. The PPG estimator module 109 processes the multi-dimensional time series signal by temporally convolving it using the time series U-net 109a, and the convolved data is further processed sequentially by the RNN 109b to estimate the PPG waveform, wherein the PPG waveform is used to estimate the vital signs of the person.
The iPPG system 800 comprises an input interface 811 that receives one or more videos 809. For example, the input interface 811 may be a network interface controller adapted to connect the iPPG system 800 to a network 813 through the bus 805.
Additionally or alternatively, in some implementations, the iPPG system 800 is connected to a remote sensor 815 (such as a camera) to collect one or more videos 809. In some implementations, a human-machine interface (HMI) 817 within the iPPG system 800 connects the iPPG system 800 to an input device 819, such as a keyboard, mouse, trackball, touchpad, joystick, pointer stick, pen, touch screen, and the like.
The iPPG system 800 can be linked to an output interface by a bus 805 to present PPG waveforms. For example, the iPPG system 800 can comprise a display interface 821 adapted to connect the iPPG system 800 to a display device 823, wherein the display device 823 can comprise, but is not limited to, a computer monitor, projector, or mobile device.
The iPPG system 800 can further comprise and/or be connected to an imaging interface 825, the imaging interface 825 being adapted to connect the iPPG system 800 to an imaging device 827.
In some implementations, the iPPG system 800 can be connected via the bus 805 to an application interface 829 adapted to connect the iPPG system 800 to an application system 831 that can operate based on the estimated vital signs. In an exemplary scenario, the application system 831 is a patient monitoring system that uses the vital signs of a patient. In another exemplary scenario, the application system 831 is a driver monitoring system that uses the vital signs of a driver to determine whether the driver is able to drive safely, such as whether the driver is drowsy.
Fig. 9 illustrates a patient monitoring system 900 using an iPPG system 800 according to an example embodiment. To monitor vital signs of a patient, a camera 903 is used to take images, i.e. a video sequence, of the patient 901.
The camera 903 may include a CCD or CMOS sensor for converting incident light and its intensity variation into an electrical signal. The camera 903 non-invasively captures light reflected from the skin portion of the patient 901. The skin portion may thus particularly refer to a forehead, neck, wrist, a portion of an arm, or some other portion of the patient's skin. A light source (e.g., a near infrared light source) may be used to illuminate a patient or a region of interest including a portion of the patient's skin.
Based on the captured images, the iPPG system 800 determines vital signs of the patient 901. Specifically, the iPPG system 800 determines vital signs of the patient 901 such as heart rate, respiration rate, or blood oxygen. Further, the determined vital signs are typically displayed on an operator interface 905 for presenting the determined vital signs. Such an operator interface 905 may be a patient bedside monitor or may also be a remote monitoring station in a dedicated room in a hospital, such as a remote monitoring station in a community care facility of a nursing home, or even a remote monitoring station in a remote location in a remote medical application.
Fig. 10 illustrates a driver assistance system 1000 using the iPPG system 800 according to an example embodiment. An NIR light source and/or an NIR camera 1001 are arranged within the vehicle 1003. Specifically, the NIR camera 1001 may be disposed such that the driver 1005 is within its field of view (FOV) 1007. The iPPG system 800 is integrated into the vehicle 1003. The NIR light source is configured to illuminate the skin of the person driving the vehicle (driver 1005), and the NIR camera 1001 is configured to record video of the driver in real time. Further, the NIR video is fed to the iPPG system 800 to measure iPPG signals from different areas of the skin of the driver 1005. The iPPG system 800 receives the measured iPPG signals and determines vital signs of the driver 1005, such as pulse rate.
Further, the processor of the iPPG system 800 can generate one or more control action commands based on the estimated vital signs of the driver 1005 of the vehicle 1003. The one or more control action commands include vehicle braking, steering control, generating an alert notification, initiating an emergency service request, or switching driving modes. The one or more control action commands are transmitted to a controller of the vehicle 1003, and the controller may control the vehicle 1003 according to the one or more control action commands. For example, if the determined pulse rate of the driver is very low, the driver 1005 may be experiencing a heart attack. Thus, the iPPG system 800 can generate control commands for reducing the vehicle speed and/or for steering control (e.g., steering the vehicle to the shoulder of an expressway and stopping it), and/or initiate an emergency service request.
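By way of a non-limiting illustration, the mapping from an estimated pulse rate to control action commands might be sketched as follows; the thresholds and command identifiers are hypothetical and are not part of the disclosure.

def control_actions(pulse_rate_bpm, low=40.0, high=150.0):
    # hypothetical thresholds on the estimated pulse rate
    if pulse_rate_bpm < low or pulse_rate_bpm > high:
        return ["reduce_speed", "steer_to_shoulder", "request_emergency_services"]
    return []

print(control_actions(35.0))   # abnormal pulse rate triggers safety actions
print(control_actions(72.0))   # normal pulse rate: no action needed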
The above description provides exemplary embodiments only, and is not intended to limit the scope, applicability, or configuration of the disclosure. Rather, the foregoing description of the exemplary embodiments will provide those skilled in the art with a thorough description of the exemplary embodiments for implementing the one or more exemplary embodiments. It is contemplated that various changes may be made in the function and arrangement of elements without departing from the spirit and scope of the disclosed subject matter as set forth in the appended claims.
In the above description, specific details are given to provide a thorough understanding of the embodiments. However, it will be understood by those of ordinary skill in the art that the embodiments may be practiced without these specific details. For example, systems, processes, and other elements in the disclosed subject matter may be shown as components in block diagram form in order not to obscure the embodiments in unnecessary detail. In other instances, well-known processes, structures and techniques may be shown without unnecessary detail in order to avoid obscuring the embodiments. Moreover, like reference numbers and designations in the various drawings indicate like elements.
Further, various embodiments may be described as a process which is depicted as a flowchart, a flow diagram, a data flow diagram, a structure diagram, or a block diagram. Although a flowchart may describe the operations as a sequential process, many of the operations can be performed in parallel or concurrently. Additionally, the order of operations may be rearranged. A process may terminate when its operations are completed, but may have additional steps not discussed or included in the figure. Moreover, not all operations in any particular described process may occur in all embodiments. A process may correspond to a method, a function, a process, a subroutine, etc. When a process corresponds to a function, the termination of the function may correspond to the return of the function to the calling function or the main function.
Furthermore, embodiments of the disclosed subject matter may be implemented at least partially manually or automatically. The manual or automatic implementation may be performed or at least assisted by using machines, hardware, software, firmware, middleware, microcode, hardware description languages, or any combination thereof. When implemented in software, firmware, middleware or microcode, the program code or code segments to perform the necessary tasks may be stored in a machine readable medium. The processor may perform the necessary tasks.
The various methods or processes outlined herein may be coded as software that is executable on one or more processors that employ any one of a variety of operating systems or platforms. Additionally, such software may be written using any of a number of suitable programming languages and/or programming or scripting tools, and also may be compiled as executable machine language code or intermediate code that is executed on a framework or virtual machine. Generally, the functionality of the program modules may be combined or distributed as desired in various embodiments.
Embodiments of the present disclosure may be embodied as a method, examples of which have been provided. Acts performed as part of the method may be ordered in any suitable way. Thus, an embodiment may be constructed in which acts are performed in a different order than illustrated, which may include performing some acts simultaneously, even though the acts are shown as sequential acts in the exemplary embodiment. Although the present disclosure has been described with reference to certain preferred embodiments, it should be understood that various other adaptations and modifications may be made within the spirit and scope of the present disclosure. It is therefore intended that the appended claims cover all such variations and modifications as fall within the true spirit and scope of this present disclosure.

Claims (20)

1. An imaging photoplethysmography, iPPG, system for estimating vital signs of a person from skin images of the person, the iPPG system comprising: at least one processor; and a memory having instructions stored thereon that, when executed by the at least one processor, cause the iPPG system to:
receiving a sequence of images of different areas of the skin of the person, each area comprising pixels of different intensities indicative of a color change of the skin;
transforming the image sequence into a multi-dimensional time series signal, each dimension corresponding to a different region of the skin;
processing the multi-dimensional time series signal with a time series U-net neural network to generate a PPG waveform, wherein the U-shape of the time series U-net neural network comprises a contracting path followed by an expanding path, the contracting path comprising a series of shrink layers and the expanding path comprising a series of expansion layers, wherein at least some of the shrink layers downsample their inputs and at least some of the expansion layers upsample their inputs, forming pairs of shrink and expansion layers of corresponding resolution, wherein at least some of the corresponding pairs of shrink and expansion layers are connected by pass-through layers, and wherein at least one of the pass-through layers comprises a recurrent neural network that sequentially processes its input;
estimating vital signs of the person based on the PPG waveform; and
presenting the estimated vital sign of the person.
2. The iPPG system of claim 1, wherein at least one of the series of shrink layers uses a strided convolution with a stride greater than 1 to downsample and process its input.
3. The iPPG system of claim 1, wherein at least one expansion layer of the series of expansion layers upsamples its input using an upsampling operation to produce an upsampled input, and wherein the expansion layer comprises a plurality of convolution layers that process the upsampled input.
4. The iPPG system of claim 1, wherein the recurrent neural network comprises a gated recurrent unit GRU or a long short term memory LSTM network.
5. The iPPG system of claim 1, wherein a shrink layer in the series of shrink layers receives its input from a preceding shrink layer and submits its output to both a next shrink layer in the series of shrink layers and a corresponding pass-through layer.
6. The iPPG system of claim 1, wherein to estimate vital signs of the person from the PPG waveform, the at least one processor is configured to process each segment in the sequence of overlapping segments of the multi-dimensional time series signal with the time series U-net neural network.
7. The iPPG system of claim 6, wherein the signal of the vital sign of the person is a one-dimensional signal.
8. The iPPG system of claim 1, wherein to generate the multi-dimensional time series signal, the at least one processor is configured to:
identifying different areas of the person's skin using facial landmark detection; and
pixel intensities of pixels from respective ones of the different regions at a time instant are averaged to produce values for respective dimensions of the multi-dimensional time series signal at the time instant.
9. The iPPG system of claim 8, wherein each dimension of the multi-dimensional time series signal is a signal corresponding to a respective region of the different regions of the skin, wherein each region is an explicitly tracked region of interest ROI.
10. The iPPG system of claim 1, wherein the transformation comprises a cascading operation that combines more than one multi-dimensional time series each extracted from a different channel of a multi-channel video into a single multi-dimensional time series comprising the multi-dimensional time series signal.
11. The iPPG system of claim 1, wherein the transformation comprises a linear combination that combines more than one multi-dimensional time series each extracted from a different channel of a multi-channel video into a single multi-dimensional time series comprising the multi-dimensional time series signal.
12. The iPPG system of claim 1, wherein the transforming comprises extracting more than one multi-dimensional time series, each extracted from a different channel of a multi-channel video, and shaping the more than one multi-dimensional time series into a 3D array comprising the multi-dimensional time series signal.
13. The iPPG system of claim 1, wherein the time series U-net neural network is trained to maximize pearson correlation coefficients between real data associated with the PPG waveform and the estimated PPG signal.
14. The iPPG system of claim 1, wherein the time series U-net neural network is trained with a time loss function or a spectrum loss function.
15. The iPPG system of claim 1, wherein the vital sign is one or a combination of a pulse rate of the person and a heart rate variation of the person.
16. The iPPG system of claim 1, wherein the person corresponds to a driver of a vehicle, and wherein the at least one processor is further configured to generate one or more control commands for a controller of the vehicle based on vital signs of the driver.
17. The iPPG system of claim 16, further comprising:
a controller configured to perform a control action based on a signal of a vital sign of the person.
18. The iPPG system of claim 1, further comprising:
a camera comprising a processor configured to measure intensities indicative of color changes of said skin at different moments in time to produce said sequence of images,
a display device configured to display a signal of a vital sign of the person.
19. A method for estimating vital signs of a person, wherein the method uses a processor coupled with stored instructions implementing the method, wherein the instructions when executed by the processor implement the steps of the method, the method comprising the steps of:
receiving a sequence of images of different areas of the skin of the person, each area comprising pixels of different intensities indicative of a color change of the skin;
transforming the image sequence into a multi-dimensional time series signal, each dimension corresponding to a different region of the skin;
processing the multi-dimensional time series signal with a time series U-net neural network to generate a PPG waveform, wherein the U-shape of the time series U-net neural network comprises a contracting path followed by an expanding path, the contracting path comprising a series of shrink layers and the expanding path comprising a series of expansion layers, wherein at least some of the shrink layers downsample their inputs and at least some of the expansion layers upsample their inputs, forming pairs of shrink and expansion layers of corresponding resolution, wherein at least some of the corresponding pairs of shrink and expansion layers are connected by pass-through layers, and wherein at least one of the pass-through layers comprises a recurrent neural network that sequentially processes its input;
estimating vital signs of the person based on the PPG waveform; and
presenting the estimated vital sign of the person.
20. A non-transitory computer readable storage medium having embodied thereon a program executable by a processor for performing a method comprising the steps of:
receiving a sequence of images of different areas of the skin of the person, each area comprising pixels of different intensities indicative of a color change of the skin;
transforming the image sequence into a multi-dimensional time series signal, each dimension corresponding to a different region of the skin;
processing the multi-dimensional time series signal with a time series U-net neural network to generate a PPG waveform, wherein the U-shape of the time series U-net neural network comprises a contracting path followed by an expanding path, the contracting path comprising a series of shrink layers and the expanding path comprising a series of expansion layers, wherein at least some of the shrink layers downsample their inputs and at least some of the expansion layers upsample their inputs, forming pairs of shrink and expansion layers of corresponding resolution, wherein at least some of the corresponding pairs of shrink and expansion layers are connected by pass-through layers, and wherein at least one of the pass-through layers comprises a recurrent neural network that sequentially processes its input;
estimating vital signs of the person based on the PPG waveform; and
presenting the estimated vital sign of the person.
CN202280056911.9A 2021-08-26 2022-07-21 Imaging photoplethysmography (IPPG) system and method for remote measurement of vital signs Pending CN117835900A (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US63/237,347 2021-08-26
US17/486,981 US20230063221A1 (en) 2021-08-26 2021-09-28 Imaging Photoplethysmography (IPPG) System and Method for Remote Measurements of Vital Signs
US17/486,981 2021-09-28
PCT/JP2022/030510 WO2023026861A1 (en) 2021-08-26 2022-07-21 Imaging photoplethysmography (ippg) system and method for remote measurements of vital signs

Publications (1)

Publication Number Publication Date
CN117835900A true CN117835900A (en) 2024-04-05

Family

ID=90519525

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202280056911.9A Pending CN117835900A (en) 2021-08-26 2022-07-21 Imaging photoplethysmography (IPPG) system and method for remote measurement of vital signs

Country Status (1)

Country Link
CN (1) CN117835900A (en)

Similar Documents

Publication Publication Date Title
CN112074226B (en) System and method for remote measurement of vital signs
Lokendra et al. AND-rPPG: A novel denoising-rPPG network for improving remote heart rate estimation
Magdalena Nowara et al. SparsePPG: Towards driver monitoring using camera-based vital signs estimation in near-infrared
Zhang et al. Driver drowsiness detection using multi-channel second order blind identifications
US12056870B2 (en) System and method for remote measurements of vital signs of a person in a volatile environment
Blöcher et al. An online PPGI approach for camera based heart rate monitoring using beat-to-beat detection
US20210386343A1 (en) Remote prediction of human neuropsychological state
Li et al. Comparison of region of interest segmentation methods for video-based heart rate measurements
Lampier et al. A deep learning approach to estimate pulse rate by remote photoplethysmography
Ryu et al. A measurement of illumination variation-resistant noncontact heart rate based on the combination of singular spectrum analysis and sub-band method
US20230063221A1 (en) Imaging Photoplethysmography (IPPG) System and Method for Remote Measurements of Vital Signs
Jaiswal et al. Heart rate estimation network from facial videos using spatiotemporal feature image
Chowdhury et al. LGI-rPPG-Net: A shallow encoder-decoder model for rPPG signal estimation from facial video streams
Abdulrahaman Two-stage motion artifact reduction algorithm for rPPG signals obtained from facial video recordings
Lee et al. Video analytic based health monitoring for driver in moving vehicle by extracting effective heart rate inducing features
CN117694845A (en) Non-contact physiological signal detection method and device based on fusion characteristic enhancement
Suriani et al. Non-contact facial based vital sign estimation using convolutional neural network approach
Comas et al. Turnip: Time-series U-Net with recurrence for NIR imaging PPG
Patil et al. A low-cost, camera-based continuous ppg monitoring system using laplacian pyramid
CN117835900A (en) Imaging photoplethysmography (IPPG) system and method for remote measurement of vital signs
Arppana et al. Real time heart beat monitoring using computer vision
EP4391903A1 (en) Imaging photoplethysmography (ippg) system and method for remote measurements of vital signs
Liu et al. Adaptive-weight network for imaging photoplethysmography signal extraction and heart rate estimation
Cheng et al. Exploring the feasibility of seamless remote heart rate measurement using multiple synchronized cameras
Kopeliovich et al. Color signal processing methods for webcam-based heart rate evaluation

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination