WO2023055862A1

WO2023055862A1 - Systems and methods for platform-agnostic, real-time physiologic vital sign detection from video stream data

Info

Publication number: WO2023055862A1
Application number: PCT/US2022/045126
Authority: WO
Inventors: Mykola MAKSYMENKO; Tetiana YEZERSKA; Vladyslav SELOTKIN; Marian PETRUK; Oleksandra KOZHEMIAKINA; Petro FRANCHUK; Vladyslav TSYBULNYK; Oleh MENCHYSHYN
Original assignee: Softserve, Inc.
Priority date: 2021-10-01
Filing date: 2022-09-29
Publication date: 2023-04-06
Also published as: WO2023055862A9

Abstract

Presented herein are systems and methods for platform-agnostic, real-time, touchless (remote), automated physiologic vital sign detection from video stream data. In certain embodiments, these systems and methods facilitate use of a wide variety of commercially - available consumer electronics - e.g., smartphones, personal computers, tablets, browser applications, cameras, and other hardware and software - to generate and transmit image data, for example, a video stream, which is automatically analyzed to detect the vital signs of a subject, such as heart rate/pulse (and heart rate variability), peripheral capillary oxygen saturation (SpO₂), respiration rate, emotional state, and/or other signals derived therefrom.

Description

SYSTEMS AND METHODS FOR PLATFORM-AGNOSTIC, REAL-TIME PHYSIOLOGIC VITAL SIGN DETECTION FROM VIDEO STREAM DATA

CROSS REFERENCE TO RELATED APPLICATIONS

[0001] This application claims the benefit of U.S. Provisional Application No. 63/251,108, filed October 17, 2021, which is herein incorporated by reference in its entirety.

FIELD

[0002] The present disclosure relates generally to systems and methods for remote and/or contactless detection and monitoring of biological vital signs.

BACKGROUND

[0003] An increasing number of patients are receiving health-related services via electronic and telecommunication technologies. These technologies allow remote, “virtual” visits between a patient and his/her doctor or other medical practitioner or clinician. The telehealth industry has seen rapid growth in the wake of the COVID-19 pandemic.

Telehealth can facilitate access to care, reduce risk of transmission of SARS-CoV-2 and other pathogens, and reduce strain on health care capacity and facilities.

[0004] There are significant limitations of telemedicine due to the inability of a medical practitioner to examine a patient in the flesh, although there has been some progress in the remote, automated detection of vital signs. For example, software has been developed to detect pixel intensity changes and/or track body movement from video streams to identify heart rate, respiration rate, and other derivative signals [e.g., see G. Balakrishnan, F. Durand and J. Guttag, “Detecting Pulse from Head Motions in Video,” 2013 IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 3430-3437, doi:

10.1109/CVPR.2013.440; and Ming-Zher Poh, Daniel J. McDuff, and Rosalind W. Picard, “Non-contact, automated cardiac pulse measurements using video imaging and blind source separation.,” Opt. Express 18, 10762-10774 (2010)].

[0005] These approaches generally suffer in quality due to a lack of uniformity in the video image data being captured and analyzed. For example, different patients use different smartphones, personal computers, tablets, browser applications, and other equipment and networks in different environments, under different and/or changing lighting conditions, and the like. Requiring patients to use standardized equipment under standardized conditions may improve vital sign detection results, but this may be impractical or expensive. Vital sign sensors, for example, in the form of patient wearables, may improve accuracy, but they have the disadvantage of increased cost and reliance on proper use by the patient.

[0006] There is a need for improved systems and methods for remote, automated vital sign detection.

SUMMARY OF THE INVENTION

[0007] Presented herein are systems and methods for platform-agnostic, real-time, touchless (remote), automated physiologic vital sign detection from video stream data. In certain embodiments, these systems and methods facilitate use of a wide variety of commercially-available consumer electronics - e.g., smartphones, personal computers, tablets, browser applications, cameras, and other hardware and software - to generate and transmit image data, for example, a video stream, which is automatically analyzed to detect the vital signs of a subject, such as heart rate/pulse (and heart rate variability), peripheral capillary oxygen saturation (SpCb). respiration rate, emotional state, and/or other signals derived therefrom.

[0008] For example, in certain embodiments, the vital signs are remotely detected without the need for patient wearables or other special equipment, and without the need for the subject to use standardized hardware or software. The subject can use the Internet- connected hardware and software he/she already owns and is already familiar with using. Moreover, the systems and methods presented herein accommodate a variety of scenarios that include light and motion variability. This allows for conducting fast and easy health checks. For example, if abnormal measurements are detected, a medical professional may be alerted. [0009] In certain embodiments, problems such as ambient light variability and/or camera/subject motion is overcome via one or more of the following, as presented in detail herein: (i) robust signal processing techniques, (ii) light and motion compensation and stabilization, and (iii) a novel frequency transformation technique for biosignal frequency reading (e.g., application of two-dimensional graphical spectrogram representation to ensure signal stability).

[0010] In certain embodiments, the systems and methods are used for telemedicine, for example, for remote video meetings (teleconsultations) between a patient and a medical practitioner. In other embodiments, the systems and methods may be utilized by an individual via the individual’s smartphone or other computing device, for automated detection and/or tracking of that individual’s (or a family member’s) vital signs. These embodiments may be of particular value, for example, in the early diagnosis, prognosis, and/or initiation of treatment of COVID-19 or other conditions, where heart rate, blood oxygen level, and/or respiration rate are potential indicators of disease.

[0011] The systems and methods described herein enable real-time vital sign detection on multiple platforms (e.g., consumer camera, browser application, mobile or desktop application, and the like). In certain embodiments, the systems and methods described herein are implemented as a cross-platform software development kit (SDK), for example, a multistage end-to-end signal processing pipeline wrapped as a C++ SDK. The systems and methods described herein may include wrappers (e.g., written in Python, Java, and/or C++) for integration with typical applications scenarios. In some embodiments, the systems and methods described herein include a desktop application, a browser application utilizing server-client architecture, and/or a mobile application (e.g., a smartphone app, e.g., a standalone mobile application).

[0012] In certain embodiments, the solutions utilize a multithreaded design which provides efficient and fast processing of frame sequences for near real-time execution of enduser applications. This applies for both edge-device applications as well as a cloud-based scheme in which the signal processing is accomplished on a remote server.

[0013] In certain embodiments, the solution is a multistage end-to-end framework/SDK for vital signs estimation from RGB camera data, e.g., video data transmitted through an electronic network (e.g., the Internet) from a consumer smartphone, laptop, or desktop computer. The digital signal processing pipeline is designed for efficient multithread processing such that different steps of the pipeline can be run in parallel, e.g., for different sequences of frames obtained from the camera. In certain embodiments, the video digital signal processing pipeline includes the following steps, each of which is described in more detail in the Detailed Description section herein: (i) face detection, (ii) facial landmarks detection and region of interest (ROI) selection, (iii) ambient light detection and compensation module, (iv) motion compensation module, (v) color calibration using a common reference object, and (vi) heart-rate computation pipeline, which involves (a) skin tone normalization, (b) a combination of approaches for low-dimensional signal extraction (Green algorithm, Plane Orthogonal to Skin-tone (POS) algorithm, and chrominance (CHROM) algorithm), (c) a frequency-based approach for heart-rate extraction (e.g., use of Lomb-Scargle frequency transform), and (d) signal enhancement and stabilization via 2D spectrogram analysis. [0014] In one aspect, the invention is directed to a system for real-time (including near-real-time) detection of one or more vital signs [e.g., heart rate/pulse, peripheral capillary oxygen saturation (SpCh), respiration rate, emotional state, or a signal derived from one or more of the above] of a subject from a video stream (e.g., image data from an RGB camera) depicting the subject, the system comprising: a processor of a computing device; and a memory having instructions stored thereon, wherein the instructions, when executed by the processor, cause the processor to: receive a digital signal corresponding to a series of frames of the video stream depicting the subject; for each of a plurality of the frames, identify a face region of interest (ROI) (e.g., apply a face bounding box) and identify a plurality of facial landmarks within the face ROI (e.g., identify a forehead region, a cheeks region, and an adaptive region); extract photoplethysmography (PPG) data (e.g., a low-dimensional signal) from the identified facial landmarks in the frames [e.g., wherein the PPG data comprises one or more of (i), (ii), and (iii) as follows: (i) Green: a mean of a green channel in a plurality of frames, (ii) CHROM: a color difference signal, e.g., assuming a standardized skin-tone, and (iii) POS: a projection of the signal to a plane, e.g., a 2D plane orthogonal to the skin-tone in a temporally normalized RGB space]; and identify, in real-time (or near real-time), the one or more vital signs (e.g., a heart rate/pulse) of the subject from the PPG data (e.g., using a frequency transform, e.g., a Lomb-Scargle frequency transform), (e.g., track, in real/near-real time, the one or more vital signs over a time period contemporaneous with that depicted in the video stream).

[0015] In certain embodiments, the instructions cause the processor to identify the one or more vital signs from the PPG data by performing a data signal stabilization based on a 2D frequency representation, ensuring a dominant component has connectivity over a sufficiently extended period of time (e.g., using a 2D graphical spectrogram representation of the PPG data).

[0016] In certain embodiments, the instructions cause the processor to apply (i) light compensation and/or stabilization, and (ii) motion compensation and/or stabilization to the plurality of frames of the video stream.

[0017] In certain embodiments, the processor utilizes a multithreading architecture for enhanced real-time/near-real-time performance (e.g., to perform parallel processing of data from different sequences of frames of the video stream).

[0018] In certain embodiments, the instructions further cause the processor to identify a poor light condition from the frames of the video stream and, upon identification of said poor light condition, render a prompt to a user (e.g., the subject) via a user interface indicating said poor light condition (e.g., and/or recommending the user find a better lit environment before proceeding).

[0019] In certain embodiments, the instructions further cause the processor to identify excessive motion of the subject from the frames of the video stream (e.g., motion which cannot be accurately compensated for) and, upon identification of said excessive motion, render a prompt to a user (e.g., the subject) via a user interface indicating said motion (e.g., and/or recommending the subject to stay still).

[0020] In another aspect, the invention is directed to a method for real-time (including near-real-time) automated detection of one or more vital signs [e.g., heart rate/pulse, peripheral capillary oxygen saturation (SpCh), respiration rate, emotional state, or a signal derived from one or more of the above] of a subject from a video stream (e.g., image data from an RGB camera) depicting the subject, the method comprising: receiving, by a processor of a computing device, a digital signal corresponding to a series of frames of the video stream depicting the subject; for each of a plurality of the frames, identifying, by the processor, a face region of interest (ROI) (e.g., apply a face bounding box) and identify a plurality of facial landmarks within the face ROI (e.g., identify a forehead region, a cheeks region, and an adaptive region); extracting, by the processor, photoplethysmography (PPG) data (e.g., a low-dimensional signal) from the identified facial landmarks in the frames [e.g., wherein the PPG data comprises one or more of (i), (ii), and (iii) as follows: (i) Green: a mean of a green channel in a plurality of frames, (ii) CHROM: a color difference signal, e.g., assuming a standardized skin-tone, and (iii) POS: a projection of the signal to a plane, e.g., a 2D plane orthogonal to the skin-tone in a temporally normalized RGB space]; and identifying, by the processor, in real-time (or near real-time), the one or more vital signs (e.g., a heart rate/pulse) of the subject from the PPG data (e.g., using a frequency transform, e.g., a Lomb-Scargle frequency transform), (e.g., track, in real/near-real time, the one or more vital signs over a time period contemporaneous with that depicted in the video stream).

[0021] In certain embodiments, identifying the one or more vital signs from the PPG data comprises performing a data signal stabilization based on a 2D frequency representation, ensuring a dominant component has connectivity over a sufficiently extended period of time (e.g., using a 2D graphical spectrogram representation of the PPG data).

[0022] In certain embodiments, the method comprises applying, by the processor, (i) light compensation and/or stabilization, and (ii) motion compensation and/or stabilization to the plurality of frames of the video stream. [0023] In certain embodiments, the processor utilizes a multithreading architecture for enhanced real -time/near-real -time performance (e.g., to perform parallel processing of data from different sequences of frames of the video stream).

[0024] In certain embodiments, the method comprises identifying, by the processor, a poor light condition from the frames of the video stream and, upon identification of said poor light condition, rendering a prompt to a user (e.g., the subject) via a user interface indicating said poor light condition (e.g., and/or recommending the user find a better lit environment before proceeding).

[0025] In certain embodiments, the method comprises identifying, by the processor, excessive motion of the subject from the frames of the video stream (e.g., motion which cannot be accurately compensated for) and, upon identification of said excessive motion, rendering a prompt to a user (e.g., the subject) via a user interface indicating said motion (e.g., and/or recommending the subject to stay still).

BRIEF DESCRIPTION OF THE DRAWINGS

[0026] The foregoing and other objects, aspects, features, and advantages of the present disclosure will become more apparent and better understood by referring to the following description taken in conjunction with the accompanying drawings, in which: [0027] FIG. 1 is a schematic of a digital signal processing (DSP) pipeline for heart rate (HR) estimation based on time-sequential frames (e.g., video image data) in which the face of an individual appears, according to an illustrative embodiment.

[0028] FIG. 2 is a schematic illustrating facial landmarks detected by FaceMesh, in accordance with an illustrative embodiment of the DSP pipeline.

[0029] FIGS. 3A, 3B, 3C, 3D, 3E, and 3F are images showing the different shapes and sizes of various ROIs on the cheeks of a subject as they appear in video of the subject, in accordance with an illustrative embodiment of the DSP pipeline.

[0030] FIGS. 4A and 4B are images as they appear in an illustrative software implementation, where FIG. 4A depicts a frame of the user with a graphical user interface (GUI) message overlay indicating successful measurement of heart rate and SpCh vital signs for the user where illumination was identified as satisfactory, while FIG. 4B shows a frame of the user with a GUI message overlay indicating unstable lighting, with a prompt indicating better lighting is needed for vital sign measurement, in accordance with an illustrative embodiment. [0031] FIG. 5 is a schematic showing face landmarks (points) and reference distance (red line) used in a motion compensation module of the DSP pipeline, in accordance with an illustrative embodiment.

[0032] FIG. 6 depicts a chart showing the ratio of number of frames with detected motion to the total number of frames in a given video, used in the setting of thresholds for a motion compensation module of the DSP pipeline, in accordance with an illustrative embodiment.

[0033] FIGS. 7A and 7B are images as they appear in an illustrative software implementation, where FIG. 7A depicts a frame of the user with a graphical user interface (GUI) display of successfully measured heart rate and SpCh vital signs for the user, where there was no excessive motion detected (e.g., and where illumination was identified as satisfactory), and where FIG. 7B depicts a frame of the user with a GUI message indicating excessive motion was detected, and recommending the user remain still for better vital sign detection results, in accordance with an illustrative embodiment.

[0034] FIG. 8A depicts an example of a periodogram, in accordance with an illustrative embodiment.

[0035] FIG. 8B is an example of a spectrogram that describes a stabilization module of the DSP pipeline via 2D spectrogram analysis, in accordance with an illustrative embodiment.

[0036] FIG. 9 is a schematic depicting multithread decomposition for the various modules described herein, in accordance with an illustrative embodiment.

[0037] FIG. 10 is a schematic depicting an illustrative desktop application architecture that works in conjunction with an SDK for executing the DSP pipeline, in accordance with an illustrative embodiment.

[0038] FIG. 11 is a schematic depicting an illustrative browser application utilizing server-client architecture, performing indicated steps in the cloud, in accordance with an illustrative embodiment.

[0039] FIG. 12 is a schematic depicting an illustrative standalone mobile application architecture that works in conjunction with an SDK, in accordance with an illustrative embodiment.

[0040] FIG. 13 is a block diagram of an exemplary cloud computing environment, in accordance with illustrative embodiments.

[0041] FIG. 14 is a schematic depicting an example of a computing device 500 and a mobile computing device 550 that can be used to perform the methods described herein, and/or can be used in the systems described herein, in accordance with illustrative embodiments.

[0042] FIG. 15 is a flowchart diagram depicting a method for real-time automated detection of one or more vital signs of a subject from a video stream, in accordance with illustrative embodiments.

[0043] FIG. 16 is a schematic depicting components of a digital video signal processing pipeline, in accordance with illustrative embodiments.

[0044] FIG. 17 is a schematic depicting a pipeline for PPG calculation using POS method, in accordance with illustrative embodiments.

DETAILED DESCRIPTION

[0045] Described herein are systems and methods of real-time vital signs detection from video image data, e.g., video from a consumer camera. The availability of Internet- and other network-connected consumer electronics cameras (e.g., there are about four billion smartphone users), including television sets, facilitate widespread use of these systems and methods. In certain embodiments, the systems are platform-agnostic, non-invasive, multiplatform, touchless (no wearable device) and provide real-time vital signs detection by implementing a stabilization pipeline to extract the physiological signals from the video data. For example, the systems and methods allow any individual who has a device with a camera to approximately estimate his/her vital signs to analyze well-being and help prevent health problems. Results are comparable, and in some cases superior to, those obtained with pulse oximeters. Stabilization of illumination and motion variability allows for accurate heart rate measurement, particularly since measurement occurs in a wide variety of scenarios.

[0046] Non-contact detection of heart rate and respiration rate from video stream data has been attempted utilizing pixel intensity changes or body motion tracking. However, current approaches are poorly implemented in real-world scenarios in which ambient light and/or camera/subject motion is highly uncontrolled.

[0047] Presented herein are systems and methods that overcome these problems by implementing robust signal processing, light and motion compensation and stabilization, and a novel frequency transformation for biosignal frequency reading, which involves application of a 2D graphical spectrogram representation to ensure signal stability. The multithreaded design described herein provides efficient and fast processing of frame sequences, resulting in near real-time performance of end-user applications. This applies both for edge-device applications as well as a cloud-based scheme in which the signal processing is performed on a remote server.

[0048] A general method 100 for real-time automated detection of one or more vital signs of a subject from a video stream is shown in FIG. 15, in accordance with illustrative embodiments. The method 100 may generally include the steps of receiving a digital signal corresponding to a series of frames of the video system depicting the subject (step 102); identifying a face region of interest (ROI) and identifying a plurality of facial landmarks within the face ROI for each of the frames (step 104); extracting photoplethysmography (PPG) data from the identified facial landmarks in the frames (step 106); and identifying one or more vital signs of the subject from the PPG data in real-time (step 108).

[0049] In certain embodiments, the systems and methods described herein utilize Lomb-Scargle frequency transformation, which provides more robust performance in enduser applications. In certain embodiments, a wavelet transform is used instead of (or in addition to) Lomb-Scargle transformation.

[0050] The systems and methods described herein may use a fully non-leamed (i.e., not machine-learning based) algorithmic signal processing pipeline, or, alternatively, part or all of the pipeline may make use of neural network architecture (e.g., machine learning algorithm, e.g., an end-to-end neural network).

[0051] Unlike other systems for vital signs monitoring, the systems described herein, in preferred embodiments, do not require any special equipment or the in-person presence of medical staff during vital sign detection. For example, an individual can detect his/her heart rate notwithstanding his/her location, in a variety of scenarios that involve light and motion variability. This allows conducting fast and easy health checks. If the measurements are abnormal, a medical professional can be alerted to prevent or attend to possible critical conditions.

[0052] In certain embodiments, the pipeline described herein makes multiple face detection algorithms available and selects them adaptively depending on the device on which the system is running. For example, a DNN (Deep Neural Network) module can be used with Nvidia GPU on Windows (face detection), Histogram of Oriented Gradients (HOG) (face detection) feature method can be used for CPU, and/or MediaPipe (face detection) can be used for Mobile. Facial detection ensures correction motion compensation and further region of interest (ROI) selection.

[0053] In certain embodiments, an ensemble of preprocessing steps are conducted: motion compensation, illumination variability compensation, and skin color normalization. [0054] In certain embodiments, an ensemble of CHROM, POS, and Green algorithms (described further herein) are implemented for a final biometric signal distillation.

[0055] In certain embodiments, the systems and methods utilize Lomb-Scargle transform for the frequency-based transformation of a biometric signal and identification of a dominant frequency of a signal.

[0056] Generally, Fourier transform is used for signal analysis in frequency domain, especially for a signal close to sinusoidal form like a PPG signal. However, Fourier transform works only for equidistantly sampled signals.

[0057] In certain embodiments, the systems and methods described herein add a stabilization step based on a two-dimensional (2D) frequency representation, and ensure that the dominant component has connectivity over a sufficiently extended period. The stabilization step may be implemented at different stages of the pipeline via algorithm enhancements or the addition of steps.

[0058] For example, illumination stabilization and motion compensation allow producing robust PPG signal for heart rate (HR) estimation even with noisy data. The techniques also account for variations in skin color, age, and gender of users. The Lomb- Scargle frequency transformation and spectrogram analysis in time provide a more robust estimation of the dominant component that corresponds to a rate of a biosignal of interest. Real-time performance is further ensured by utilizing a multithreading architecture.

[0059] In certain embodiments, the systems and methods track gaze and/or measure area of ROI to ensure correct head position for the best accuracy.

[0060] The systems and methods described herein can be used in various medical applications for health and wellness monitoring. For example, embodiments described herein provide easy access to patient status in a telemedicine scenario. The embodiments may facilitate use by insurance companies for decision-making about insurance type, and the like. The embodiments may allow for regular basic health checkups for a wide variety of people, regardless of access to a doctor’s office or insurance coverage. The embodiments provide non-invasive, touchless heart rate estimation. The systems described herein may be embedded or otherwise incorporated for use in television (TV) sets, mobile devices such as smartphones, laptop computers, desktop computers, tablets, eyeglasses, smart watches, virtual reality and/or augmented reality (VR/AR) systems, consumer furniture, bathroom equipment, kitchen equipment, safety helmets (e.g., in manufacturing), and in retirement/nursing homes or other managed care scenarios. In certain embodiments, liveliness/wellbeing detection is performed with remote PPG sensing, and/or for remote drug adherence monitoring.

[0061] In certain embodiments, multiple devices (e.g., cameras) and/or sensors are used to provide multi-view heart rate estimation aggregate to improve accuracy. For example, separate video data streams from individual lenses in a multi-camera device may be used for improved vital sign detection accuracy.

Components of digital video signal processing pipeline

[0062] Detailed below are the following components of an illustrative digital video signal processing pipeline, in accordance with various embodiments of the systems and methods described herein, and summarized in FIG. 16: (i) face detection, (ii) facial landmarks detection and region of interest (ROI) selection, (iii) ambient light detection and compensation module, (iv) motion compensation module, (v) color calibration using a common reference object, and (vi) heart-rate computation pipeline, which involves (a) skin tone normalization, (b) a combination of approaches for low-dimensional signal extraction (Green algorithm, Plane Orthogonal to Skin-tone (POS) algorithm, and chrominance (CHROM) algorithm), (c) a frequency-based approach for heart-rate extraction (e.g., use of Lomb-Scargle frequency transform), and (d) signal enhancement and stabilization via 2D spectrogram analysis.

[0063] The above components of the illustrative video digital signal processing pipeline are described herein in reference to FIG. 1. FIG. 1 is a schematic of a digital signal processing (DSP) pipeline for heart rate (HR) estimation based on time-sequential frames (e.g., video image data) in which the face of an individual appears. The pipeline may be implemented as part of an SDK, a desktop application, a browser application utilizing serverclient architecture, and/or a standalone application (e.g., for mobile or desktop).

[0064] (i) Face detection. During this step, a region of interest in each of a plurality of frames of the video stream containing a human face is identified within the frame. As mentioned above, the illustrative pipeline may make multiple face detection algorithms available and selects them adaptively depending on the device on which the system is running - for example, a DNN module can be used with Nvidia GPU on Windows (face detection), Histogram of Oriented Gradients (HOG) (face detection) feature method can be used for CPU, and/or MediaPipe (face detection) can be used for Mobile. In one embodiment, a neural -network based landmark detector is used, for example, MediaPipe Face Detection, which is based on BlazeFace, a face detector tailored for mobile GPU. This is found to be fast and accurate. [0065] (ii) Facial landmarks detection and region of interest (ROI) selection.

Facial landmarks are distinct, key points on a human face that can be generalized from person to person, such as edges of lips, nose line, edges of eyes, and the like. Facial landmarks may be used for a skin region of interest (ROI) extraction step. One facial landmark scheme that may be used includes FaceMesh, a MediaPipe solution created by Google. FaceMesh first detects the face in an image, then predicts the facial landmarks for the face. Instead of repetitive face detection for each new frame of the video stream, FaceMesh tracks the detected earlier face. This leads to near real-time performance on the CPU, on different image scales. One of the advantages of FaceMesh is that is provides 468 facial landmarks, which enables accurate ROI extraction during the next steps of the pipeline. FIG. 2 is a schematic illustrating facial landmarks detected by FaceMesh, in accordance with an illustrative embodiment of the DSP pipeline.

[0066] In one example, two main candidates for ROI are compared: both cheeks and the forehead as ROI. This has been tested on 2 datasets and 57 subjects in this illustration. It was found that the forehead more frequently has some outliers, due to hair, for example, which may have an influence on the algorithm performance. Additionally, cheeks frequently occupy more pixels, which also improves algorithm performance. As a result, in this illustration, cheeks were selected to be the ROI for further HR reading.

[0067] In this illustration, different shapes and sizes of ROI on cheeks were compared for 15 subjects. FIGS. 3A, 3B, 3C, 3D, 3E, and 3F are images showing the different shapes and sizes of various ROIs on the cheeks of a subject as they appear in video of the subject. In this illustration, it was found that the most appropriate shape of ROI is a shape defined by the following landmark points detected by the MediaPipe FaceMesh solution as marked in FIG. 2: 100, 36, 206, 212, 192, 187, 117, 118 and 329, 266, 426, 432, 416, 411, 346, 347 (see FIG. 3F), with the lowest RMSE for all tested shapes. The ROI on FIG. 3A also shows good performance metrics with landmarks at: 231, 233, 198, 203, 214, 138, 137, 227, 228 and 451, 453, 420, 423, 434, 367, 366, 447, 448.

[0068] (iii) Ambient light detection and compensation module. Real-world video capturing scenarios suffer from unstable light conditions and/or combinations of different light sources that can affect the reading of biometric signals from video. In this step, multiple heuristic ideas are applied to detect and compensate for the dynamic light component in the signal. [0069] Assumption # 1. Relation between pixel value, reflectance and illumination.

From Equation 1 below, the value of the pixel at each point of an image (x,y) can be estimated in the following way:

F(x,y) = i(x,y) • r(x,y) (Equation 1)

[0070] Equation 1 is a relationship between the value of the pixel and reflectance.

Here, F(x, y) is the value of a pixel at a given coordinate position (x,y), i(x,y) is illumination, and r(x,y) is reflectance.

[0071] This equality comes from the physics of image formation in the camera, under the assumption that the object has a Lambertian surface (an ideal diffusely reflecting surface, i.e., light is reflected in all directions).

[0072] Assumption # 2. Next, it is assumed that the fluctuations in illumination intensity have a much lower frequency than reflectance. The idea here is that the shape of objects can be quite complex, and their reflectance should then have higher frequency, when the illumination is spread over the whole frame.

[0073] Using Assumptions #1 and #2 above, an efficient algorithm is created that filters out illumination, then detects whether illumination is good enough or insufficient in a particular ROI. The illumination is analyzed on the ROI that is used for HR detection. The idea is to filter out illumination using homomorphic filtering, e.g., a high-pass filter in the frequency domain. For example, the steps of such homomorphic filtering may be as follows:

Log image Fast Fourier Transform (FFT) High-pass filter inverse FFT Exponentiate image

[0074] Now, the quality of the illumination is considered to be the amount of illumination at the ROI. In other words, too much or too little illumination is undesired. Thus, under the assumption that homomorphic filtering removes illumination, the level of illumination should be detected. For example, a heuristic approach involves taking the ratio of mean of the ROI in the filtered image, and in the ROI in the original image:

Mean(image [ROI]) / Mean(homomorphic filtering image [ROI])

[0075] If this ratio is larger or smaller than a determined threshold (e.g., empirically determined threshold), then there is too much or too little illumination. [0076] Thus, in this example, the following steps are performed: (1) take log of the image, (2) perform Fourier transform on the logged image, (3) apply high pass filter, (4) perform inverse Fourier transform, (5) exponentiate the image, (6) normalize the image with respect to the original image (e.g., images usually become much darker after filtering), (7) take the ratio of mean of original image and filtered in ROIs, (8) using heuristic described above, return if the corresponding ROIs are badly illuminated.

[0077] After filtering the cropped face, and its inverse, the mean in each ROI is measured. The ratio of mean of original image to filtered/filtered inverse is taken as measure of illumination. If the ratio of original and filtered is very large, then the original ROI has too low illumination. If the ratio of inverted image and filtered of inverted is very large, then the original image has too high illumination. The threshold for this ratio was determined from tests. In this example, it was found to be unnecessary to work in RGB, since the results are very similar to the use of greyscale. Thus, greyscale images can be used, or greyscale versions of RGB images can be used. In other embodiments, RGB data is used.

[0078] This ambient light detection module serves as a user interface function. For poor light conditions, the measurements of heart rate (and/or other vital signs) are not performed, and the user receives a recommendation to find a better environment for continuing the video stream analysis. For example, FIG. 4A shows an image of the user with a graphical user interface (GUI) message indicating successful measurement of heart rate and SpO₂ vital signs for the user where illumination was identified as satisfactory. FIG. 4B shows an image of the user with a GUI message indicating unstable lighting, with a prompt to find better lighting conditions so that vital signs can be measured. This particular GUI identifies overilluminated zones (tan patch) and underilluminated zones (purple patch).

[0079] (iv) Motion compensation module. Real-world video capturing scenarios (mobile camera, laptop, in-car camera, and the like) may suffer from unstable motion conditions that can affect the reading of biometric signals from video. In this step, multiple heuristic ideas are applied to detect and compensate for motion artifacts in the signal. To this end, motion noise signal information is extracted from the body and background movements, and this component is further removed from the signal of interest.

[0080] Motion is split into four categories: mimical motion (i.e., motion of facial landmarks caused by movements of facial muscles), movement of the head, movement of the camera in hand, and movement in the background. The goal here is to create a generalized approach to motion detection that will react to relevant motion, particularly motion that affects a vital sign reading, e.g., HR reading. [0081] In this example, the solution is landmark based motion detection with a threshold. The steps are as follows: Step (1): receive coordinates of landmarks (e.g., one may use MediaPipe API as the landmark detector) on frame n and frame n+2. Step (2): using those coordinates, calculate M_n, total motion in frame n, using the following equations:

where m'_n is motion of landmark i in frame n, equal to Euclidian distance between the position of landmark frames n and n+2 over reference distance in frame n (Equation 3). The motion of landmarks should be divided by some reference distance on the face, r_n, which is given by Equation 4. See, for example, FIG. 5, which shows face landmarks (points) and reference distance (horizontal line). Differences in sizes of faces in the frame must be considered; otherwise we cannot choose a universal threshold.

[0082] Next, step (3): if motion M_n is larger than the determined threshold, the motion detection algorithm returns True. The threshold may be empirically determined based on a tested dataset, for example, videos affected by different kinds of user and/or camera motion. The detector should be able to identify and/or classify a significant portion of frames as frames with motion, where the video is affected by head movement, mimical motion, walking movements, and some frames with motion when the camera is handheld.

[0083] FIG. 6 depicts a chart showing the ratio of number of frames with detected motion to the total number of frames in a given video. The low values (close to 0.00) represent videos with almost no motion, and the high values (close to 1.00) represent videos with a significant number of frames where there is motion detected.

[0084] The motion detection module also serves a user interface function. For rapid motion conditions, the vital signs measurements are not performed, and the user receives a textual (or other format) GUI notification recommending the user remain still. For example, FIG. 7A shows an image of the user with a graphical user interface (GUI) display of successfully measured heart rate and SpCh vital signs for the user where there was no excessive motion detected (e.g., and where illumination was identified as satisfactory).

FIG. 7B shows an image of the user with a GUI message indicating excessive motion was detected, and recommending the user remain still for better vital sign detection results.

[0085] (v) Color calibration module. It may not be necessary to use color image data for all steps (see description above), but where color image data is used, the frames of the video may be color-calibrated using a common reference object (i.e., a color reference card, which may be a small card that contains calibrated sample color swatches).

[0086] (vi) Heart-rate computation pipeline, which involves (a) skin tone normalization, (b) a combination of approaches for low-dimensional signal extraction (Green algorithm, Plane Orthogonal to Skin-tone (POS) algorithm, and chrominance (CHROM) algorithm), (c) a frequency -based approach for heart-rate extraction (e.g., use of Lomb- Scargle frequency transform), and (d) signal enhancement and stabilization via 2D spectrogram analysis.

[0087] Pulsatile blood has different sensitivities to different color spectrums. It has a very large sensitivity to blue, but blue light cannot penetrate deep into the skin; it has a large sensitivity to green, and green light can achieve a certain depth beneath the skin; and it has a low sensitivity to red, although red light can penetrate deeply into the skin.

[0088] In the illustrative embodiment described here, there are three main algorithms for PPG extraction, that were tested: The “Green algorithm” is a mean of a green channel in each frame, POS projects the signal to a plane to increase dimensionality to two dimensions, and CHROM is a color difference signal. These three algorithms are described in more detail below.

[0089] 1. Green Algorithm - In order to find the variation of the PPG signal over time, only the green channel is taken from the RGB ROI. Then, the mean value of all pixels in the region is computed (except the zero values, which can shift the mean value). Also, it is necessary to compute the mean value, because the camera sensor, face, and landmark detectors are not perfect and the moving contour of the ROI must also be dealt with, due to lighting changes, motion, or stochasticity of the detectors.

[0090] 2. POS algorithm (Plane Orthogonal to Skin-tone direction in the original

RGB space), used for PPG estimation. The full pipeline for PPG calculation using the POS method can be divided into the following four steps (see FIG. 17): (I) Averaging skin pixels values and concatenating them into a vector; (II) Computing the mean over each image channel to remove DC component, i.e., detrending, normalization; and (III) Projecting obtained signal on the POS plane. On the POS plane, some areas show higher sensitivity to the signal than others. Based on the above defined color tradeoff, we can define a plane that defines a high pulsatile area on the POS plane as (3D signal to 2D signal), and (IV) Turning the 2D signal into a ID signal.

[0091] 3. CHROM algorithm - The chrominance (CHROM) signals are generated from the RGB traces with the use of a skin-tone standardized linear combination compatible with different skin colors. To produce a pulse signal that is independent of the presumed stationary color of the light source and its brightness level, each color channel is normalized. If we initially assume white light, we note that the specular reflection affects all channels by adding an identical (white light) specular fraction to their respective diffuse reflection component. This implies that we can eliminate the specular reflection component by using color difference, i.e., chrominance, signals. From three color channels, we can build two orthogonal chrominance signals, e.g., X = R - G and Y = 0.5R + 0.5G - B3. A ratio of the two may be used as a candidate rPPG algorithm. To enable correct functioning with colored light sources, skin-tone standardization was used, where the normalized skin tone for a given set is assumed to be the same for everyone under white light and coefficients are [Rs, Gs, Bs] = [0.7682, 0.5121, 0.3841],

[0092] Among these three algorithms, the CHROM algorithm demonstrated the best performance, with RMSE 4.9 and success rate of 92.5%, in comparison to the POS algorithm (RMSE 6.2 and success rate of 90.1%) and Green algorithm (RMSE 10.4 and success rate 76.5%).

[0093] (c) A frequency-based approach for heart- rate extraction (e g., use of

Lomb-Scargle frequency transform).

[0094] In this example, a Lomb-Scargle frequency transform was found to be more robust than other benchmark methods. For example, Table 1 below compares Lomb-Scargle with FFT : Table 1: Chart of mean RMSE error for Lomb-Scargle and FFT

[0095] In this example, (i) Lomb-Scargle - using PPG with possibly missing values, extracting only valid slices with non-missing values, and applying this data to Lomb-Scargle; and (ii) FFT - using PPG with possibly missing values, using cubic interpolation to restore missing values, and applying this data with interpolated values to FFT.

[0096] The Lomb-Scargle transform was observed to be more accurate than the FFT, with a smaller mean RMSE error in the estimated heart rate, especially for noisy data with motion, facial expressions, and illumination artifacts, as can be expected to occur in such diverse settings and with many different individuals.

[0097] (d) Signal enhancement and stabilization via 2D spectrogram analysis.

Different frequency trends, including periodic signals, can be observed on a spectrogram: for example, HR frequency, motion frequency, and other artifacts that make it possible to separate heart rate from other periodic signals. FIG. 8A is an example of a periodogram, and FIG. 8B is an example of a spectrogram. FIG. 8B depicts a stabilization module of the DSP pipeline via 2D spectrogram analysis, where the dashed white line shows the estimated HR signal over the time domain.

[0098] After obtaining the Lomb-Scargle periodogram, the frequency range is cut off to [40; 200] bpm (beats per minute). The stabilization algorithm searches for the maximum power density peak in the periodogram correlating to the previously found estimates over a finite time period.

[0099] Subroutine_l implicitly finds the best SNR (signal to noise ratio) candidate signal, exploiting a fact that noise signal is short-lived and has a large variance of peak amplitude. It draws an analogy with a PID controller which uses an integral coefficient to dampen prediction fluctuations and to make estimated signals smoother. The routine slices a frequency range of [40; 200] bpm into narrow frequency bins and sums up bin-averaged values over a chosen rolling window period. It returns a frequency value corresponding to the middle point of a bin with the maximum sum which is the strongest signal over the chosen time. [0100] Subroutine_2 minimizes false positive estimates returned by Subroutine_l. False positives may occur due to a strong long-lived noise that produces rapid variations in heart rate measurements. Subroutine_2 keeps track of historical values and assigns a confidence measure to candidate estimates. The routine prevents large changes of the estimated heart rate by rejecting candidates which differ from the previously found value more than a threshold level. In a case of rejection, it returns a previously found value. The routine also returns an estimate confidence measure. Hence, the end-user of the SDK may use confidence to reach the desired accuracy by balancing between number of measurements and confidence threshold.

[0101] Subroutine_3 further refines an estimate returned by Subroutine_2 to get the most precise value. It returns a frequency of the maximum amplitude peak within a narrow frequency range.

[0102] The following pseudo-code statements summarize the inputs and outputs of the Lomb-Scargle periodogram, Subroutine !, Subroutine , and Subgroutine_3: freq_arr , power_arr = lomb_scargle ( time_arr , ppg_arr) hr_est = Subroutine_l (f req_arr , power_arr) hr_est , conf = Subroutine_2 (hr_est) hr_est = Subroutine_3 (hr_est) where f req_arr and power_arr are frequencies and power values of the evaluated Lomb-Scargle periodogram; time_arr is an array of time stamps; ppg_aar is an array of PPG samples; hr_est is estimated heart rate frequency; and is conf is confidence value. [0103] FIG. 9 is a schematic depicting illustrative multithread decomposition for the various modules described herein. There is a ThreadPool (in which frames are entered into a task queue and distributed into multiple parallel threads for face detection and landmark detection), an Image processing Thread, a Heart Rate Calculation Thread, and a Client Thread, with Task Queues as shown, callbacks to client when ready, and display of successfully computed HR value (or other vital sign(s)) upon request from client.

[0104] FIG. 10 is a schematic depicting an illustrative desktop application architecture that works in conjunction with an SDK (indicated in FIG. 10 as “SDK”). The desktop application receives image data input (camera, video file, screen capture), obtains frame data with associated time stamps, and delivers such to the SDK which runs the modules described herein (face detector, landmarks detector, PPG extractor, and pulse extractor), then delivers such to the desktop application. [0105] FIG. 11 is a schematic depicting an illustrative browser application utilizing server-client architecture, performing indicated steps in the cloud, in accordance with an illustrative embodiment. The browser application receives video input from a camera and/or media recorder, which is then transmitted through an HTTP server to cloud computing resources. The video is decoded, and each frame is processed in the SDK as described above. The HR is calculated in the cloud, and is then transmitted from the cloud via the HTTP server to the browser application, and can be exported as a data file.

[0106] FIG. 12 is a schematic depicting an illustrative standalone mobile application architecture that works in conjunction with an SDK, in accordance with an illustrative embodiment. Images from the Camera are processed in the mobile application, then frames with timestamps are processed as input by the SDK, which returns determined vital sign(s) (e.g., heart rate), to the mobile app. The mobile app may include a heart rate history chart, which is updated as new data is received. The results may then be exported from the mobile app, for example as a CSV data file, or other data format.

Network environment, computing devices, and software for use in various embodiments [0107] As shown in FIG. 13, an implementation of a network environment 400 for use in providing systems, methods, and architectures as described herein is shown and described. In brief overview, referring now to FIG. 13, a block diagram of an exemplary cloud computing environment 400 is shown and described. The cloud computing environment 400 may include one or more resource providers 402a, 402b, 402c (collectively, 402). Each resource provider 402 may include computing resources. In some implementations, computing resources may include any hardware and/or software used to process data. For example, computing resources may include hardware and/or software capable of executing algorithms, computer programs, and/or computer applications. In some implementations, exemplary computing resources may include application servers and/or databases with storage and retrieval capabilities. Each resource provider 402 may be connected to any other resource provider 402 in the cloud computing environment 400. In some implementations, the resource providers 402 may be connected over a computer network 408. Each resource provider 402 may be connected to one or more computing device 404a, 404b, 404c (collectively, 404), over the computer network 408.

[0108] The cloud computing environment 400 may include a resource manager 406. The resource manager 406 may be connected to the resource providers 402 and the computing devices 404 over the computer network 408. In some implementations, the resource manager 406 may facilitate the provision of computing resources by one or more resource providers 402 to one or more computing devices 404. The resource manager 406 may receive a request for a computing resource from a particular computing device 404. The resource manager 406 may identify one or more resource providers 402 capable of providing the computing resource requested by the computing device 404. The resource manager 406 may select a resource provider 402 to provide the computing resource. The resource manager 406 may facilitate a connection between the resource provider 402 and a particular computing device 404. In some implementations, the resource manager 406 may establish a connection between a particular resource provider 402 and a particular computing device 404. In some implementations, the resource manager 406 may redirect a particular computing device 404 to a particular resource provider 402 with the requested computing resource.

[0109] FIG. 14 shows an example of a computing device 500 and a mobile computing device 550 that can be used to implement the techniques described in this disclosure. The computing device 500 is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The mobile computing device 550 is intended to represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, electronic tablets, and other similar computing devices. The components shown here, their connections and relationships, and their functions, are meant to be examples only, and are not meant to be limiting.

[0110] The computing device 500 includes a processor 502, a memory 504, a storage device 506, a high-speed interface 508 connecting to the memory 504 and multiple highspeed expansion ports 510, and a low-speed interface 512 connecting to a low-speed expansion port 514 and the storage device 506. Each of the processor 502, the memory 504, the storage device 506, the high-speed interface 508, the high-speed expansion ports 510, and the low-speed interface 512, are interconnected using various busses, and may be mounted on a common motherboard or in other manners as appropriate. The processor 502 can process instructions for execution within the computing device 500, including instructions stored in the memory 504 or on the storage device 506 to display graphical information for a GUI on an external input/output device, such as a display 516 coupled to the high-speed interface 508. In other implementations, multiple processors and/or multiple buses may be used, as appropriate, along with multiple memories and types of memory. Also, multiple computing devices may be connected, with each device providing portions of the necessary operations (e.g., as a server bank, a group of blade servers, or a multi-processor system). Thus, as the term is used herein, where a plurality of functions are described as being performed by “a processor”, this encompasses embodiments wherein the plurality of functions are performed by any number of processors (one or more) of any number of computing devices (one or more). Furthermore, where a function is described as being performed by “a processor”, this encompasses embodiments wherein the function is performed by any number of processors (one or more) of any number of computing devices (one or more) (e.g., in a distributed computing system).

[0111] The memory 504 stores information within the computing device 500. In some implementations, the memory 504 is a volatile memory unit or units. In some implementations, the memory 504 is a non-volatile memory unit or units. The memory 504 may also be another form of computer-readable medium, such as a magnetic or optical disk. [0112] The storage device 506 is capable of providing mass storage for the computing device 500. In some implementations, the storage device 506 may be or contain a computer- readable medium, such as a floppy disk device, a hard disk device, an optical disk device, or a tape device, a flash memory or other similar solid state memory device, or an array of devices, including devices in a storage area network or other configurations. Instructions can be stored in an information carrier. The instructions, when executed by one or more processing devices (for example, processor 502), perform one or more methods, such as those described above. The instructions can also be stored by one or more storage devices such as computer- or machine-readable mediums (for example, the memory 504, the storage device 506, or memory on the processor 502).

[0113] The high-speed interface 508 manages bandwidth-intensive operations for the computing device 500, while the low-speed interface 512 manages lower bandwidthintensive operations. Such allocation of functions is an example only. In some implementations, the high-speed interface 508 is coupled to the memory 504, the display 516 (e.g., through a graphics processor or accelerator), and to the high-speed expansion ports 510, which may accept various expansion cards (not shown). In the implementation, the low- speed interface 512 is coupled to the storage device 506 and the low-speed expansion port 514. The low-speed expansion port 514, which may include various communication ports (e.g., USB, Bluetooth®, Ethernet, wireless Ethernet) may be coupled to one or more input/output devices, such as a keyboard, a pointing device, a scanner, or a networking device such as a switch or router, e.g., through a network adapter. [0114] The computing device 500 may be implemented in a number of different forms, as shown in the figure. For example, it may be implemented as a standard server 520, or multiple times in a group of such servers. In addition, it may be implemented in a personal computer such as a laptop computer 522. It may also be implemented as part of a rack server system 524. Alternatively, components from the computing device 500 may be combined with other components in a mobile device (not shown), such as a mobile computing device 550. Each of such devices may contain one or more of the computing device 500 and the mobile computing device 550, and an entire system may be made up of multiple computing devices communicating with each other.

[0115] The mobile computing device 550 includes a processor 552, a memory 564, an input/output device such as a display 554, a communication interface 566, and a transceiver 568, among other components. The mobile computing device 550 may also be provided with a storage device, such as a micro-drive or other device, to provide additional storage. Each of the processor 552, the memory 564, the display 554, the communication interface 566, and the transceiver 568, are interconnected using various buses, and several of the components may be mounted on a common motherboard or in other manners as appropriate.

[0116] The processor 552 can execute instructions within the mobile computing device 550, including instructions stored in the memory 564. The processor 552 may be implemented as a chipset of chips that include separate and multiple analog and digital processors. The processor 552 may provide, for example, for coordination of the other components of the mobile computing device 550, such as control of user interfaces, applications run by the mobile computing device 550, and wireless communication by the mobile computing device 550.

[0117] The processor 552 may communicate with a user through a control interface 558 and a display interface 556 coupled to the display 554. The display 554 may be, for example, a TFT (Thin-Film-Transistor Liquid Crystal Display) display or an OLED (Organic Light Emitting Diode) display, or other appropriate display technology. The display interface 556 may comprise appropriate circuitry for driving the display 554 to present graphical and other information to a user. The control interface 558 may receive commands from a user and convert them for submission to the processor 552. In addition, an external interface 562 may provide communication with the processor 552, so as to enable near area communication of the mobile computing device 550 with other devices. The external interface 562 may provide, for example, for wired communication in some implementations, or for wireless communication in other implementations, and multiple interfaces may also be used. [0118] The memory 564 stores information within the mobile computing device 550. The memory 564 can be implemented as one or more of a computer-readable medium or media, a volatile memory unit or units, or a non-volatile memory unit or units. An expansion memory 574 may also be provided and connected to the mobile computing device 550 through an expansion interface 572, which may include, for example, a SIMM (Single In Line Memory Module) card interface. The expansion memory 574 may provide extra storage space for the mobile computing device 550, or may also store applications or other information for the mobile computing device 550. Specifically, the expansion memory 574 may include instructions to carry out or supplement the processes described above, and may include secure information also. Thus, for example, the expansion memory 574 may be provide as a security module for the mobile computing device 550, and may be programmed with instructions that permit secure use of the mobile computing device 550. In addition, secure applications may be provided via the SIMM cards, along with additional information, such as placing identifying information on the SIMM card in a non-hackable manner.

[0119] The memory may include, for example, flash memory and/or NVRAM memory (non-volatile random access memory), as discussed below. In some implementations, instructions are stored in an information carrier. In some implementations, the instructions, when executed by one or more processing devices (for example, processor 552), perform one or more methods, such as those described above. The instructions can also be stored by one or more storage devices, such as one or more computer- or machine- readable mediums (for example, the memory 564, the expansion memory 574, or memory on the processor 552). In some implementations, the instructions can be received in a propagated signal, for example, over the transceiver 568 or the external interface 562.

[0120] The mobile computing device 550 may communicate wirelessly through the communication interface 566, which may include digital signal processing circuitry where necessary. The communication interface 566 may provide for communications under various modes or protocols, such as GSM voice calls (Global System for Mobile communications), SMS (Short Message Service), EMS (Enhanced Messaging Service), or MMS messaging (Multimedia Messaging Service), CDMA (code division multiple access), TDMA (time division multiple access), PDC (Personal Digital Cellular), WCDMA (Wideband Code Division Multiple Access), CDMA2000, or GPRS (General Packet Radio Service), among others. Such communication may occur, for example, through the transceiver 568 using a radio-frequency. In addition, short-range communication may occur, such as using a Bluetooth®, Wi-Fi™, or other such transceiver (not shown). In addition, a GPS (Global Positioning System) receiver module 570 may provide additional navigation- and location- related wireless data to the mobile computing device 550, which may be used as appropriate by applications running on the mobile computing device 550.

[0121] The mobile computing device 550 may also communicate audibly using an audio codec 560, which may receive spoken information from a user and convert it to usable digital information. The audio codec 560 may likewise generate audible sound for a user, such as through a speaker, e.g., in a handset of the mobile computing device 550. Such sound may include sound from voice telephone calls, may include recorded sound (e.g., voice messages, music files, etc.) and may also include sound generated by applications operating on the mobile computing device 550.

[0122] The mobile computing device 550 may be implemented in a number of different forms, as shown in the figure. For example, it may be implemented as a cellular telephone 580. It may also be implemented as part of a smartphone 582, personal digital assistant, or other similar mobile device.

[0123] Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, specially designed ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various implementations can include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device.

[0124] These computer programs (also known as programs, software, software applications or code) include machine instructions for a programmable processor, and can be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the terms machine-readable medium and computer-readable medium refer to any computer program product, apparatus and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine- readable medium that receives machine instructions as a machine-readable signal. The term machine-readable signal refers to any signal used to provide machine instructions and/or data to a programmable processor.

[0125] The computer programs may include software that implement machine learning techniques, e.g., artificial neural networks (ANNs); e.g., convolutional neural networks (CNNs), random forest, decision trees, support vector machines, and the like, in order to determine, for a given input, one or more output values. In some embodiments, machine learning modules implementing machine learning techniques are trained, for example using curated and/or manually annotated datasets. Such training may be used to determine various parameters of machine learning algorithms implemented by a machine learning module, such as weights associated with layers in neural networks. In some embodiments, once a machine learning module is trained, e.g., to accomplish a specific task such as determining scoring metrics as described herein, values of determined parameters are fixed and the (e.g., unchanging, static) machine learning module is used to process new data (e.g., different from the training data) and accomplish its trained task without further updates to its parameters (e.g., the machine learning module does not receive feedback and/or updates). In some embodiments, machine learning modules may receive feedback, e.g., based on user review of accuracy, and such feedback may be used as additional training data, for example to dynamically update the machine learning module. In some embodiments, a trained machine learning module is a classification algorithm with adjustable and/or fixed (e.g., locked) parameters, e.g., a random forest classifier. In some embodiments, two or more machine learning modules may be combined and implemented as a single module and/or a single software application. In some embodiments, two or more machine learning modules may also be implemented separately, e.g., as separate software applications. A machine learning module may be software and/or hardware. For example, a machine learning module may be implemented entirely as software, or certain functions of a ANN module may be carried out via specialized hardware (e.g., via an application specific integrated circuit (ASIC), field programmable gate arrays (FPGAs), and the like).

[0126] As used herein, the terms “images”, “video”, “video stream”, and the like refer to the image data (e.g., pixel intensity values, pixel color component values (e.g., RGB, and the like), which are used to render a graphical image or sequential series of graphical images to be displayed (e.g., video). In certain embodiments, the image data received from a camera or other digital image recording device is processed as two-dimensional (2D) data. In other embodiments, the received image data is converted or mapped to three-dimensional (and/or two-and-a-half-dimensional) positions of a model. In other embodiments, the received image data is received as three-dimensional (3D) or two-and-a-half dimensional data (e.g., no conversion or mapping required).

[0127] To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) or LED (light-emitting diode) monitor) for displaying information to the user and a keyboard and a pointing device (e.g., a mouse or a trackball) by which the user can provide input to the computer. Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form, including acoustic, speech, or tactile input.

[0128] The systems and techniques described here can be implemented in a computing system that includes a back end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front end component (e.g., a client computer having a graphical user interface or a Web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back end, middleware, or front end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network (LAN), a wide area network (WAN), and the Internet.

[0129] The computing system can include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.

[0130] In some implementations, the software modules described herein can be separated, combined or incorporated into single or combined modules. The modules depicted in the figures are not intended to limit the systems described herein to the software architectures shown therein.

[0131] Elements of different implementations described herein may be combined to form other implementations not specifically set forth above. Elements may be left out of the processes, computer programs, databases, etc. described herein without adversely affecting their operation. In addition, the logic flows depicted in the figures do not require the particular order shown, or sequential order, to achieve desirable results. Various separate elements may be combined into one or more individual elements to perform the functions described herein. In view of the structure, functions and apparatus of the systems and methods described here, in some implementations.

[0132] It is contemplated that systems, architectures, devices, methods, and processes of the claimed invention encompass variations and adaptations developed using information from the embodiments described herein. Adaptation and/or modification of the systems, architectures, devices, methods, and processes described herein may be performed, as contemplated by this description.

[0133] Throughout the description, where articles, devices, systems, and architectures are described as having, including, or comprising specific components, or where processes and methods are described as having, including, or comprising specific steps, it is contemplated that, additionally, there are articles, devices, systems, and architectures of the present invention that consist essentially of, or consist of, the recited components, and that there are processes and methods according to the present invention that consist essentially of, or consist of, the recited processing steps.

[0134] It should be understood that the order of steps or order for performing certain action is immaterial so long as the invention remains operable. Moreover, two or more steps or actions may be conducted simultaneously.

[0135] The mention herein of any publication, for example, in the Background section, is not an admission that the publication serves as prior art with respect to any of the claims presented herein. The Background section is presented for purposes of clarity and is not meant as a description of prior art with respect to any claim.

[0136] Headers are provided for the convenience of the reader - the presence and/or placement of a header is not intended to limit the scope of the subject matter described herein.

[0137] As used herein, the term “mimical motion” refers to motion of facial landmarks caused by movements of facial movements.

[0138] As used herein, the term “photoplethysmography” refers to an optical method for measuring and/or monitoring heart rate. The related term plethysmography refers to measuring changes in volume in parts of the body.

[0139] As used herein, the term “subject” refers to a human individual whose face is captured by a camera to form a video signal for analysis of the subject’s vital signs. In some embodiments, a subject may be a patient of a healthcare provider or practitioner. In some embodiments, the subject may be suffering from or may be susceptible to a disease, disorder, or condition. In some embodiments, the subject is the same individual as the user (see definition below), while in other embodiments, the subject is not the same individual as the user (see definition below). In certain embodiments, the camera is an imaging device that is part of a television (TV) set, a mobile device such as a smartphone, a laptop computer, a desktop computer, an electronic tablet, eyeglasses, a smart watch, a virtual reality and/or augmented reality (VR/AR) system, consumer furniture, bathroom equipment, kitchen equipment, or a safety helmet (e.g., in manufacturing), for example.

[0140] As used herein, the term “user” refers to an individual who performs, operates, and/or interacts with any of the systems and methods described in the present embodiments. In some embodiments, the user may take actions such as operating a video camera, positioning a camera, providing inputs or information to a computing device, receiving instructions or information from the computing device, interacting with a user interface, and/or taking further actions based on instructions from the computing device.

Claims

CLAIMS What is claimed is:

1. A system for real-time (including near-real-time) detection of one or more vital signs of a subject from a video stream depicting the subject, the system comprising: a processor of a computing device; and a memory having instructions stored thereon, wherein the instructions, when executed by the processor, cause the processor to: receive a digital signal corresponding to a series of frames of the video stream depicting the subject; for each of a plurality of the frames, identify a face region of interest (ROI) and identify a plurality of facial landmarks within the face ROI; extract photoplethysmography (PPG) data from the identified facial landmarks in the frames; and identify, in real-time (or near real-time), the one or more vital signs of the subject from the PPG data.

2. The system of claim 1, wherein the one or more vital signs comprise at least one member selected from the group consisting of a heart rate, a pulse, a peripheral capillary oxygen saturation (SpO2) level, a respiration rate, an emotional state, and a signal derived from one or more of the above.

3. The system of claim 1, wherein the video stream comprises image data from an RGB camera.

4. The system of claim 1, wherein the instructions to identify a face region of interest (ROI) comprises instructions to apply a face bounding box.

5. The system of claim 1, wherein a plurality of facial landmarks within the face ROI comprise a forehead region, a cheeks region, and an adaptive region.

6. The system of claim 1, wherein photoplethysmography (PPG) data is a lowdimensional signal that comprises one or more of (i), (ii), and (iii):

(i) Green: a mean of a green channel in a plurality of frames, (ii) CHROM: a color difference signal, assuming the subject has a standardized skintone, and

(iii) POS: a projection of the signal to a 2D plane orthogonal to the skin-tone in a temporally normalized RGB space.

7. The system of claim 1, wherein the instructions to identify the one or more vital signs from the PPG data comprises instructions to use a frequency transform and/or a Lomb- Scargle frequency transform.

8. The system of claim 1, wherein the instructions further comprise instructions to track, in real-time (or near real-time), the one or more vital signs over a time period contemporaneous with at least a portion of that depicted in the video stream.

9. The system of claim 1, wherein the instructions cause the processor to identify the one or more vital signs from the PPG data by performing a data signal stabilization based on a 2D frequency representation, ensuring a dominant component has connectivity over a sufficiently extended period of time.

10. The system of claim 9, wherein the 2D frequency representation comprises a 2D graphical spectrogram representation of the PPG data.

11. The system of any one of the preceding claims, wherein the instructions cause the processor to apply (i) light compensation and/or stabilization, and (ii) motion compensation and/or stabilization to the plurality of frames of the video stream.

12. The system of any one of the preceding claims, wherein the processor utilizes a multithreading architecture to perform parallel processing of data from different sequences of frames of the video stream.

13. The system of any one of the preceding claims, wherein the instructions further cause the processor to identify a poor light condition from the frames of the video stream and, upon identification of said poor light condition, render a prompt to a user via a user interface indicating said poor light condition.

14. The system of claim 13, wherein the instructions render a prompt to the user via the user interface recommending the user find a better lit environment before proceeding.

15. The system of any one of the preceding claims, wherein the instructions further cause the processor to identify excessive motion of the subject from the frames of the video stream which cannot be accurately compensated for and, upon identification of said excessive motion, render a prompt to a user via a user interface indicating said motion.

16. The system of claim 15, wherein the instructions cause the processor to render a prompt to the user via the user interface recommending the subject to stay still.

17. The system of any of the preceding claims, wherein the subject is the user.

18. The system of any of the preceding claims, wherein the subject is not the user.

19. A method for real-time (including near-real-time) automated detection of one or more vital signs of a subject from a video stream depicting the subject, the method comprising: receiving, by a processor of a computing device, a digital signal corresponding to a series of frames of the video stream depicting the subject; for each of a plurality of the frames, identifying, by the processor, a face region of interest (ROI) and identifying a plurality of facial landmarks within the face ROI; extracting, by the processor, photoplethysmography (PPG) data from the identified facial landmarks in the frames; and identifying, by the processor, in real-time (or near real-time), the one or more vital signs of the subject from the PPG data.

20. The method of claim 19, wherein the one or more vital signs comprise at least one member selected from the group consisting of a heart rate, a pulse, a peripheral capillary oxygen saturation (SpCh), a respiration rate, an emotional state, and a signal derived from one or more of the above.

21. The method of claim 19, wherein the video stream comprises image data from an

RGB camera.

22. The method of claim 19, wherein identifying a face region of interest (ROI) comprises applying a face bounding box.

23. The method of claim 19, wherein a plurality of facial landmarks within the face ROI comprises a forehead region, a cheeks region, and an adaptive region.

24. The method of claim 19, wherein photoplethysmography (PPG) data from the identified facial landmarks in the frames is a low-dimensional signal that comprises one or more of (i), (ii), and (iii) as follows:

(i) Green: a mean of a green channel in a plurality of frames,

(ii) CHROM: a color difference signal, assuming the user has a standardized skintone, and

25. The method of claim 19, wherein identifying the one or more vital signs from the PPG data comprises using a frequency transform and/or a Lomb-Scargle frequency transform.

26. The method of claim 19, the method further comprising tracking in real-time or near real-time the one or more vital signs over a time period contemporaneous with at least a portion of that depicted in the video stream.

27. The method of claim 19, wherein identifying the one or more vital signs from the PPG data comprises performing a data signal stabilization based on a 2D frequency representation, ensuring a dominant component has connectivity over a sufficiently extended period of time.

28. The method of claim 27, wherein the 2D frequency representation comprises a 2D graphical spectrogram representation of the PPG data

29. The method of any one of claims 19 to 28, comprising applying, by the processor, (i) light compensation and/or stabilization, and (ii) motion compensation and/or stabilization to the plurality of frames of the video stream.

30. The method of any one of claims 19 to 29, wherein the processor utilizes a multithreading architecture to perform parallel processing of data from different sequences of frames of the video stream.

31. The method of any one of claims 19 to 30, comprising identifying, by the processor, a poor light condition from the frames of the video stream and, upon identification of said poor light condition, rendering a prompt to a user via a user interface indicating said poor light condition.

32. The method of claim 31, the method comprising rendering a prompt to the user via the user interface recommending the user find a better lit environment before proceeding.

33. The method of any one of claims 19 to 32, comprising identifying, by the processor, excessive motion of the subject from the frames of the video stream which cannot be accurately compensated for and, upon identification of said excessive motion, rendering a prompt to a user via a user interface indicating said motion.

34. The method of claim 33, the method further comprising rendering a prompt to the user via the user interface recommending the subject to stay still.

35. The method of any one of claims 19 to 34, wherein the subject is the user.

36. The method of any one of claims 19 to 34, wherein the subject is not the user.