CN111597891B - Heart rate detection method based on multi-scale video - Google Patents
- Publication number
- CN111597891B (application CN202010285626.7A)
- Authority
- CN
- China
- Prior art keywords
- scale
- heart rate
- signal
- roi
- video
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
-
- A—HUMAN NECESSITIES
- A61—MEDICAL OR VETERINARY SCIENCE; HYGIENE
- A61B—DIAGNOSIS; SURGERY; IDENTIFICATION
- A61B5/00—Measuring for diagnostic purposes; Identification of persons
- A61B5/02—Detecting, measuring or recording pulse, heart rate, blood pressure or blood flow; Combined pulse/heart-rate/blood pressure determination; Evaluating a cardiovascular condition not otherwise provided for, e.g. using combinations of techniques provided for in this group with electrocardiography or electroauscultation; Heart catheters for measuring blood pressure
- A61B5/024—Detecting, measuring or recording pulse rate or heart rate
- A61B5/02416—Detecting, measuring or recording pulse rate or heart rate using photoplethysmograph signals, e.g. generated by infrared radiation
-
- A—HUMAN NECESSITIES
- A61—MEDICAL OR VETERINARY SCIENCE; HYGIENE
- A61B—DIAGNOSIS; SURGERY; IDENTIFICATION
- A61B5/00—Measuring for diagnostic purposes; Identification of persons
- A61B5/72—Signal processing specially adapted for physiological signals or for diagnostic purposes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T3/00—Geometric image transformations in the plane of the image
- G06T3/40—Scaling of whole images or parts thereof, e.g. expanding or contracting
- G06T3/4084—Scaling of whole images or parts thereof, e.g. expanding or contracting in the transform domain, e.g. fast Fourier transform [FFT] domain scaling
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/25—Determination of region of interest [ROI] or a volume of interest [VOI]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/56—Extraction of image or video features relating to colour
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/161—Detection; Localisation; Normalisation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/168—Feature extraction; Face representation
- G06V40/171—Local features and components; Facial parts ; Occluding parts, e.g. glasses; Geometrical relationships
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/30—Subject of image; Context of image processing
- G06T2207/30004—Biomedical image processing
- G06T2207/30101—Blood vessel; Artery; Vein; Vascular
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/30—Subject of image; Context of image processing
- G06T2207/30241—Trajectory
Abstract
A heart rate detection method based on multi-scale video comprises the following steps. Step 1, build a video pyramid: starting from the original tracking box, the size of the region of interest (ROI) is reduced for the lower levels on the one hand and enlarged for the upper level on the other. Step 2, blood volume pulse (BVP) signal extraction: the BVP signal is extracted from each scale channel; the multi-scale signal fusion algorithm is a generalization and refinement of the plane-orthogonal-to-skin (POS) method, and when the pyramid has only a single level, the proposed heart rate extraction method reduces to POS. Step 3, multi-scale signal fusion: the heart rate signal features of the multiple scale channels are fused by a convex combination with a Gaussian prior, and the heart rate value is finally obtained by signal processing of the fused multi-scale signal. The invention extracts rich heart rate features at different scales of the video and fuses them, thereby improving heart rate detection accuracy.
Description
Technical Field
The invention relates to the fields of video heart rate detection, computer vision and signal processing.
Background
In recent years, remote photoplethysmography (rPPG), based on optical and physiological principles, has developed rapidly. It measures the blood volume pulse (BVP) and heart rate in a non-contact manner and has a very wide range of applications. Under visible light, rPPG can measure heart rate with a consumer-grade digital camera, which broadens the application range of pulse measurement. The selection of a facial region of interest (ROI) is a key issue for such a system, as the location of the ROI directly affects the quality of the measured signal. Existing studies indicate that the cheek, lip, and chin regions of the face contain more abundant capillaries and yield stronger pulse signals than other regions. However, areas of high pulsation intensity are not necessarily suitable for raw rPPG signal extraction, because they may be disturbed by non-rigid motion such as blinking, speaking, and smiling. Existing video heart rate detection methods all extract the heart rate signal from a single region of interest, yet the heart rate signal features available in a single ROI are inherently limited.
Disclosure of Invention
To overcome the shortcomings of the prior art, the invention provides a heart rate detection method based on multi-scale video. Its basic idea is to extract heart rate signal features at different scales of the video and fuse them, thereby improving heart rate detection accuracy.
The invention adopts the technical scheme that:
a heart rate detection method based on multi-scale video, the method comprising the steps of:
step 1, establishing a video pyramid
Starting from the original tracking box, the size of the region of interest (ROI) is, on the one hand, reduced level by level and, on the other hand, enlarged by one level;
step 2, blood volume pulse BVP Signal extraction
The blood volume pulse signal is extracted from each scale channel. The multi-scale signal fusion algorithm is a generalization and refinement of the plane-orthogonal-to-skin method (POS): when the pyramid contains only a single level, the proposed heart rate extraction method reduces to POS;
step 3, multi-scale signal fusion
The heart rate signal features of the multiple scale channels are fused by a convex combination with a Gaussian prior, and the heart rate value is finally obtained by signal processing of the fused multi-scale signal.
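The patent leaves this final signal-processing step unspecified. A common concrete choice, shown here only as an illustrative assumption (the function name and band limits are not from the patent), is to take the dominant spectral peak of the fused BVP signal inside a plausible pulse band (roughly 0.7-4 Hz, i.e. 42-240 bpm):

```python
import numpy as np

def estimate_heart_rate(bvp, fs, lo=0.7, hi=4.0):
    """Estimate heart rate (bpm) from a fused BVP signal by locating the
    dominant spectral peak inside the pulse band [lo, hi] Hz."""
    bvp = np.asarray(bvp, dtype=float)
    bvp = bvp - bvp.mean()                       # remove DC component
    spectrum = np.abs(np.fft.rfft(bvp))          # one-sided magnitude spectrum
    freqs = np.fft.rfftfreq(len(bvp), d=1.0 / fs)
    band = (freqs >= lo) & (freqs <= hi)         # restrict to pulse band
    peak = freqs[band][np.argmax(spectrum[band])]
    return 60.0 * peak                           # Hz -> beats per minute
```

For example, a 10-second clip sampled at 30 fps containing a clean 1.2 Hz pulse component yields 72 bpm.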
Further, in step 1 the key to building the video pyramid is the multi-scale facial region of interest. Let ω_l, h_l be the width and height of the level-l facial ROI; the multi-scale facial ROI is then defined as:

(ω_l, h_l) = (ω_0 / 2^l, h_0 / 2^l), l = -1, 0, 1, … (1)

where l indexes the pyramid levels and ω_0, h_0 are the width and height of the level-0 ROI. The facial ROIs of all scales share the same center point (C_x, C_y), and the level-0 ROI is defined as the bounding box tightly enclosing the face contour, which can be obtained with common face detectors and trackers. To construct the multi-scale facial ROI, on the one hand the initial ROI is halved step by step, i.e. l = 1, 2, 3, …; these ROIs cover only skin areas and involve no background pixels, and because they cover different areas of the face they exhibit correspondingly different color variations. On the other hand, the initial ROI is enlarged, i.e. l = -1, so that the pixel gray levels of the moving background in the video are taken into account when the signal extraction algorithm is applied;
the pixels in the facial multiscale ROI are converted into the original rpg trajectory by spatial averaging, calculated by:
the number of pyramid levels in the formula, i= -1,0,1 …, where I c (x, y, t) represents the gray scale of the pixel of the t-th frame with coordinates (x, y), c ε { R, G, B } represents the color channel, R l (t) represents a region of interest ROI of the t-th frame of the first layer, area (R) l (t)) is denoted as R l Total number of pixels in (t). The ROI tracking frame of each frame of the video is determined by the face tracker, and the original rpg trajectory is obtained by averaging the pixel gray levels I l,C (t) is calculated by linking to the whole t and is denoted as I l (t)。
Still further, in step 2 the blood volume pulse (BVP) signal of each pyramid level is extracted from its remote-photoplethysmography (rPPG) trajectory, and with the help of the facial multi-scale ROI the motion artifacts are partially separated, as follows:

The plane-orthogonal-to-skin method (POS) is used. It defines a projection plane orthogonal to the skin-tone vector [1, 1, 1]^T to eliminate the dependence on skin tone. To separate the BVP signal from motion artifacts, the raw trajectory is projected onto two vectors spanning this plane, and signal processing by α-tuning finally yields the BVP signal. POS is applied to the raw tracking trajectory I_l(t) of every level of the video pyramid; the level index l is therefore temporarily omitted in the following formulas, and the trajectory is simply written I(t);
the original tracking is first processed by a time normalization,
I n (t)=N·I(t) (3)
wherein the method comprises the steps ofIs a diagonal matrix whose ith diagonal gives the inverse of the ith row mean of I, namely:
N ii =1/μ(I i ) (4)
the time-normalized trajectory is then projected onto two vectors defined by a projection matrix, P p =[0,1,-1;-2,1,1]Wherein each row represents a mutually orthogonal projection axis, the projection signal is expressed as:
S 1 (t)=I nG (t)-I nB (t) (5)
S 2 (t)=I nG (t)+I nB (t)-2I nR (t) (6)
in order to separate the specular reflection and the pulse signal component, S 1 (t) and S 2 (t) alpha-tuning treatment is required,
x(t)=S 1 (t)+αS 2 (t) (7)
wherein α=σ (S 1 (t))/σ(S 2 (t)), while σ (·) represents the standard deviation, x (t) is the extracted BVP signal, also known as the POS feature, applying a POS algorithm to each level of the video pyramid results in an L-level BVP signal, where L represents the total number of layers of the video pyramid, and x is l (t) POS features x (t) extracted for the first layer;
linear POS operation can only extract limited pure BVP signals in all sports and recording environments. While with the help of the multiscale facial ROI, motion artifacts are partially separated.
Furthermore, the per-level blood volume pulse (BVP) signals obtained in steps 1 and 2 are fused by a convex combination with a Gaussian prior, as follows:

The final pulse signal is computed by fusing the candidate POS features extracted from the multi-scale trajectories. Since the candidate POS features are assumed to be complementary, the signal fusion problem is cast as a feature combination rather than a feature selection, for which a convex combination is used:

x(t) = Σ_l λ_l · x_l(t) (8)

where λ_l is the weight of the level-l scale and the level weights sum to 1, i.e. Σ_l λ_l = 1. The next, most critical step is to determine the weight of each level: POS features of different levels carry different heart rate energies, and larger weights should be assigned to the levels with stronger heart rate energy. The weights are determined with a Gaussian prior,

λ_l ∝ exp(-(l - μ_0)² / (2σ_0²)), l = -1, 0, 1, 2, 3, … (9)

where μ_0 and σ_0 are the center and standard deviation of the Gaussian over the level index. The Gaussian prior reflects the observation that the pulse intensity in the middle levels is greater than in the lower and higher levels;

For long detection videos, the windowed outputs are concatenated to obtain a long-term heart rate signal: given a video of length N, the sequence is first split into segments of length T, the proposed algorithm is applied to obtain the windowed outputs, and the overlap-add method yields the final heart rate output.
The beneficial effect of the invention is improved heart rate detection accuracy.
Detailed Description
The present invention will be described in further detail below.
A heart rate detection method based on multi-scale video, the method comprising the steps of:
step 1, establishing a video pyramid
The key to creating the video pyramid is the multi-scale facial region of interest. Let ω_l, h_l be the width and height of the level-l facial ROI; the multi-scale facial ROI is then defined as:

(ω_l, h_l) = (ω_0 / 2^l, h_0 / 2^l), l = -1, 0, 1, … (1)

where l indexes the pyramid levels and ω_0, h_0 are the width and height of the level-0 ROI. The facial ROIs of all scales share the same center point (C_x, C_y), and the level-0 ROI is defined as the bounding box tightly enclosing the face contour, which can be obtained with common face detectors and trackers. To construct the multi-scale facial ROI, on the one hand the initial ROI is halved step by step, i.e. l = 1, 2, 3, …; these ROIs mainly cover skin areas and involve no background pixels, and because they cover different areas of the face they exhibit correspondingly different color variations. On the other hand, the initial ROI is enlarged, i.e. l = -1, so that the pixel gray levels of the moving background in the video are taken into account when the signal extraction algorithm is applied;

The pixels in the facial multi-scale ROI are converted into the raw rPPG trajectory by spatial averaging, computed as:

I_{l,c}(t) = (1 / Area(R_l(t))) · Σ_{(x,y)∈R_l(t)} I_c(x, y, t), l = -1, 0, 1, … (2)

where I_c(x, y, t) is the gray level of the pixel at coordinates (x, y) in the t-th frame, c ∈ {R, G, B} is the color channel, R_l(t) is the level-l ROI of the t-th frame, and Area(R_l(t)) is the total number of pixels in R_l(t). The ROI tracking box of each frame is determined by the face tracker, and the raw rPPG trajectory is obtained by concatenating the averaged pixel gray levels I_{l,c}(t) over all t; it is denoted I_l(t);
Step 2 Blood Volume Pulse (BVP) Signal extraction
The Blood Volume Pulse (BVP) signal is extracted from the trajectory of multi-scale remote photoplethysmography (rPPG). Here we use the plane-orthogonal-to-skin method (POS), which has been widely adopted in rPPG research. POS defines a projection plane orthogonal to the skin-tone vector [1, 1, 1]^T to eliminate the dependence on skin tone. To separate the BVP signal from motion artifacts, the raw trajectory is projected onto two vectors spanning this plane, and signal processing by α-tuning finally yields the BVP signal. POS is applied to the raw tracking trajectory I_l(t) of every level of the video pyramid; the level index l is therefore temporarily omitted in the following formulas, and the trajectory is simply written I(t).
The raw trajectory is first temporally normalized,

I_n(t) = N · I(t) (3)

where N is a 3×3 diagonal matrix whose i-th diagonal element is the inverse of the mean of the i-th row of I, namely:

N_ii = 1 / μ(I_i) (4)

The temporally normalized trajectory is then projected onto the two vectors defined by the projection matrix P_p = [0, 1, -1; -2, 1, 1], whose rows are mutually orthogonal projection axes. The projected signals are:

S_1(t) = I_nG(t) - I_nB(t) (5)

S_2(t) = I_nG(t) + I_nB(t) - 2·I_nR(t) (6)

To separate the specular reflection and pulse signal components, S_1(t) and S_2(t) undergo α-tuning,

x(t) = S_1(t) + α·S_2(t) (7)

where α = σ(S_1(t)) / σ(S_2(t)) and σ(·) denotes the standard deviation. x(t) is the extracted BVP signal, also called the POS feature. Applying the POS algorithm to every level of the video pyramid yields L BVP signals, where L is the total number of pyramid levels and x_l(t) is the POS feature x(t) extracted at level l;
the application of POS in multi-scale tracking is popularization of an original POS algorithm, and the purpose of multi-scale POS feature extraction is to facilitate pulse extraction. In conventional single-scale extraction, all motion artifacts are accompanied by the BVP signal. Linear POS operation can only extract limited pure BVP signals in all sports and recording environments. While with the help of a multi-scale facial ROI, motion artifacts may be partially separated;
step 3. Multi-Scale Signal fusion
The final pulse signal is computed by fusing the candidate POS features extracted from the multi-scale trajectories. Since the candidate POS features are assumed to be complementary, we cast the signal fusion problem as a feature combination rather than a feature selection, for which a convex combination is used:

x(t) = Σ_l λ_l · x_l(t) (8)

where λ_l is the weight of the level-l scale and the level weights sum to 1, i.e. Σ_l λ_l = 1. The next, most critical step is to determine the weight of each level: POS features of different levels carry different heart rate energies, and larger weights should be assigned to the levels with stronger heart rate energy. The weights are determined with a Gaussian prior,

λ_l ∝ exp(-(l - μ_0)² / (2σ_0²)), l = -1, 0, 1, 2, 3, … (9)

where μ_0 and σ_0 are the center and standard deviation of the Gaussian over the level index; the prior reflects the observation that the pulse intensity in the middle levels is greater than in the lower and higher levels;

The operations discussed above all apply within one time window. For long detection videos, the windowed outputs are concatenated to obtain a long-term heart rate signal. Specifically, given a video of length N, we first split the sequence into segments of length T, apply the proposed algorithm to obtain the windowed outputs, and apply the overlap-add method to obtain the final heart rate output.
Claims (3)
1. A heart rate detection method based on multi-scale video, the method comprising the steps of:
step 1, establishing a video pyramid
Starting from the original tracking box, the size of the region of interest (ROI) is, on the one hand, reduced level by level and, on the other hand, enlarged by one level;
step 2, blood volume pulse BVP Signal extraction
The blood volume pulse signal is extracted from each scale channel. The multi-scale signal fusion algorithm is a generalization and refinement of the plane-orthogonal-to-skin method (POS): when the pyramid contains only a single level, the proposed heart rate extraction method reduces to POS;
step 3, multi-scale signal fusion
The heart rate signal features of the multiple scale channels are fused by a convex combination with a Gaussian prior, and the heart rate value is finally obtained by signal processing of the fused multi-scale signal;
the blood volume pulse BVP signals of each layer scale obtained in the step 1 and the step 2 need to be fused through a Gaussian prior convex combination, and the process is as follows:
the final pulse signal is calculated by fusing the candidate POS features extracted from the multi-scale trajectory, and since the candidate POS features are assumed to be complementary, the signal fusion problem is attributed to feature combinations rather than feature selection, and for this purpose, convex combinations are used:
l is the number of layers of the pyramid, lambda l Representing the weight of the first hierarchical scale and satisfying the relation of the hierarchical weights and equal to 1, i.e. Σ l λ l =1,x l (t) POS features x (t) extracted for the first tier, the next most critical step being to determine the weights of each level, POS features of different tiers having different heart rate energies, larger weights being assigned to those tiers having stronger heart rate energies, weights being determined using Gaussian priors,
l= -1,0,1,2,3 … represents the layer l scale, μ 0 Sum sigma 0 The exponential levels representing center and standard deviation, respectively, the gaussian priors are based on the pulse intensity in the middle layer being greater than the intensities in the lower and higher layers;
for long-time detection videos, windowing outputs are connected in series to obtain a long-time heart rate signal, a video with the length of N is given, a sequence is firstly divided into segments with the length of T, a proposed algorithm is applied to obtain windowed outputs, and an overlap-add method is applied to obtain final heart rate outputs.
2. The multi-scale video-based heart rate detection method of claim 1, wherein: in step 1 the key to building the video pyramid is the multi-scale facial region of interest; let ω_l, h_l be the width and height of the level-l facial ROI; the multi-scale facial ROI is then defined as:

(ω_l, h_l) = (ω_0 / 2^l, h_0 / 2^l), l = -1, 0, 1, … (1)

where l indexes the pyramid levels and ω_0, h_0 are the width and height of the level-0 ROI; the facial ROIs of all scales share the same center point (C_x, C_y), and the level-0 ROI is defined as the bounding box tightly enclosing the face contour, which can be obtained with common face detectors and trackers; to construct the multi-scale facial ROI, on the one hand the initial ROI is halved step by step, i.e. l = 1, 2, 3, …; these ROIs cover only skin areas and involve no background pixels, and because they cover different areas of the face they exhibit correspondingly different color variations; on the other hand, the initial ROI is enlarged, i.e. l = -1, so that the pixel gray levels of the moving background in the video are taken into account when the signal extraction algorithm is applied;

the pixels in the facial multi-scale ROI are converted into the raw rPPG trajectory by spatial averaging, computed as:

I_{l,c}(t) = (1 / Area(R_l(t))) · Σ_{(x,y)∈R_l(t)} I_c(x, y, t), l = -1, 0, 1, … (2)

where I_c(x, y, t) is the gray level of the pixel at coordinates (x, y) in the t-th frame, c ∈ {R, G, B} is the color channel, R_l(t) is the level-l ROI of the t-th frame, and Area(R_l(t)) is the total number of pixels in R_l(t); the ROI tracking box of each frame is determined by the face tracker, and the raw rPPG trajectory is obtained by concatenating the averaged pixel gray levels I_{l,c}(t) over all t; it is denoted I_l(t).
3. A multi-scale video-based heart rate detection method as claimed in claim 1 or 2, wherein: in step 2 the blood volume pulse (BVP) signal of each pyramid level is extracted from its remote-photoplethysmography (rPPG) trajectory, and with the help of the facial multi-scale ROI the motion artifacts are partially separated, as follows:

the plane-orthogonal-to-skin method (POS) is used; it defines a projection plane orthogonal to the skin-tone vector [1, 1, 1]^T to eliminate the dependence on skin tone; to separate the BVP signal from motion artifacts, the raw trajectory is projected onto two vectors spanning this plane, and signal processing by α-tuning finally yields the BVP signal; POS is applied to the raw tracking trajectory I_l(t) of every level of the video pyramid, so the level index l is temporarily omitted in the following formulas and the trajectory is simply written I(t);

the raw trajectory is first temporally normalized,

I_n(t) = N · I(t) (3)

where N is a 3×3 diagonal matrix whose i-th diagonal element is the inverse of the mean of the i-th row of I, namely:

N_ii = 1 / μ(I_i) (4)

the temporally normalized trajectory is then projected onto the two vectors defined by the projection matrix P_p = [0, 1, -1; -2, 1, 1], whose rows are mutually orthogonal projection axes; the projected signals are:

S_1(t) = I_nG(t) - I_nB(t) (5)

S_2(t) = I_nG(t) + I_nB(t) - 2·I_nR(t) (6)

to separate the specular reflection and pulse signal components, S_1(t) and S_2(t) undergo α-tuning,

x(t) = S_1(t) + α·S_2(t) (7)

where α = σ(S_1(t)) / σ(S_2(t)) and σ(·) denotes the standard deviation; x(t) is the extracted BVP signal, also called the POS feature; applying the POS algorithm to every level of the video pyramid yields L BVP signals, where L is the total number of pyramid levels and x_l(t) is the POS feature x(t) extracted at level l;

the linear POS operation alone can extract only a limited amount of clean BVP signal across all motions and recording environments, while with the help of the multi-scale facial ROI the motion artifacts are partially separated.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010285626.7A CN111597891B (en) | 2020-04-13 | 2020-04-13 | Heart rate detection method based on multi-scale video |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010285626.7A CN111597891B (en) | 2020-04-13 | 2020-04-13 | Heart rate detection method based on multi-scale video |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111597891A CN111597891A (en) | 2020-08-28 |
CN111597891B true CN111597891B (en) | 2023-07-25 |
Family
ID=72190634
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010285626.7A Active CN111597891B (en) | 2020-04-13 | 2020-04-13 | Heart rate detection method based on multi-scale video |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111597891B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113892930B (en) * | 2021-12-10 | 2022-04-22 | 之江实验室 | Facial heart rate measuring method and device based on multi-scale heart rate signals |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
GB201601140D0 (en) * | 2016-01-21 | 2016-03-09 | Oxehealth Ltd | Method and apparatus for estimating heart rate |
CN109793506A (en) * | 2019-01-18 | 2019-05-24 | 合肥工业大学 | A kind of contactless radial artery Wave shape extracting method |
CN110084085A (en) * | 2018-11-06 | 2019-08-02 | 天津工业大学 | RPPG high-precision heart rate detection method based on shaped signal |
CN110353646A (en) * | 2019-07-29 | 2019-10-22 | 苏州市高事达信息科技股份有限公司 | Contactless heart rate detection method |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8360986B2 (en) * | 2006-06-30 | 2013-01-29 | University Of Louisville Research Foundation, Inc. | Non-contact and passive measurement of arterial pulse through thermal IR imaging, and analysis of thermal IR imagery |
US10448846B2 (en) * | 2014-12-16 | 2019-10-22 | Oxford University Innovation Limited | Method and apparatus for measuring and displaying a haemodynamic parameter |
Non-Patent Citations (5)
Title |
---|
De Haan G et al. Robust pulse rate from chrominance-based rPPG. IEEE Transactions on Biomedical Engineering. 2013, Vol. 60, No. 10. * |
Gambi E et al. Heart rate detection using Microsoft Kinect: validation and comparison to wearable devices. Sensors. 2017, Vol. 17, No. 8. * |
Patil O R et al. A camera-based continuous PPG monitoring system using Laplacian pyramid. Smart Health. 2018, Vol. 9. * |
Wang W et al. Robust heart rate from fitness videos. Physiological Measurement. 2017, Vol. 38, No. 6. * |
Lu Xue. A multimedia playback control system based on facial video. China Master's Theses Full-text Database. 2018. * |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||