CN115024706A - Non-contact heart rate measurement method integrating ConvLSTM and CBAM attention mechanism - Google Patents

Non-contact heart rate measurement method integrating ConvLSTM and CBAM attention mechanism Download PDF

Info

Publication number
CN115024706A
Authority
CN
China
Prior art keywords
convlstm
layer
heart rate
cbam
cbam attention
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210529339.5A
Other languages
Chinese (zh)
Inventor
戎舟
王宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN202210529339.5A priority Critical patent/CN115024706A/en
Publication of CN115024706A publication Critical patent/CN115024706A/en

Classifications

    • AHUMAN NECESSITIES
    • A61MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61BDIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B5/00Measuring for diagnostic purposes; Identification of persons
    • A61B5/02Detecting, measuring or recording pulse, heart rate, blood pressure or blood flow; Combined pulse/heart-rate/blood pressure determination; Evaluating a cardiovascular condition not otherwise provided for, e.g. using combinations of techniques provided for in this group with electrocardiography or electroauscultation; Heart catheters for measuring blood pressure
    • A61B5/024Detecting, measuring or recording pulse rate or heart rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions

Landscapes

  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Cardiology (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • General Physics & Mathematics (AREA)
  • Biophysics (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Physiology (AREA)
  • Pathology (AREA)
  • Heart & Thoracic Surgery (AREA)
  • Medical Informatics (AREA)
  • Surgery (AREA)
  • Animal Behavior & Ethology (AREA)
  • Public Health (AREA)
  • Veterinary Medicine (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a non-contact heart rate measurement method fusing ConvLSTM and a CBAM attention mechanism. A face key-point detection model extracts a region-of-interest (ROI) image sequence from the face video; the physiological signals in the dataset are resampled at the video frame rate; and a sliding window segments both the video and the physiological signals, producing a dataset of face-ROI video clips paired with their corresponding physiological signals. A deep learning network model fusing ConvLSTM and the CBAM attention mechanism is then constructed: ConvLSTM learns the spatio-temporal features of the video sequence, while the CBAM attention mechanism strengthens the information of important channels in the feature map and attends to the regions where the physiological signal is strongest. The dataset is fed to the network for training, and the best model is selected after training.

Description

Non-contact heart rate measurement method integrating ConvLSTM and CBAM attention mechanism
Technical Field
The invention relates to a non-contact heart rate measurement method integrating ConvLSTM and CBAM attention mechanisms, and belongs to the field of video analysis.
Background
Heart rate is one of the most basic and important pieces of physiological information, and the principal physiological index for judging whether the heart is healthy. Traditional heart rate measurement devices such as the electrocardiograph and the pulse oximeter must be attached closely to the body, which easily causes discomfort, and they require operation by professionals, which is unfavorable for real-time and long-term heart rate monitoring. Compared with traditional contact measurement, video-based non-contact measurement is more comfortable, flexible, and convenient, has few restrictions, and suits a wider range of settings.
Hemoglobin in human blood absorbs light to a certain degree. As the heart beats, blood is continuously pumped into and drawn out of the vessels, so the hemoglobin content of the blood changes with each heartbeat, and with it the amount of light absorbed. The skin of the human face is rich in capillaries, so when light strikes its surface, the skin color undergoes weak variations. Although invisible to the naked eye, these variations can be recorded by an RGB camera; this is the theoretical basis for obtaining heart rate from the face.
Early non-contact heart rate measurement methods generally comprised the following steps: face detection and ROI extraction; physiological signal extraction, including blind-source-separation algorithms such as independent component analysis and optical-reflectance-model methods such as the color-difference method and the plane-orthogonal-to-skin method; physiological signal processing, such as detrending, normalization, and band-pass filtering; and heart rate estimation, either by Fourier-transforming the extracted physiological signal into the frequency domain and computing the frequency of its peak, or by peak detection directly on the signal.
However, studies have shown that conventional non-contact heart rate detection faces many problems that are difficult to solve: uncontrollable factors such as illumination changes and subject motion make the video analysis considerably harder.
Deep learning has been a popular research direction in machine learning in recent years and has achieved great success in computer vision, natural language processing, and other fields. In image feature extraction in particular, deep learning models handle complex data very well, and the introduction of attention mechanisms can highlight the features of useful image regions while suppressing those of useless regions, making deep learning well suited to the non-contact heart rate measurement task.
Disclosure of Invention
In order to solve the problems in the prior art, a non-contact heart rate measurement method fusing ConvLSTM and the CBAM attention mechanism is provided to improve the robustness and generalization capability of non-contact heart rate measurement.
The invention adopts the following technical scheme for solving the technical problems:
a non-contact heart rate measurement method fusing ConvLSTM and CBAM attention mechanisms comprises the following specific steps:
acquiring a data set containing a face video and a physiological signal label, and preprocessing the data set;
constructing a network model fusing a ConvLSTM and CBAM attention mechanism;
predicting a non-contact heart rate rPPG signal by using the trained network model;
filtering the prediction result by using a Butterworth band-pass filter;
and calculating the power spectral density of the filtered rPPG signal to complete the measurement of the heart rate.
Further, the preprocessing steps are as follows:
1.1, using a face key-point detection model to detect the face video in the dataset frame by frame and crop out the region-of-interest (ROI) images, obtaining an ROI image sequence X = {x_0, x_1, …, x_t, …, x_{T−1}}, where x_t denotes the ROI image of the t-th frame and T denotes the total number of frames of the face video;
1.2, resampling the physiological signal label at the face-video frame rate, obtaining the physiological signal label sequence P = {p_0, p_1, …, p_t, …, p_{T−1}};
1.3, segmenting the ROI image sequence and the physiological signal label sequence with a sliding window.
Further, the network model fusing the ConvLSTM and CBAM attention mechanisms comprises, in order: a first convolutional layer, a first ConvLSTM layer, N cascaded convolution blocks fused with the CBAM attention mechanism, a second ConvLSTM layer, a fully connected layer, and a Dropout layer; each convolution block fused with the CBAM attention mechanism comprises, in order, a second convolutional layer, a first BN layer, a first activation-function layer, a third convolutional layer, a second BN layer, a second activation-function layer, and a CBAM layer.
Further, the first convolutional layer comprises a two-dimensional convolutional layer with a 7×7 convolution kernel and a Maxpool layer.
Further, the negative pearson correlation coefficient is selected as a loss function of the network model:
Loss = 1 − (T·Σx_t·y_t − Σx_t·Σy_t) / ( √(T·Σx_t² − (Σx_t)²) · √(T·Σy_t² − (Σy_t)²) ),  sums over t = 1, …, T
where T denotes the total number of frames of the face video, x_t denotes the prediction of the rPPG signal at the t-th frame, and y_t denotes the physiological signal label corresponding to x_t.
Further, the calculation formula of the power spectral density of the filtered rPPG signal is as follows:
PSD(f) = (1 / (f_s·T)) · |Σ_{n=0}^{T−1} f(x_n; θ)·e^{−j2πfn/f_s}|²
where f(x_n; θ) denotes the output of the network model for x_n, the n-th picture input to the network model, θ denotes all parameters of the network model, f denotes the evaluation frequency (the frequency of the physiological signal label serves as its reference), and f_s denotes the face-video frame rate.
Compared with the prior art, the technical scheme adopted by the invention has the following technical effects:
1) In the prior art, the training sample is generally a face-ROI image sequence corresponding to a training video-frame sequence together with its ground-truth average heart rate value, and the loss is computed with a cross-entropy loss function. In some scenarios, however, an average heart rate value cannot adequately reflect a person's heart rate within a set time span. The present method instead regresses the rPPG signal over the set time span with a Pearson-correlation loss, which better models the trend of the physiological signal within that span, measures heart rate better, and improves measurement accuracy;
2) Fusing ConvLSTM with the CBAM attention mechanism lets the network better learn the spatial features within video frames and the temporal features between frames, attend more to the regions of the ROI where the physiological-signal features are most salient, reduce the interference of invalid information, and improve the robustness of the network.
Drawings
FIG. 1 is a flow chart of a non-contact heart rate measurement according to the present invention;
FIG. 2 is a structural diagram of a non-contact heart rate measurement model integrating ConvLSTM and CBAM attention mechanism according to the present invention;
FIG. 3 is a block diagram of a convolution block incorporating the CBAM attention mechanism of the present invention;
FIG. 4 is a diagram of a CBAM attention mechanism.
Detailed Description
The technical solution of the present invention is further explained with reference to the accompanying drawings and the specific embodiments.
In one embodiment, a non-contact heart rate measurement method fusing ConvLSTM and the CBAM attention mechanism is provided: a dataset containing face videos and physiological-signal labels is acquired and preprocessed, and a network model is trained, realizing non-contact heart rate measurement. The invention can effectively extract the rPPG signal and calculate the heart rate from it. As shown in FIG. 1, the method comprises the following steps:
1. Acquire a dataset containing face videos and physiological-signal labels, including the VIPL-HR dataset and the PURE dataset, and preprocess it.
1.1 Obtain the face key points with a face key-point detection model and, from these key points, crop the facial region above the lips and below the eyes as the ROI; scale the ROI image to 128×128 pixels using Lanczos interpolation over an 8×8 pixel neighborhood. For frames in which motion is small but the detector fails to find a face, reuse the detection of the previous frame; frames whose motion is so large that the ROI would contain little facial information are discarded, and detection resumes from the next frame, finally yielding the ROI video sequence.
1.2 Resample the physiological-signal label at the video frame rate so that the resampled signal has the same length as the number of video frames in the previous step, aligning the video frames and the physiological-signal labels.
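As an illustrative sketch of step 1.2 (the patent does not specify the resampling method; linear interpolation via NumPy is an assumption here), the label alignment can be written as:

```python
import numpy as np

def resample_labels(labels, label_rate, frame_rate, n_frames):
    """Resample a physiological-signal label sequence so that it has one
    label per video frame (same length as the video)."""
    t_labels = np.arange(len(labels)) / label_rate   # label timestamps (s)
    t_frames = np.arange(n_frames) / frame_rate      # video-frame timestamps (s)
    return np.interp(t_frames, t_labels, labels)     # linear interpolation

# e.g. a 60 Hz PPG label aligned to a 30 fps, 128-frame video
ppg = np.sin(2 * np.pi * 1.2 * np.arange(300) / 60.0)
aligned = resample_labels(ppg, label_rate=60.0, frame_rate=30.0, n_frames=128)
```

Each resampled label then corresponds in time to exactly one video frame.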
1.3 To enlarge the dataset, segment the ROI video sequence and the label signal with a window length of 128 and a step size of 30, obtaining training ROI video sequences X = {x_0, x_1, …, x_t, …, x_{T−1}} and label heart-rate signals P = {p_0, p_1, …, p_t, …, p_{T−1}}, where T = 128 and each x_t has size 128×128×3.
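The sliding-window segmentation of step 1.3 (window length 128, step 30) can be sketched as follows; the helper name is illustrative, not from the patent:

```python
import numpy as np

def sliding_window(frames, labels, length=128, step=30):
    """Cut an aligned ROI frame sequence and label sequence into
    overlapping clips of `length` frames with stride `step`."""
    clips, targets = [], []
    for start in range(0, len(frames) - length + 1, step):
        clips.append(frames[start:start + length])
        targets.append(labels[start:start + length])
    return np.stack(clips), np.stack(targets)

# 300 aligned frames (128x128 RGB) and labels -> overlapping clips of 128 frames
frames = np.zeros((300, 128, 128, 3), dtype=np.uint8)
labels = np.arange(300, dtype=np.float64)
X, P = sliding_window(frames, labels)
```

With 300 frames, window starts fall at 0, 30, …, 150, yielding six overlapping clips per sequence.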
2. Train the neural network model fusing the ConvLSTM and CBAM attention mechanisms, shown in FIG. 2.
2.1 Set the batch size to 16; the data read in has shape [B, C, T, W, H], where B is the batch size, C is the number of image channels (3), T is the number of video frames per sample (128), and W and H are the frame width and height (both 128). To meet the requirements of 2D convolution, the data is reshaped into a 4-dimensional tensor [B×T, C, W, H] before being input to the model.
2.2 The data is input to a convolution block with a 7×7 kernel, stride 1, and padding 1, which maps the raw skin image to the pulse feature space. Besides the convolutional layer, this block contains a BatchNorm layer and a ReLU activation layer. A Maxpool layer then extracts the salient responses of the feature map while reducing its dimensions, yielding the pulse feature map.
2.3 Construct ConvLSTM1. In this embodiment ConvLSTM layers are used as the second layer and the penultimate layer of the network; they differ mainly in their input and output channel parameters and otherwise share the same configuration. The basic construction is ConvLSTM1 = ConvLSTM(input_dims, [1, 1, output_dims], (1, 1), num_layers=3, batch_first=True, bias=True), where the convolution kernel size is 1×1, input_dims is the number of output channels of the previous layer, and [1, 1, output_dims] gives the hidden-layer dimensions of the num_layers=3 stacked layers: the hidden dimension of the first and second layers is 1, and that of the last layer is output_dims (set equal to input_dims in this layer), so the number of output channels is output_dims. Feeding the feature map from the previous step into the ConvLSTM1 layer extracts its spatio-temporal features without changing its size.
2.4 Build 4 convolutional layers fused with the CBAM attention mechanism as shown in FIG. 3, each comprising two convolution blocks and a CBAM attention module, which consists of a channel attention mechanism and a spatial attention mechanism as shown in FIG. 4. Each convolution block contains a convolutional layer, a normalization layer, and an activation-function layer. All blocks use 3×3 kernels; in the last layer both convolution blocks use stride 1 and padding 1, while in the other layers the first block uses stride 2 and padding 1 and the second block uses stride 1 and padding 1. After the feature map from the previous layer passes through the two convolution blocks, a new feature map [B, T, C', H', W'] is obtained, and each feature map F of size [C', H', W'] passes through the channel attention mechanism and then the spatial attention mechanism.
Further, the channel attention mechanism process comprises:
(1) Apply global max pooling and global average pooling over the height and width of the input feature map F (H'×W'×C'), obtaining two 1×1×C' feature maps;
(2) feed both into a shared two-layer neural network (MLP) whose first layer has C'/r neurons (r is the reduction ratio) with ReLU activation and whose second layer has C' neurons;
(3) add the two MLP outputs element-wise and apply sigmoid activation to produce the final channel attention map M_c (1×1×C'). Finally, multiply M_c element-wise with the input feature map F to produce the input feature F' (H'×W'×C') required by the subsequent spatial attention module. The formula is:
M_c(F) = σ(MLP(AvgPool(F)) + MLP(MaxPool(F)))
where AvgPool denotes average pooling, MaxPool denotes max pooling, F denotes the input feature map, M_c denotes the channel attention map, MLP denotes the shared two-layer neural network, and σ denotes the sigmoid activation function.
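A minimal NumPy sketch of the channel-attention formula above (the shared MLP weights are random placeholders, since the patent fixes only the structure, not the values):

```python
import numpy as np

def channel_attention(F, r=4, seed=0):
    """CBAM channel attention: M_c(F) = sigmoid(MLP(AvgPool(F)) + MLP(MaxPool(F))).
    F has shape (C, H, W); returns F' = F * M_c, rescaled per channel."""
    rng = np.random.default_rng(seed)
    C = F.shape[0]
    W1 = rng.standard_normal((C // r, C)) * 0.1   # shared MLP layer 1: C -> C/r
    W2 = rng.standard_normal((C, C // r)) * 0.1   # shared MLP layer 2: C/r -> C
    avg = F.mean(axis=(1, 2))                     # global average pooling -> (C,)
    mx = F.max(axis=(1, 2))                       # global max pooling     -> (C,)
    mlp = lambda v: W2 @ np.maximum(W1 @ v, 0.0)  # ReLU between the two layers
    M_c = 1.0 / (1.0 + np.exp(-(mlp(avg) + mlp(mx))))   # sigmoid -> (C,)
    return F * M_c[:, None, None]                 # per-channel reweighting

F = np.random.default_rng(1).standard_normal((16, 8, 8))
F_prime = channel_attention(F)
```

Because the sigmoid keeps every weight in (0, 1), each channel of F is attenuated in proportion to its estimated importance.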
Further, the spatial attention mechanism process is as follows:
and performing global average pooling and global maximum pooling on the feature map F ' (H ' × W ' × C ') in a channel dimension to obtain two (H ' × W ' × 1) feature maps, performing convolution operation by using a 7 × 7 convolution kernel, performing sigmoid activation to obtain a two-dimensional spatial attention map M _ s (H ' × W ' × 1), and performing element-by-element multiplication on the two-dimensional spatial attention map M _ s and the input feature map F ' to obtain an optimized feature map F ' (H ' × W ' × C '). The concrete formula is as follows:
M_s(F) = σ(f7×7([AvgPool(F); MaxPool(F)]))
where F denotes the input feature map, AvgPool denotes average pooling, MaxPool denotes max pooling, f7×7 denotes a convolutional layer with a 7×7 kernel, and σ denotes the sigmoid activation function.
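Similarly, a sketch of the spatial-attention formula, using a naive "same" convolution with random placeholder 7×7 weights:

```python
import numpy as np

def spatial_attention(F, kernel=7, seed=0):
    """CBAM spatial attention: M_s(F) = sigmoid(conv7x7([AvgPool_c(F); MaxPool_c(F)])).
    F has shape (C, H, W); pooling here is over the channel axis."""
    rng = np.random.default_rng(seed)
    C, H, W = F.shape
    pooled = np.stack([F.mean(axis=0), F.max(axis=0)])  # (2, H, W)
    w = rng.standard_normal((2, kernel, kernel)) * 0.1  # placeholder 7x7 weights
    pad = kernel // 2
    padded = np.pad(pooled, ((0, 0), (pad, pad), (pad, pad)))
    M_s = np.empty((H, W))
    for i in range(H):                                  # naive "same" convolution
        for j in range(W):
            M_s[i, j] = np.sum(padded[:, i:i + kernel, j:j + kernel] * w)
    M_s = 1.0 / (1.0 + np.exp(-M_s))                    # sigmoid -> (H, W)
    return F * M_s[None, :, :]                          # per-position reweighting

F = np.random.default_rng(1).standard_normal((16, 8, 8))
out = spatial_attention(F)
```

The single (H, W) attention map is broadcast across all channels, so spatially salient positions are emphasized in every channel at once.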
2.5 construct ConvLSTM2 according to the method in step 2.3 based on the output feature map dimension of the previous layer, further emphasizing the spatio-temporal features of the feature map.
2.6 The obtained features are regressed to the rPPG signal through the fully connected layer. Because the fully connected layer has many parameters and overfits easily, a Dropout layer is added after it to reduce overfitting and increase the generalization capability of the model. The output is the predicted rPPG signal of shape [B, T], where B is the training batch size and T is the number of frames in the video clip.
2.7 Compute the loss with the negative Pearson correlation coefficient, as shown in the following formula:
Loss = 1 − (T·Σx_t·y_t − Σx_t·Σy_t) / ( √(T·Σx_t² − (Σx_t)²) · √(T·Σy_t² − (Σy_t)²) ),  sums over t = 1, …, T
where T denotes the number of frames contained in a video clip, x_t denotes the prediction of the rPPG signal at the t-th frame, and y_t denotes the physiological signal label corresponding to x_t.
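After mean-centering, the negative-Pearson loss reduces to 1 minus the cosine similarity of the centered signals; a sketch:

```python
import numpy as np

def neg_pearson_loss(pred, label):
    """Negative Pearson correlation loss: 1 - corr(pred, label).
    0 for a perfectly correlated prediction, 2 for a perfectly
    anti-correlated one."""
    p = pred - pred.mean()
    y = label - label.mean()
    return 1.0 - (p @ y) / np.sqrt((p @ p) * (y @ y))

t = np.linspace(0.0, 4.0, 128)
y = np.sin(2 * np.pi * 1.2 * t)   # a synthetic label pulse waveform
```

Note that the loss is invariant to the scale and offset of the prediction, which is exactly why it tracks the *trend* of the physiological signal rather than its absolute values.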
2.8 Repeat the process of step 2, optimizing the model parameters with an Adam optimizer at a learning rate of 0.0001, until the loss has not decreased for 5 consecutive epochs, obtaining the optimized network model.
3. Normalize the rPPG signal output by the model, then filter it with a Butterworth band-pass filter to remove noise and harmonic components outside the normal human heart-rate band (H_min, H_max), obtaining the filtered rPPG signal R = {r_0, r_1, …, r_t, …, r_{T−1}}.
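Step 3 can be sketched with a zero-phase FFT band-pass mask in place of the Butterworth filter — a dependency-free substitute, plainly not the patent's filter — and with 0.7–3 Hz (about 42–180 bpm) assumed as (H_min, H_max):

```python
import numpy as np

def bandpass_fft(signal, fs, f_lo=0.7, f_hi=3.0):
    """Keep only spectral content inside the heart-rate band [f_lo, f_hi] Hz
    by zeroing all other FFT bins (an ideal band-pass stand-in for Butterworth)."""
    spec = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    spec[(freqs < f_lo) | (freqs > f_hi)] = 0.0
    return np.fft.irfft(spec, n=len(signal))

fs = 30.0                                # video frame rate
t = np.arange(256) / fs
raw = np.sin(2 * np.pi * 1.2 * t) + 0.5 * np.sin(2 * np.pi * 5.0 * t) + 0.3  # pulse + noise + offset
clean = bandpass_fft(raw, fs)
```

The 5 Hz harmonic and the DC offset fall outside the pass band and are removed, leaving the 1.2 Hz pulse component.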
4 calculating the power spectral density PSD of the filtered rPPG signal according to the following formula:
PSD(f) = (1 / (f_s·T)) · |Σ_{n=0}^{T−1} f(x_n; θ)·e^{−j2πfn/f_s}|²
where f(x_n; θ) denotes the output of the network for x_n, the n-th input frame picture, θ denotes all parameters of the network, f denotes the evaluation frequency (the frequency of the physiological-signal label serves as its reference), and f_s denotes the video frame rate.
Further, the frequency f at which the power spectrum peaks is the heart-rate frequency, so the heart rate is 60 × f (beats per minute).
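Putting step 4 and the peak rule together, heart rate can be read off the periodogram as follows (a sketch; the 0.7–3 Hz search band is an assumption):

```python
import numpy as np

def heart_rate_from_psd(rppg, fs):
    """Estimate heart rate in bpm as 60 * f_peak, where f_peak maximizes
    the periodogram PSD of the filtered rPPG signal."""
    n = len(rppg)
    psd = np.abs(np.fft.rfft(rppg)) ** 2 / (fs * n)   # periodogram estimate
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    band = (freqs >= 0.7) & (freqs <= 3.0)            # plausible heart-rate band
    f_peak = freqs[band][np.argmax(psd[band])]
    return 60.0 * f_peak

fs = 30.0
t = np.arange(480) / fs                               # a 16 s clip at 30 fps
rppg = np.sin(2 * np.pi * 1.25 * t)                   # 1.25 Hz pulse -> 75 bpm
hr = heart_rate_from_psd(rppg, fs)
```

Restricting the argmax to the physiological band keeps residual low-frequency drift or high-frequency noise from being mistaken for the pulse peak.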
In one embodiment, a non-contact heart rate measurement device fusing the ConvLSTM and CBAM attention mechanisms is provided, which comprises a memory and a processor, wherein the memory stores a computer program, and the processor realizes the steps of the non-contact heart rate measurement method fusing the ConvLSTM and CBAM attention mechanisms when executing the computer program.
In one embodiment, a computer readable storage medium is provided, having stored thereon a computer program which, when executed by a processor, implements the steps of the above-described method of non-contact heart rate measurement incorporating the ConvLSTM and CBAM attention mechanisms.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by hardware instructions of a computer program, which can be stored in a non-volatile computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory, among others. Non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDRSDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), Rambus Direct RAM (RDRAM), direct bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM).
It should be noted that the above description of the embodiments is only for the purpose of assisting understanding of the method of the present application and the core idea thereof, and that those skilled in the art can make several improvements and modifications to the present application without departing from the principle of the present application, and these improvements and modifications are also within the protection scope of the claims of the present application.

Claims (8)

1. A non-contact heart rate measurement method fusing a ConvLSTM and CBAM attention mechanism is characterized by comprising the following specific steps:
acquiring a data set containing a face video and a physiological signal label, and preprocessing the data set;
constructing a network model fusing a ConvLSTM and CBAM attention mechanism;
predicting a non-contact heart rate rPPG signal by using the trained network model;
filtering the prediction result by using a Butterworth band-pass filter;
and calculating the power spectral density of the filtered rPPG signal to complete the measurement of the heart rate.
2. The method for non-contact heart rate measurement fusing the ConvLSTM and CBAM attention mechanisms as claimed in claim 1, wherein the preprocessing step is as follows:
1.1, using a face key-point detection model to detect the face video in the dataset frame by frame and crop out the region-of-interest (ROI) images, obtaining an ROI image sequence X = {x_0, x_1, …, x_t, …, x_{T−1}}, where x_t denotes the ROI image of the t-th frame and T denotes the total number of frames of the face video;
1.2, resampling the physiological signal label at the face-video frame rate, obtaining the physiological signal label sequence P = {p_0, p_1, …, p_t, …, p_{T−1}};
1.3, segmenting the ROI image sequence and the physiological signal label sequence with a sliding window.
3. The method for non-contact heart rate measurement fusing the ConvLSTM and CBAM attention mechanisms as claimed in claim 1, wherein the network model fusing the ConvLSTM and CBAM attention mechanisms comprises, in order: a first convolutional layer, a first ConvLSTM layer, N cascaded convolution blocks fused with the CBAM attention mechanism, a second ConvLSTM layer, a fully connected layer, and a Dropout layer; each convolution block fused with the CBAM attention mechanism comprises, in order, a second convolutional layer, a first BN layer, a first activation-function layer, a third convolutional layer, a second BN layer, a second activation-function layer, and a CBAM layer.
4. The ConvLSTM and CBAM attention-directed non-contact heart rate measurement method of claim 3, wherein the first convolutional layer comprises a two-dimensional convolutional layer with a 7×7 convolution kernel and a Maxpool layer.
5. The method for non-contact heart rate measurement fusing the ConvLSTM and CBAM attention mechanisms as claimed in claim 3, wherein the negative Pearson correlation coefficient is selected as the loss function of the network model:
Loss = 1 − (T·Σx_t·y_t − Σx_t·Σy_t) / ( √(T·Σx_t² − (Σx_t)²) · √(T·Σy_t² − (Σy_t)²) ),  sums over t = 1, …, T
where T denotes the total number of frames of the face video, x_t denotes the prediction of the rPPG signal at the t-th frame, and y_t denotes the physiological signal label corresponding to x_t.
6. The method for non-contact heart rate measurement fusing the ConvLSTM and CBAM attention mechanisms as claimed in claim 3, wherein the power spectral density of the filtered rPPG signal is calculated by the formula:
PSD(f) = (1 / (f_s·T)) · |Σ_{n=0}^{T−1} f(x_n; θ)·e^{−j2πfn/f_s}|²
where f(x_n; θ) denotes the output of the network model for x_n, the n-th picture input to the network model, θ denotes all parameters of the network model, f denotes the evaluation frequency (the frequency of the physiological signal label serves as its reference), and f_s denotes the face-video frame rate.
7. A contactless heart rate measurement device incorporating the ConvLSTM and CBAM attention mechanisms, comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the contactless heart rate measurement method incorporating the ConvLSTM and CBAM attention mechanisms as claimed in any one of claims 1 to 6.
8. A computer-readable storage medium, on which a computer program is stored, which computer program, when being executed by a processor, carries out the steps of the method for contactless heart rate measurement fusing the ConvLSTM and CBAM attention mechanisms of any of the claims 1 to 6.
CN202210529339.5A 2022-05-16 2022-05-16 Non-contact heart rate measurement method integrating ConvLSTM and CBAM attention mechanism Pending CN115024706A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210529339.5A CN115024706A (en) 2022-05-16 2022-05-16 Non-contact heart rate measurement method integrating ConvLSTM and CBAM attention mechanism


Publications (1)

Publication Number Publication Date
CN115024706A true CN115024706A (en) 2022-09-09

Family

ID=83121389

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210529339.5A Pending CN115024706A (en) 2022-05-16 2022-05-16 Non-contact heart rate measurement method integrating ConvLSTM and CBAM attention mechanism

Country Status (1)

Country Link
CN (1) CN115024706A (en)


Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115624322A (en) * 2022-11-17 2023-01-20 北京科技大学 Non-contact physiological signal detection method and system based on efficient space-time modeling
CN116385837A (en) * 2023-04-24 2023-07-04 天津大学 Self-supervision pre-training method for remote physiological measurement based on mask self-encoder
CN116385837B (en) * 2023-04-24 2023-09-08 天津大学 Self-supervision pre-training method for remote physiological measurement based on mask self-encoder
CN116524612A (en) * 2023-06-21 2023-08-01 长春理工大学 rPPG-based human face living body detection system and method
CN116524612B (en) * 2023-06-21 2023-09-12 长春理工大学 rPPG-based human face living body detection system and method
CN117437678A (en) * 2023-11-01 2024-01-23 烟台持久钟表有限公司 Front face duration statistics method, system, device and storage medium
CN117542103A (en) * 2023-12-18 2024-02-09 齐鲁工业大学(山东省科学院) Non-contact heart rate detection method based on multi-scale space-time feature map
CN117877099A (en) * 2024-03-11 2024-04-12 南京信息工程大学 Non-supervision contrast remote physiological measurement method based on space-time characteristic enhancement
CN117877099B (en) * 2024-03-11 2024-05-14 南京信息工程大学 Non-supervision contrast remote physiological measurement method based on space-time characteristic enhancement
CN118279964A (en) * 2024-06-04 2024-07-02 长春理工大学 Passenger cabin comfort level recognition system and method based on face video non-contact measurement

Similar Documents

Publication Publication Date Title
CN115024706A (en) Non-contact heart rate measurement method integrating ConvLSTM and CBAM attention mechanism
WO2021184620A1 (en) Camera-based non-contact heart rate and body temperature measurement method
CN110944577B (en) Method and system for detecting blood oxygen saturation
CN109902558B (en) CNN-LSTM-based human health deep learning prediction method
WO2021057423A1 (en) Image processing method, image processing apparatus, and storage medium
CN112465905A (en) Characteristic brain region positioning method of magnetic resonance imaging data based on deep learning
CN102973253A (en) Method and system for monitoring human physiological indexes by using visual information
Li et al. Non-contact PPG signal and heart rate estimation with multi-hierarchical convolutional network
CN112001122A (en) Non-contact physiological signal measuring method based on end-to-end generation countermeasure network
Revanur et al. Instantaneous physiological estimation using video transformers
CN114821439A (en) Token learning-based face video heart rate estimation system and method
Yin et al. Heart rate estimation based on face video under unstable illumination
CN114943924B (en) Pain assessment method, system, equipment and medium based on facial expression video
Kang et al. Transppg: Two-stream transformer for remote heart rate estimate
CN115359557A (en) Fall detection method and system based on Transformer
Wang et al. TransPhys: Transformer-based unsupervised contrastive learning for remote heart rate measurement
CN113576475B (en) Deep learning-based contactless blood glucose measurement method
CN116994310B (en) Remote heart rate detection method based on rPPG signal
CN117542103A (en) Non-contact heart rate detection method based on multi-scale space-time feature map
Kwaśniewska et al. Real-time facial features detection from low resolution thermal images with deep classification models
Kuang et al. Shuffle-rPPGNet: Efficient network with global context for remote heart rate variability measurement
CN115909438A (en) Pain expression recognition system based on depth time-space domain convolutional neural network
CN113240799B (en) Tooth three-dimensional model construction system based on medical big data
CN114694211A (en) Non-contact synchronous detection method and system for multiple physiological parameters
CN115245318A (en) Automatic identification method of effective IPPG signal based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination