CN116385837A - Self-supervised pre-training method for remote physiological measurement based on a masked autoencoder - Google Patents

Self-supervised pre-training method for remote physiological measurement based on a masked autoencoder

Info

Publication number
CN116385837A
CN116385837A (application CN202310445533.XA; granted as CN116385837B)
Authority
CN
China
Prior art keywords
space-time diagram, ViT, encoder, self-supervision
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310445533.XA
Other languages
Chinese (zh)
Other versions
CN116385837B (en)
Inventor
Liu Xin (刘鑫)
Zhang Yuting (张雨婷)
Yu Zitong (余梓彤)
Yue Huanjing (岳焕景)
Yang Jingyu (杨敬钰)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University
Priority to CN202310445533.XA
Publication of CN116385837A
Application granted
Publication of CN116385837B
Legal status: Active

Classifications

    • G06V 10/774 — Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
    • G06V 10/7753 — Incorporation of unlabelled data, e.g. multiple instance learning [MIL]
    • A61B 5/02416 — Detecting, measuring or recording pulse rate or heart rate using photoplethysmograph signals
    • G06N 3/0455 — Auto-encoder networks; encoder-decoder networks
    • G06N 3/088 — Non-supervised learning, e.g. competitive learning
    • G06N 3/0895 — Weakly supervised learning, e.g. semi-supervised or self-supervised learning
    • G06T 7/0012 — Biomedical image inspection
    • G06V 10/82 — Image or video recognition using neural networks
    • G06V 20/46 — Extracting features or characteristics from video content
    • G06V 40/165 — Face detection; localisation; normalisation using facial parts and geometric relationships
    • G06V 40/171 — Local features and components; facial parts
    • G06T 2207/20081 — Training; learning
    • G06T 2207/20084 — Artificial neural networks [ANN]
    • G06T 2207/30004 — Biomedical image processing
    • G06T 2207/30201 — Face
    • G06V 2201/03 — Recognition of patterns in medical or anatomical images
    • Y02A 90/10 — ICT supporting adaptation to climate change


Abstract

The invention discloses a self-supervised pre-training method for remote physiological measurement based on a masked autoencoder, belonging to the technical field of computer vision. The invention proposes rPPG-MAE, which takes the space-time diagram (ST-Map) as input and uses a masked autoencoder (MAE) for self-supervised ViT pre-training. To our knowledge, this is the first work to explore self-supervised learning with ST-Map input on challenging rPPG tasks, such as the less-constrained VIPL-HR dataset. The invention designs a new rPPG loss function to constrain the MAE pre-training task. The proposed rPPG loss is more suitable for pre-training than the pixel reconstruction loss employed in the original MAE, enabling ViT to learn the periodic information of the rPPG signal efficiently. In addition to the original ST-Map, the invention explores several rPPG task-related reconstruction targets. An ST-Map with band-pass filtering is proposed, which limits the frequency to the range of the heart-rate signal and helps the network learn useful periodic information.

Description

Self-supervised pre-training method for remote physiological measurement based on a masked autoencoder
Technical Field
The invention relates to the technical field of computer vision, and in particular to a self-supervised pre-training method for remote physiological measurement based on a masked autoencoder.
Background
Heart rate (HR), heart rate variability (HRV), and respiration frequency (RF) carry a number of important indicators of human vital information. In the past, these physiological signals were typically measured by electrocardiography (ECG) and photoplethysmography (PPG). However, these conventional methods require direct contact with the body, which limits real-time monitoring of human vital information in sensorless environments. Contactless remote heart rate monitoring (rPPG), which analyses skin colour changes in facial videos of the subject without additional sensors, has therefore become a hot research topic.
In the early stages, many methods explored various hand-crafted properties of rPPG. In recent years, a number of end-to-end supervised models using two-dimensional/three-dimensional convolutional neural networks (CNNs) have been designed to extract rPPG features. Meanwhile, some research has developed non-end-to-end fully supervised methods that capture the rPPG signal from the space-time diagram (ST-Map). However, supervised learning requires a large amount of labelled data, and in the rPPG field the cost of collecting accurate large-scale labelled data is high. Thus, several self-supervised methods have been proposed to cope with this limitation; for example, Gideon and Stent propose a method with weak priors on the frequency and temporal smoothness of the target signal, while Sun and Li use a 3D CNN model to generate multiple rPPG signals from each video at different spatio-temporal positions, taking facial video frames as input, obtaining an rPPG representation, and directly predicting the rPPG signal. However, these methods are end-to-end and may not be robust in challenging situations (e.g., severe head movements).
Since the rPPG signal is very subtle, it is easily swamped by noise (e.g., illumination, motion, camera noise), and it is difficult to extract periodic information from raw video data in its original structure. This is why many successful rPPG methods still construct the neural network input in a specific way rather than using raw data directly, such as the space-time diagram (ST-Map), a spatio-temporal representation in which temporal physiological signals extracted from different regions of interest (ROIs) of the face are arranged as the model input. On the one hand, the ST-Map contains abundant physiological information and has been applied successfully in supervised learning methods. On the other hand, acquiring PPG/ECG signals while collecting large-scale face video data is expensive.
In recent years, self-supervised learning has become a hotspot in computer vision, and many methods have been proposed, such as self-supervised algorithms based on auxiliary tasks and models based on contrastive learning. Today, the more versatile denoising autoencoder has enjoyed tremendous success in both natural language processing (e.g., the masked auto-encoding of BERT) and computer vision (e.g., the masked autoencoder, MAE). In particular, MAE has proven effective on image analysis tasks such as image classification and object segmentation. rPPG is a typical computer vision task, and how to use a masked autoencoder to reduce the information redundancy and noise of the ST-Map and realise efficient rPPG measurement naturally becomes a focus of research. However, in early studies the masked autoencoder was used only to pre-train on natural images, such as the ImageNet dataset. In addition, a large gap exists between natural semantic images and the ST-Map.
In order to solve the above problems, the invention provides a self-supervised pre-training method for remote physiological measurement based on a masked autoencoder.
Disclosure of Invention
The invention aims to provide a self-supervised pre-training method for remote physiological measurement based on a masked autoencoder, so as to solve the following problems in the prior art:
(1) the masked autoencoder has so far been used only for pre-training on natural images, giving it a narrow range of application;
(2) there is a large gap between natural semantic images and the ST-Map:
2.1) the physical information contained in a natural semantic image differs from that contained in the ST-Map;
2.2) identifying valid information from the ST-Map is more difficult than from natural semantic images;
2.3) the goal of the proposed self-supervised pre-training is quite different from that of existing work.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
the self-supervision pre-training method for remote physiological measurement based on the mask self-encoder utilizes the advantages of the mask self-encoder on a space-time diagram and training ViT to design a novel shielding self-supervision rPPG measurement method, which comprises the following specific contents:
step 1, detecting faces in the video with the open-source face detection software SeetaFace, locating 81 facial key points, generating a face bounding box from the 81 key points, and using the bounding box to align the face region and remove the background region;
step 2, dividing each face video frame obtained in step 1 (with the background removed) into 25 regions of interest (ROIs) and computing the average pixel value of each colour channel (R, G, B) in each region; the average colour values of the same block across different frames are concatenated into sequences, sequences from the same colour channel are spliced into pictures, and a large space-time diagram (ST-Map) is thus generated from the face video;
step 3, cropping and resizing the large space-time diagram obtained in step 2 to obtain square space-time diagrams;
step 4, masking the square space-time diagram obtained in step 3 and computing the retained space-time diagram patches;
step 5, inputting the retained space-time diagram patches into the ViT encoder to generate encoded space-time diagram feature vectors;
step 6, inputting the space-time diagram feature vectors together with the mask tokens into the ViT decoder, which predicts the missing patches of the ST-Map;
step 7, computing the reconstruction loss between the pixel values of the predicted patches and the corresponding positions of the original space-time diagram, and training with the newly designed loss function;
step 8, taking the ViT encoder trained in step 7 and inputting an unmasked space-time diagram into it to generate complete space-time diagram feature vectors;
step 9, inputting the complete feature vectors into the rPPG predictor and outputting the predicted rPPG signal;
step 10, training the ViT encoder and the rPPG predictor based on steps 8 and 9;
and step 11, inputting a space-time diagram into the trained ViT encoder and rPPG predictor to obtain the prediction result.
Preferably, the cropping of the large space-time diagram in step 3 specifically includes the following: the large space-time diagram is cut into small space-time diagrams with a fixed overlapping stride (s = 5), the crop length is controlled to be 224, and each resulting rectangular (224 × 25) space-time diagram is resized into a 224 × 224 square space-time diagram $X \in \mathbb{R}^{224 \times 224 \times 3}$.
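To make the cropping concrete, the following minimal Python sketch implements step 3 under stated assumptions: the large ST-Map is a (25, T_all, 3) array, the stride and crop length take the values above (s = 5, length 224), and OpenCV's resize stands in for whatever interpolation the actual implementation uses; function and variable names are illustrative only.

    import numpy as np
    import cv2  # assumed dependency; any bilinear resize would do

    def crop_and_square(big_map: np.ndarray, length: int = 224, stride: int = 5) -> np.ndarray:
        """Cut a large (25, T_all, 3) ST-Map into overlapping clips of `length`
        frames, then resize each (25, 224, 3) clip into a (224, 224, 3) square."""
        n_rois, t_all, _ = big_map.shape
        squares = []
        for start in range(0, t_all - length + 1, stride):
            clip = big_map[:, start:start + length, :].astype(np.float32)
            squares.append(cv2.resize(clip, (length, length)))  # (224, 224, 3)
        return np.stack(squares)  # (number of clips, 224, 224, 3)

For example, a 1344-frame video yields (1344 − 224) / 5 + 1 = 225 square maps.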
Preferably, step 4 specifically includes the following:
step 4.1, dividing the space-time diagram into non-overlapping patches of size $P \times P$, with $P = 16$;
step 4.2, randomly shuffling the patches obtained in step 4.1;
step 4.3, keeping a fixed proportion of the shuffled patches in order, removing the remaining patches, and computing the number of retained patches by the formula:

$N_{keep} = (1 - R_m) \left( \frac{T}{P} \right)^2$

where $R_m$ denotes the mask ratio of the masking process, $R_m = 75\%$; $T$ denotes the side length of the (square) space-time diagram, $T = 224$; and $P$ denotes the side length of a (square) patch, $P = 16$.
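A minimal sketch of the masking in step 4, assuming the square map from step 3 and the patch side P = 16 used above (consistent with the 768-dimensional patch vectors appearing later); names are illustrative:

    import numpy as np

    def mask_st_map(square_map: np.ndarray, p: int = 16, mask_ratio: float = 0.75, seed: int = 0):
        """Steps 4.1-4.3: patchify, shuffle, keep (1 - R_m) * (T / P)^2 patches."""
        rng = np.random.default_rng(seed)
        t, _, c = square_map.shape                   # (224, 224, 3)
        n_side = t // p                              # 14 patches per side
        n_patches = n_side ** 2                      # 196
        n_keep = int((1 - mask_ratio) * n_patches)   # 49, matching the formula above
        patches = (square_map.reshape(n_side, p, n_side, p, c)
                   .transpose(0, 2, 1, 3, 4)
                   .reshape(n_patches, p * p * c))   # (196, 768)
        order = rng.permutation(n_patches)           # step 4.2: random shuffle
        keep_idx, mask_idx = order[:n_keep], order[n_keep:]
        return patches[keep_idx], keep_idx, mask_idx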
preferably, the ViT encoder in step 5 includes a linear mapping layer with position coding and a plurality of transducer modules; the invention selects ViT basic version, which comprises 12 transducer modules, and the output dimension is 768. The input at this stage is the reserved patch in step 4
Figure SMS_8
The output of the ViT encoder is:
Figure SMS_9
wherein ,
Figure SMS_10
L k andD e representing the length of the input ST-Map sequence and the ViT encoded dimension, respectively; />
Figure SMS_11
Representing input patch data,/->
Figure SMS_12
A ViT encoder is shown.
Preferably, the ViT decoder described in step 6 comprises 8 Transformer modules with an output dimension of 128. Owing to the added mask tokens, the output length after the ViT decoder equals the number of patches in the whole ST-Map; specifically:

$F_d = D_{ViT}(F_e), \quad F_d \in \mathbb{R}^{L_{all} \times D_d}$

where $L_{all}$ denotes the length of the entire ST-Map sequence; $D_d$ denotes the output dimension of the ViT decoder; $F_e$ is the output of the ViT encoder; and $D_{ViT}(\cdot)$ denotes the ViT decoder. The default dimension does not match the number of pixel values in a patch, so the last layer of the ViT decoder is designed as a linear projection, and the mask tokens are reshaped into patches to obtain the required reconstructed ST-Map.
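The tensor shapes through the encoder-decoder pair can be traced with a toy PyTorch sketch; here nn.TransformerEncoder stands in for the actual ViT blocks, and positional embeddings and token un-shuffling are omitted, so this is a dimensional sketch rather than the patented architecture:

    import torch
    import torch.nn as nn

    P, C, D_E, D_D = 16, 3, 768, 128   # patch side, channels, encoder/decoder dims
    L_ALL, L_K = 196, 49               # all patches vs. retained patches

    embed = nn.Linear(P * P * C, D_E)  # patch embedding
    encoder = nn.TransformerEncoder(   # 12 blocks, dimension 768 (ViT base)
        nn.TransformerEncoderLayer(D_E, nhead=12, batch_first=True), num_layers=12)
    enc_to_dec = nn.Linear(D_E, D_D)
    mask_token = nn.Parameter(torch.zeros(1, 1, D_D))
    decoder = nn.TransformerEncoder(   # 8 blocks, dimension 128
        nn.TransformerEncoderLayer(D_D, nhead=8, batch_first=True), num_layers=8)
    to_pixels = nn.Linear(D_D, P * P * C)  # final linear projection

    x_keep = torch.randn(1, L_K, P * P * C)  # retained patches from step 4
    f_e = encoder(embed(x_keep))             # F_e: (1, 49, 768)
    tokens = torch.cat([enc_to_dec(f_e), mask_token.expand(1, L_ALL - L_K, D_D)], dim=1)
    f_d = to_pixels(decoder(tokens))         # reconstructed patches: (1, 196, 768)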
Preferably, the output of the ViT decoder is a series of vectors whose dimension equals the number of pixels of a patch; the pixel loss function computes only the mean square error (MSE) between the reconstructed image and the original image over the masked pixels:

$\mathcal{L}_{pixel} = MSE(\hat{Y}_m, Y_m)$

where $\hat{Y}_m$ denotes the masked pixel values predicted by the ViT decoder; $Y_m$ denotes the true masked pixel values of the ST-Map; and $MSE(\cdot)$ denotes the mean square error.

The reconstruction loss described in step 7 specifically refers to: the ViT encoder is made to learn the periodic characteristics of the BVP signal by reconstructing a new ST-Map, the loss being expressed as:

$\mathcal{L}_{rppg} = \frac{1}{C \cdot N_{ROI}} \sum_{c=1}^{C} \sum_{i=1}^{N_{ROI}} \left( 1 - PC(\hat{x}_{c,i}, x_{c,i}) \right)$

where $\hat{x}_{c,i}$ and $x_{c,i}$ denote the pixel values of one row of the reconstructed ST-Map and of the real ST-Map, respectively; $PC(\cdot)$ denotes the Pearson correlation; $C$ and $N_{ROI}$ are the number of channels and the number of ROIs, respectively, where $N_{ROI} = T$.

In summary, the overall loss function of the reconstruction stage is:

$\mathcal{L}_{pre} = \mathcal{L}_{pixel} + \lambda \mathcal{L}_{rppg}$

where the hyperparameter $\lambda \in \{0, 1\}$.
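The two reconstruction-stage losses admit a short sketch; it assumes the reconstructed and true ST-Maps are given as (C, N_ROI, T) tensors with time along the last axis and the mask as a boolean (N_ROI, T) array, and it reads the rPPG loss as the mean of 1 − PC over all rows, which is one plausible realisation of the formula above:

    import torch

    def pearson(a: torch.Tensor, b: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
        """Pearson correlation along the last (time) axis."""
        a = a - a.mean(dim=-1, keepdim=True)
        b = b - b.mean(dim=-1, keepdim=True)
        return (a * b).sum(-1) / (a.norm(dim=-1) * b.norm(dim=-1) + eps)

    def pretrain_loss(pred_map, true_map, mask, lam: float = 1.0) -> torch.Tensor:
        """L_pre = L_pixel (MSE over masked pixels) + lambda * L_rppg (row-wise Pearson)."""
        l_pixel = ((pred_map - true_map) ** 2)[:, mask].mean()
        l_rppg = (1 - pearson(pred_map, true_map)).mean()  # averaged over C x N_ROI rows
        return l_pixel + lam * l_rppg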
Preferably, the input to the trained ViT encoder described in step 8 is the complete set of patches of the ST-Map $X$; the output of the trained ViT encoder is:

$F = E_{ViT}^{pre}(X), \quad F \in \mathbb{R}^{L_{all} \times D_e}$

where $L_{all}$ and $D_e$ denote the length of the entire ST-Map sequence and the dimension of the ViT encoder, respectively, and $E_{ViT}^{pre}(\cdot)$ denotes the pre-trained ViT encoder.
Preferably, the rPPG predictor described in step 9 consists of one simple linear layer (Linear) and layer normalization (LayerNorm).
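Concretely, the predictor can be as small as the following sketch; the 768-dimensional encoder features and the one-scalar-per-token read-out are assumptions about how the signal is extracted:

    import torch.nn as nn

    # One scalar per encoded token; the flattened outputs form the predicted rPPG sequence.
    rppg_predictor = nn.Sequential(nn.LayerNorm(768), nn.Linear(768, 1))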
Preferably, step 10 specifically includes the following:
step 10.1, supervising the predicted rPPG signal with the negative Pearson correlation loss computed between the predicted rPPG signal and the real BVP signal; specifically:

$\mathcal{L}_{time} = 1 - PC(S_{pr}, S_{gt})$

where $S_{pr}$ and $S_{gt}$ denote the predicted rPPG signal and the real BVP signal, respectively, and $PC(\cdot)$ denotes the Pearson correlation;

step 10.2, further constraining the prediction with a frequency-domain loss, computed as the cross-entropy error between the true heart rate and the spectral distribution of the estimated rPPG signal; specifically:

$\mathcal{L}_{fre} = CE(HR_{gt}, PSD(S_{pr}))$

where $PSD(\cdot)$ denotes the power spectral density of the predicted rPPG signal; $CE(\cdot)$ denotes the cross-entropy loss; $HR_{gt}$ refers to the true heart rate, expressed as a one-hot vector $HR = [0, \dots, 0, 1, 0, \dots]$ in which the "1" sits at the index corresponding to the true heart rate; and $S_{pr}$ denotes the predicted signal;

step 10.3, combining steps 10.1 and 10.2, the overall loss function of the rPPG prediction stage is:

$\mathcal{L}_{pred} = \mathcal{L}_{time} + \gamma \mathcal{L}_{fre}$

where the parameter $\gamma \in \{0, 1\}$ is adjusted between different datasets; in the present invention we set $\gamma = 0$ on the VIPL-HR dataset and $\gamma = 1$ on the PURE and UBFC-rPPG datasets.
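A minimal sketch of the two prediction-stage losses; treating the FFT power spectrum directly as logits for the cross-entropy is an assumption about how CE(HR_gt, PSD(S_pr)) is realised, and hr_bin is the index of the true heart-rate frequency bin:

    import torch
    import torch.nn.functional as F

    def negative_pearson_loss(s_pr: torch.Tensor, s_gt: torch.Tensor) -> torch.Tensor:
        """L_time = 1 - PC(S_pr, S_gt) for (B, T) signals."""
        a = s_pr - s_pr.mean(dim=-1, keepdim=True)
        b = s_gt - s_gt.mean(dim=-1, keepdim=True)
        pc = (a * b).sum(-1) / (a.norm(dim=-1) * b.norm(dim=-1) + 1e-8)
        return (1 - pc).mean()

    def frequency_loss(s_pr: torch.Tensor, hr_bin: torch.Tensor) -> torch.Tensor:
        """L_fre = cross-entropy between the true-HR bin and the signal's spectrum."""
        psd = torch.fft.rfft(s_pr, dim=-1).abs() ** 2  # (B, T // 2 + 1) frequency bins
        return F.cross_entropy(psd, hr_bin)            # PSD used as logits (assumption)

    def prediction_loss(s_pr, s_gt, hr_bin, gamma: float = 1.0) -> torch.Tensor:
        return negative_pearson_loss(s_pr, s_gt) + gamma * frequency_loss(s_pr, hr_bin)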
Compared with the prior art, the invention provides a self-supervised pre-training method for remote physiological measurement based on a masked autoencoder, with the following beneficial effects:
(1) The invention proposes rPPG-MAE, which takes the ST-Map as input and uses a masked autoencoder (MAE) for self-supervised ViT pre-training. This is the first work on the rPPG task to explore self-supervised learning with the ST-Map as input on the challenging VIPL-HR dataset.
(2) The invention designs a new rPPG loss function to constrain the MAE pre-training task. The proposed rPPG loss is more suitable for pre-training than the pixel reconstruction loss employed in the original MAE, enabling ViT to learn the periodic information of the rPPG signal efficiently.
(3) In addition to the original ST-Map, the invention explores several rPPG task-related reconstruction targets. An ST-Map with band-pass filtering is proposed, which limits the frequency to the BVP signal range and helps the network learn useful periodic information.
(4) The method is unsupervised and does not need expensive manual labelling of datasets; it is therefore more economical than other methods and worth popularising.
(5) The invention has a wide range of application and can be extended to other monitoring methods to further improve performance.
Drawings
FIG. 1 is a schematic diagram of space-time diagram (ST-Map) generation in the self-supervised pre-training method for remote physiological measurement based on a masked autoencoder;
FIG. 2 is a flow chart of the design framework of the self-supervised pre-training method for remote physiological measurement based on a masked autoencoder according to the invention;
FIG. 3 shows an input original diagram, a masking effect diagram, and a reconstruction effect diagram in embodiment 1 of the invention;
FIG. 4 is a graph comparing the predicted rPPG signal with the actual BVP signal in embodiment 1 of the invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments.
The invention provides a self-supervised pre-training method for remote physiological measurement based on a masked autoencoder, supported by National Natural Science Foundation of China project 62171309, "Human micro-gesture recognition and emotion analysis based on self-supervised learning", and mainly aims to solve the following problems in the prior art:
rPPG is a typical computer vision task, and how to use a masked autoencoder to reduce the information redundancy and noise of the ST-Map and realise efficient rPPG measurement naturally becomes a focus of research. However, in early studies the masked autoencoder was used only to pre-train on natural images, such as the ImageNet dataset. In addition, a large gap exists between natural semantic images and the ST-Map:
1) The two kinds of images contain different physical information. A natural image contains only spatial information, where a cluster of pixels represents one object, whereas the ST-Map is a representation of the physiological signal in both the spatial and temporal domains.
2) Identifying valid information from the ST-Map is more difficult than from natural images. Extracting the rPPG signal from the ST-Map is relatively difficult because of the presence of much uncorrelated noise and only subtle physiological signals in the ST-Map.
3) The goal of the proposed self-supervised pre-training is quite different from that of existing work. The main purpose of self-supervised pre-training in rPPG is not to predict the masked pixel values to reconstruct the image, but to predict an image containing periodic physiological information similar to that of the real ST-Map.
In response to the above problems, the invention proposes rPPG-MAE, which takes the ST-Map as input and uses a masked autoencoder (MAE) for self-supervised ViT pre-training. To our knowledge, this is the first work to explore self-supervised learning with ST-Map input on challenging rPPG tasks, such as the less-constrained VIPL-HR dataset. Meanwhile, the invention designs a new rPPG loss function to constrain the MAE pre-training task. The proposed rPPG loss is more suitable for pre-training than the pixel reconstruction loss employed in the original MAE, enabling ViT to learn the periodic information of the rPPG signal efficiently. In addition to the original ST-Map, the invention explores several rPPG task-related reconstruction targets. An ST-Map with band-pass filtering is proposed, which limits the frequency to the BVP signal range and helps the network learn useful periodic information. In addition, the method is unsupervised, does not require expensive manual labelling of datasets, and is cheaper than other methods. Furthermore, it can be extended to other monitoring methods to further improve performance.
Based on the above description, the self-supervised pre-training method for remote physiological measurement based on a masked autoencoder provided by the invention specifically comprises the following steps:
example 1:
The invention provides a self-supervised pre-training method for remote physiological measurement based on a masked autoencoder. Referring to FIG. 1, which contains the generation schematics of the four ST-Maps:
We first align the faces in different frames according to the detected key points, and then divide the face region into n ROI blocks R1, R2, ..., R25. An average colour value is computed for each colour channel in each block. The average colour values of the same block across different frames are concatenated into sequences, i.e., R1, G1, B1, R2, G2, B2, ..., R25, G25, B25. Sequences from the same colour channel are spliced into a map of size n × T (one per channel R, G, B), where n = 25. Further, the CHROM and POS signals are obtained with the CHROM and POS algorithms, and the filtered signal is obtained with a Butterworth band-pass filter with a pass band of [0.6, 3] Hz. Finally, the differently combined signals are spliced into four ST-Maps (CHROM-ST-Map, POS-ST-Map, Filter-ST-Map, ST-Map).
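The per-frame ROI averaging and the band-pass variant can be sketched as follows; a uniform 5 × 5 grid over aligned face crops stands in for the landmark-derived ROI blocks (the real pipeline uses the 81 key points), and scipy supplies the Butterworth filter:

    import numpy as np
    from scipy.signal import butter, filtfilt

    def st_map_from_faces(faces: np.ndarray, grid: int = 5) -> np.ndarray:
        """faces: (T, H, W, 3) aligned crops, H and W divisible by `grid`.
        Returns an (n_rois, T, 3) ST-Map of per-ROI channel means (n_rois = 25)."""
        t, h, w, c = faces.shape
        blocks = faces.reshape(t, grid, h // grid, grid, w // grid, c)
        means = blocks.mean(axis=(2, 4))               # (T, 5, 5, 3)
        return means.reshape(t, grid * grid, c).transpose(1, 0, 2)

    def bandpass_rows(st_map: np.ndarray, fs: float = 30.0,
                      lo: float = 0.6, hi: float = 3.0, order: int = 4) -> np.ndarray:
        """Butterworth band-pass in the heart-rate range [0.6, 3] Hz along time,
        as used to build the Filter-ST-Map."""
        b, a = butter(order, [lo / (fs / 2), hi / (fs / 2)], btype="band")
        return filtfilt(b, a, st_map, axis=1)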
The overall design flow of the invention is shown in FIG. 2 and can be divided into three modules:
1) ST-Map generation module. We denote the i-th input ST-Map as $X_i \in \mathbb{R}^{N \times T \times C}$, where N represents the number of ROIs, T represents the number of frames of a video segment, and C represents the number of channels (C = 3, comprising R, G and B). As shown in FIG. 2, we first generate a large ST-Map from the entire video, and then cut it into segments of length T with an overlap of s frames, so that the number of ST-Maps obtained from a video of $T_{all}$ frames is $\lfloor (T_{all} - T)/s \rfloor + 1$. After that, we resize the original ST-Map to $T \times T \times C$, so that the number of ROIs increases from N to T.
2) ST-Map reconstruction module. The reconstruction module consists mainly of a ViT encoder and a ViT decoder. The ViT encoder comprises a linear mapping layer with position encoding and a number of Transformer modules. The invention selects the ViT base version, which contains 12 Transformer modules with an output dimension of 768; the input at this stage is the set of retained patches $X_{keep}$ from step 4, and the output of the ViT encoder is $F_e = E_{ViT}(X_{keep}) \in \mathbb{R}^{L_k \times D_e}$, where $L_k$ and $D_e$ denote the length of the input ST-Map sequence and the ViT encoding dimension, respectively. The ViT decoder comprises 8 Transformer modules with an output dimension of 128. Owing to the added mask tokens, the output length after the ViT decoder equals the number of patches in the entire ST-Map; the output of the ViT decoder is $F_d = D_{ViT}(F_e) \in \mathbb{R}^{L_{all} \times D_d}$, where $L_{all}$ denotes the length of the entire ST-Map sequence and $D_d$ denotes the output dimension of the ViT decoder. However, the default dimension does not match the number of pixel values in a patch, so a linear projection is designed at the last layer of the decoder. In this way the mask tokens are reshaped into patches, and we can then obtain the desired reconstructed ST-Map. The loss is computed between the reconstructed ST-Map and the true ST-Map. The output of the ViT decoder is a series of vectors whose dimension equals the number of pixels of a patch; the pixel loss function computes only the mean square error (MSE) between the reconstructed image and the original image over the masked pixels:

$\mathcal{L}_{pixel} = MSE(\hat{Y}_m, Y_m)$

where $\hat{Y}_m$ denotes the masked pixel values predicted by the ViT decoder, $Y_m$ denotes the true masked pixel values of the ST-Map, and $MSE(\cdot)$ is the mean square error.
In order for the ViT encoder to learn the periodicity of the BVP signal, the invention proposes a new loss function:

$\mathcal{L}_{rppg} = \frac{1}{C \cdot N_{ROI}} \sum_{c=1}^{C} \sum_{i=1}^{N_{ROI}} \left( 1 - PC(\hat{x}_{c,i}, x_{c,i}) \right)$

where $\hat{x}_{c,i}$ and $x_{c,i}$ denote the pixel values of one row of the reconstructed ST-Map and of the real ST-Map, respectively; $PC(\cdot)$ denotes the Pearson correlation; $C$ and $N_{ROI}$ are the number of channels and the number of ROIs, respectively, where $N_{ROI} = T$.

In summary, the overall loss function of the reconstruction stage can be written as:

$\mathcal{L}_{pre} = \mathcal{L}_{pixel} + \lambda \mathcal{L}_{rppg}$

where the hyperparameter $\lambda \in \{0, 1\}$.
3) rPPG prediction module: this module consists of a ViT encoder and an rPPG predictor. The initialization weights of the ViT encoder in this module come from the pre-training performed during ST-Map reconstruction. The rPPG predictor consists of a linear layer (Linear) and layer normalization (LayerNorm). The original ST-Map is fed into the ViT encoder and then into the rPPG predictor, whose output is the predicted rPPG signal. The loss is computed between the predicted rPPG signal and the true BVP signal:

The negative Pearson correlation loss computed between the predicted rPPG signal and the real BVP signal can be expressed as:

$\mathcal{L}_{time} = 1 - PC(S_{pr}, S_{gt})$

where $S_{pr}$ and $S_{gt}$ denote the predicted rPPG signal and the real BVP signal, respectively, and $PC(\cdot)$ denotes the Pearson correlation.

In addition, a frequency-domain loss is used for better prediction; the cross-entropy error between the true heart rate and the spectral distribution of the estimated rPPG signal is computed as:

$\mathcal{L}_{fre} = CE(HR_{gt}, PSD(S_{pr}))$

where $PSD(\cdot)$ denotes the power spectral density of the predicted rPPG signal and $CE(\cdot)$ denotes the cross-entropy loss; $HR_{gt}$, the true heart rate, is expressed as a one-hot vector $HR = [0, \dots, 0, 1, 0, \dots]$ in which the "1" sits at the index corresponding to the true heart rate; $S_{pr}$ denotes the predicted signal.

In general, the overall loss function of the rPPG prediction stage can be written as:

$\mathcal{L}_{pred} = \mathcal{L}_{time} + \gamma \mathcal{L}_{fre}$

where the parameter $\gamma \in \{0, 1\}$ is adjusted between different datasets. In the present invention we set $\gamma = 0$ on the VIPL-HR dataset and $\gamma = 1$ on the PURE and UBFC-rPPG datasets.
Overall, the method is divided into three major steps: 1) generating the ST-Map; 2) reconstructing the ST-Map; 3) predicting the rPPG signal. The first step prepares the input data for the two subsequent parts; reconstructing the ST-Map pre-trains the ViT encoder weight parameters used for predicting the rPPG signal; and the required rPPG signal is finally obtained by the rPPG prediction module.
FIG. 3 is a visualization of the ST-Map reconstruction step; the reconstructed ST-Map is as close as possible to the original ST-Map.
FIG. 4 is a visualization of the predicted rPPG signal, which can be observed to be very close to the real BVP signal.
The present invention is not limited to the above-mentioned embodiments, and any person skilled in the art, based on the technical solution of the present invention and the inventive concept thereof, can be replaced or changed within the scope of the present invention.

Claims (9)

1. A self-supervised pre-training method for remote physiological measurement based on a masked autoencoder, characterized in that a novel masked self-supervised rPPG measurement method is designed by exploiting the advantages of the masked autoencoder on the space-time diagram and by training ViT, specifically comprising the following steps:
step 1, detecting faces in the video with face detection software, locating facial key points in the video, generating a face bounding box from the key points, and using the bounding box to align the face region and remove the background region;
step 2, dividing each face video frame obtained in step 1, aligned and with the background removed, into a plurality of regions of interest, and computing the average pixel value of each colour channel in each region; the average colour values of the same block across different frames are concatenated into sequences, and sequences from the same colour channel are spliced into pictures, so that a large space-time diagram is generated from the face video;
step 3, cropping and resizing the large space-time diagram obtained in step 2 to obtain square space-time diagrams;
step 4, masking the square space-time diagram obtained in step 3 and computing the retained space-time diagram patches;
step 5, inputting the retained space-time diagram patches into the ViT encoder to generate encoded space-time diagram feature vectors;
step 6, inputting the space-time diagram feature vectors together with the mask tokens into the ViT decoder, which predicts the missing patches of the space-time diagram;
step 7, computing the reconstruction loss between the pixel values of the predicted patches and the corresponding positions of the original space-time diagram, and training with the newly designed loss function;
step 8, taking the ViT encoder trained in step 7 and inputting an unmasked space-time diagram into it to generate complete space-time diagram feature vectors;
step 9, inputting the complete feature vectors into the rPPG predictor and outputting the predicted rPPG signal;
step 10, training the ViT encoder and the rPPG predictor based on steps 8 and 9;
and step 11, inputting a space-time diagram into the trained ViT encoder and rPPG predictor to obtain the prediction result.
2. The self-supervised pre-training method for remote physiological measurement based on a masked autoencoder according to claim 1, characterized in that the cropping of the large space-time diagram in step 3 specifically comprises the following: the large space-time diagram is cut into small space-time diagrams with a fixed overlapping stride, the crop length is controlled to be 224, and the resulting rectangular space-time diagrams are resized into 224 × 224 square space-time diagrams.
3. The self-supervised pre-training method for remote physiological measurement based on a masked autoencoder according to claim 1, characterized in that step 4 specifically comprises the following:
step 4.1, dividing the space-time diagram into non-overlapping patches;
step 4.2, randomly shuffling the patches obtained in step 4.1;
step 4.3, keeping a fixed proportion of the patches in order, removing the remaining patches, and computing the number of retained patches by the formula:

$N_{keep} = (1 - R_m) \left( \frac{T}{P} \right)^2$

where $R_m$ denotes the mask ratio of the masking process, $T$ denotes the side length of the space-time diagram, and $P$ denotes the side length of a patch.
4. The self-supervised pre-training method for remote physiological measurement based on a masked autoencoder according to claim 1, characterized in that the ViT encoder in step 5 comprises a linear mapping layer with position encoding and a number of Transformer modules; the output of the ViT encoder is:

$F_e = E_{ViT}(X_{keep}), \quad F_e \in \mathbb{R}^{L_k \times D_e}$

where $L_k$ and $D_e$ denote the length of the input space-time diagram sequence and the ViT encoding dimension, respectively; $X_{keep}$ denotes the input patch data; and $E_{ViT}(\cdot)$ denotes the ViT encoder.
5. The self-supervised pre-training method for remote physiological measurement based on a masked autoencoder according to claim 1, characterized in that the output length after the ViT decoder in step 6 equals the number of patches in the whole space-time diagram; specifically:

$F_d = D_{ViT}(F_e), \quad F_d \in \mathbb{R}^{L_{all} \times D_d}$

where $L_{all}$ denotes the length of the entire space-time diagram sequence; $D_d$ denotes the output dimension of the ViT decoder; $F_e$ is the output of the ViT encoder; and $D_{ViT}(\cdot)$ denotes the ViT decoder; the final layer of the ViT decoder is designed as a linear projection, and the mask tokens are reshaped into patches, thereby obtaining the required reconstructed space-time diagram.
6. The self-supervised pre-training method for remote physiological measurement based on a masked autoencoder according to claim 1, characterized in that the output of the ViT decoder is a series of vectors whose dimension equals the number of pixels of a patch, and the pixel loss function computes only the mean square error between the reconstructed image and the original image over the masked pixels; specifically:

$\mathcal{L}_{pixel} = MSE(\hat{Y}_m, Y_m)$

where $\hat{Y}_m$ denotes the masked pixel values predicted by the ViT decoder; $Y_m$ denotes the true masked pixel values of the space-time diagram; and $MSE(\cdot)$ denotes the mean square error;

the reconstruction loss described in step 7 specifically refers to: the ViT encoder is made to learn the periodic characteristics of the BVP signal by reconstructing a new space-time diagram, the loss being expressed as:

$\mathcal{L}_{rppg} = \frac{1}{C \cdot N_{ROI}} \sum_{c=1}^{C} \sum_{i=1}^{N_{ROI}} \left( 1 - PC(\hat{x}_{c,i}, x_{c,i}) \right)$

where $\hat{x}_{c,i}$ and $x_{c,i}$ denote the pixel values of one row of the reconstructed space-time diagram and of the real space-time diagram, respectively; $PC(\cdot)$ denotes the Pearson correlation; $C$ and $N_{ROI}$ are the number of channels and the number of ROIs, respectively, where $N_{ROI} = T$;

in summary, the overall loss function of the reconstruction stage is:

$\mathcal{L}_{pre} = \mathcal{L}_{pixel} + \lambda \mathcal{L}_{rppg}$

where the hyperparameter $\lambda \in \{0, 1\}$.
7. The self-supervised pre-training method for remote physiological measurement based on a masked autoencoder according to claim 1, characterized in that the input to the trained ViT encoder in step 8 is the complete set of patches of a space-time diagram; the output of the trained ViT encoder is:

$F = E_{ViT}^{pre}(X), \quad F \in \mathbb{R}^{L_{all} \times D_e}$

where $L_{all}$ and $D_e$ denote the length of the entire ST-Map sequence and the dimension of the ViT encoder, respectively; $X$ denotes the complete input data; and $E_{ViT}^{pre}(\cdot)$ denotes the pre-trained ViT encoder.
8. The self-supervised pre-training method for remote physiological measurement based on a masked autoencoder according to claim 1, characterized in that the rPPG predictor in step 9 consists of one simple linear layer and layer normalization.
9. The self-supervised pre-training method for remote physiological measurement based on a masked autoencoder according to claim 1, characterized in that step 10 specifically comprises the following:
step 10.1, supervising the predicted rPPG signal with the negative Pearson correlation loss computed between the predicted rPPG signal and the real BVP signal; specifically:

$\mathcal{L}_{time} = 1 - PC(S_{pr}, S_{gt})$

where $S_{pr}$ and $S_{gt}$ denote the predicted rPPG signal and the real BVP signal, respectively, and $PC(\cdot)$ denotes the Pearson correlation;

step 10.2, further constraining the prediction with a frequency-domain loss, computed as the cross-entropy error between the true heart rate and the spectral distribution of the estimated rPPG signal; specifically:

$\mathcal{L}_{fre} = CE(HR_{gt}, PSD(S_{pr}))$

where $PSD(\cdot)$ denotes the power spectral density of the predicted rPPG signal; $CE(\cdot)$ denotes the cross-entropy loss; $HR_{gt}$ refers to the true heart rate, expressed as a one-hot vector $HR = [0, \dots, 0, 1, 0, \dots]$ in which the "1" sits at the index corresponding to the true heart rate; and $S_{pr}$ denotes the predicted signal;

step 10.3, combining steps 10.1 and 10.2, the overall loss function of the rPPG prediction stage is:

$\mathcal{L}_{pred} = \mathcal{L}_{time} + \gamma \mathcal{L}_{fre}$

where the parameter $\gamma \in \{0, 1\}$ is adjusted between different datasets.
CN202310445533.XA 2023-04-24 2023-04-24 Self-supervision pre-training method for remote physiological measurement based on mask self-encoder Active CN116385837B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310445533.XA CN116385837B (en) 2023-04-24 2023-04-24 Self-supervision pre-training method for remote physiological measurement based on mask self-encoder


Publications (2)

Publication Number Publication Date
CN116385837A (en) 2023-07-04
CN116385837B (en) 2023-09-08

Family

ID=86967482

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310445533.XA Active CN116385837B (en) 2023-04-24 2023-04-24 Self-supervision pre-training method for remote physiological measurement based on mask self-encoder

Country Status (1)

Country Link
CN (1) CN116385837B (en)


Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210224983A1 (en) * 2018-05-16 2021-07-22 Mitsubishi Electric Research Laboratories, Inc. System and Method for Remote Measurements of Vital Signs of a Person in a Volatile Environment
US11227161B1 (en) * 2021-02-22 2022-01-18 Institute Of Automation, Chinese Academy Of Sciences Physiological signal prediction method
CN112580612A (en) * 2021-02-22 2021-03-30 中国科学院自动化研究所 Physiological signal prediction method
CN113343821A (en) * 2021-05-31 2021-09-03 合肥工业大学 Non-contact heart rate measurement method based on space-time attention network and input optimization
CN115841143A (en) * 2021-09-20 2023-03-24 辉达公司 Joint estimation of heart rate and respiration rate using neural networks
CN114821439A (en) * 2022-05-10 2022-07-29 合肥中聚源智能科技有限公司 Token learning-based face video heart rate estimation system and method
CN114912487A (en) * 2022-05-10 2022-08-16 合肥中聚源智能科技有限公司 End-to-end remote heart rate detection method based on channel enhanced space-time attention network
CN115024706A (en) * 2022-05-16 2022-09-09 南京邮电大学 Non-contact heart rate measurement method integrating ConvLSTM and CBAM attention mechanism
CN115331073A (en) * 2022-07-26 2022-11-11 华中师范大学 Image self-supervision learning method based on TransUnnet architecture
CN115311728A (en) * 2022-09-06 2022-11-08 杭州登虹科技有限公司 ViT network-based multi-stage training method for face key point detection model
CN115590515A (en) * 2022-09-28 2023-01-13 上海零唯一思科技有限公司(Cn) Emotion recognition method and system based on generative self-supervision learning and electroencephalogram signals
CN115497143A (en) * 2022-10-10 2022-12-20 南京大学 Non-contact multi-modal physiological signal detection method based on self-supervision and lifelong learning
CN115813408A (en) * 2022-11-25 2023-03-21 华中科技大学 Self-supervision learning method of transform encoder for electroencephalogram signal classification task

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
He, Kaiming et al., "Masked Autoencoders Are Scalable Vision Learners", arXiv *
Jaiswal, Kokila Bharti et al., "Heart rate estimation network from facial videos using spatiotemporal feature image", Computers in Biology and Medicine *
Zhao, Changchen et al., "Face detection and tracking for remote photoplethysmography" (面向远程光体积描记的人脸检测与跟踪), Journal of Image and Graphics (中国图象图形学报) *

Also Published As

Publication number Publication date
CN116385837B (en) 2023-09-08


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant