CN115374815A - Automatic sleep staging method based on visual Transformer

Automatic sleep staging method based on visual Transformer

Info

Publication number
CN115374815A
Authority
CN
China
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210965248.6A
Other languages
Chinese (zh)
Inventor
任延珍 (Ren Yanzhen)
彭荔 (Peng Li)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University WHU filed Critical Wuhan University WHU
Priority to CN202210965248.6A priority Critical patent/CN115374815A/en
Publication of CN115374815A publication Critical patent/CN115374815A/en
Pending legal-status Critical Current

Classifications

    • A: HUMAN NECESSITIES
    • A61: MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61B: DIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B 5/00: Measuring for diagnostic purposes; Identification of persons
    • A61B 5/4806: Sleep evaluation
    • A61B 5/4812: Detecting sleep stages or cycles
    • A61B 5/7203: Signal processing specially adapted for physiological signals, for noise prevention, reduction or removal
    • A61B 5/7257: Details of waveform analysis characterised by using Fourier transforms
    • A61B 5/7264: Classification of physiological signals or data, e.g. using neural networks, statistical classifiers, expert systems or fuzzy systems
    • A61B 5/7267: Classification involving training the classification device
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • Y02D 30/70: Reducing energy consumption in wireless communication networks


Abstract

The invention discloses an automatic sleep staging method based on a visual Transformer. The method processes the original PSG signal with a sliding window to obtain PSG signal sequences; applies data enhancement to the PSG signal sequences to obtain enhanced signal samples; builds a sleep staging network by cascading a visual Transformer frame-level encoder, a bidirectional GRU sequence-level encoder, and a softmax layer; inputs each group of PSG signal samples into the network to predict its sleep stage; initializes the network by cross-modal transfer learning; constructs a loss function from the true sleep stages of the PSG signal samples; and trains with the Adam optimizer to obtain the optimized sleep staging network. PSG signals are then acquired in real time, and the sleep stage is predicted by passing the PSG signal samples through the optimized sleep staging network. Data enhancement is designed around the noise and artifacts of the PSG signal, improving the network's robustness to them; an encoder based on a visual Transformer is introduced to improve the network's feature representation capability; and transfer learning relieves the dependence on large amounts of PSG data.

Description

Automatic sleep staging method based on visual Transformer
Technical Field
The invention belongs to the technical field of sleep quality assessment, and particularly relates to an automatic sleep staging method based on a visual Transformer.
Background
Polysomnography (PSG) is the clinical standard technology for sleep-state monitoring. It comprehensively records physiological indicators of the monitored subject during sleep, including neural signals such as the electroencephalogram (EEG), electrooculogram (EOG) and electromyogram (EMG), as well as respiratory monitoring data such as oronasal airflow, thoracic and abdominal pressure, and blood oxygen saturation, and serves as an effective basis for evaluating sleep quality and diagnosing sleep disorders. However, sleep physiological signal analysis has long depended on sleep experts manually inspecting the polysomnogram; this is inefficient and labor-intensive, and subjective differences in expert judgment introduce errors into the assessment. There is therefore a need for a robust, high-performance automated sleep staging tool to assist physicians, improving both the efficiency and the accuracy of sleep staging.
Automated sleep staging is the basis for scaling up sleep assessment and diagnosis, serving the millions of people with sleep disorders and making sleep monitoring feasible in the home environment. Although existing automatic sleep staging models achieve good performance and can exceed the staging accuracy of a single human expert, several problems remain to be solved:
Visual Transformers can extract effective feature representations, but their performance on PSG signals has not been explored. For the Fourier-transformed PSG signal, positional information along both the time and frequency axes is crucial, yet recent sleep staging methods consider only the positional relationship along the time axis. A visual Transformer can capture positional information on the time and frequency axes simultaneously, remedying this shortcoming of existing models.
Transformer-based deep learning methods require large amounts of training data to surpass the performance of CNNs. Current Transformer-based automated sleep staging models perform well when pre-trained on large-scale PSG datasets, but their staging accuracy drops significantly on smaller datasets. Large-scale, accurately labeled PSG datasets are difficult to obtain, and training a model from scratch on them consumes substantial computational resources.
Existing automatic sleep staging models have low robustness to noise and artifacts. Owing to human factors and the acquisition environment, noise and artifacts are difficult to avoid in the PSG signal. Research into data enhancement modules designed for PSG signals remains quite limited; most work directly adopts enhancement techniques from image and audio tasks without considering the characteristics of the PSG signal itself.
There is therefore a substantial need to develop a robust automatic sleep staging technique based on visual Transformers.
Disclosure of Invention
To address the limited performance of feature representations in the sleep staging task, the scarcity of reliable PSG datasets, and the poor robustness of models to PSG signal noise and artifacts, the invention introduces a visual Transformer-based encoder, relieves the dependence on large amounts of PSG data through transfer learning, and designs a data enhancement module targeted at the noise and artifacts of the PSG signal, thereby learning high-performance, highly robust feature representations.
The model rests on three key ideas: a frame-level encoder based on a visual Transformer, with a sliding window to capture short-term context and a GRU (Gated Recurrent Unit) for long-term sequence-level modeling; cross-modal transfer learning, fine-tuning a model pre-trained on an out-of-domain dataset on the sleep PSG dataset to reduce the dependence on large-scale PSG data; and a dynamic data enhancement module for the EEG and EOG channels, enabling the model to learn more robust feature representations.
The method is an automatic sleep staging method based on a visual Transformer, with the following specific steps:
step 1: introducing original PSG signals of a plurality of channels at a plurality of moments, and processing the original PSG signals of each channel at the plurality of moments through a sliding window to obtain a plurality of PSG signal sequences of each channel;
step 2: obtaining multiple groups of PSG signal samples after data enhancement processing of each channel by carrying out data enhancement processing on multiple PSG signal sequences of each channel, constructing each group of PSG signal samples through the same group of data-enhanced PSG signal samples of the multiple channels, and manually marking the real sleep stage of each group of PSG signal samples;
Step 3: sequentially cascade a visual Transformer frame-level encoder, a bidirectional GRU sequence-level encoder and a softmax layer to construct a sleep staging network; input each group of PSG signal samples into the network to obtain its predicted sleep stage; initialize the network through cross-modal transfer learning; construct the network's loss function from the true sleep stages of each group of samples; and train with the Adam optimizer to obtain the optimized sleep staging network;
Step 4: collect PSG signals at multiple moments in real time, obtain real-time PSG signal samples through the sliding window processing of step 1, and predict the real-time sleep stage by passing the real-time samples through the optimized sleep staging network.
Preferably, the original PSG signals of each channel at multiple moments in step 1 are defined as:
data_c = (data_{c,1}, data_{c,2}, ..., data_{c,L})
c ∈ [1, C]
where data_c represents the original PSG signals of the c-th channel at multiple moments, data_{c,n} represents the original PSG signal of the c-th channel at the n-th moment, n ∈ [1, L], L is the number of original moments, and C is the number of channels;
the window coverage range of the sliding window processing in the step 1 is as follows: (n- (T) 0 -1)/2) to (n + (T) 0 -1)/2);
The window length of the sliding window processing in the step 1 is as follows: t is a unit of 0
The multiple PSG signal sequences of each channel in step 1 specifically include:
Sdata_c = (S_{c,1}, S_{c,2}, ..., S_{c,T_1})
c ∈ [1, C]
where Sdata_c represents the PSG signals in the sliding windows at multiple moments of the c-th channel, S_{c,i} represents the PSG signal in the sliding window at the i-th moment of the c-th channel, i ∈ [1, T_1], T_1 is the number of PSG signal sequences, and C is the number of channels;
preferably, the data enhancement processing in step 2 specifically includes:
signal denoising, signal channel interference, signal additive noise, and signal frequency masking, each applied independently with a certain random probability;
The multiple groups of data-enhanced PSG signal samples of each channel in step 2 are specifically:
Sdata'_c = (dS'_{c,1}, dS'_{c,2}, ..., dS'_{c,T_1})
c ∈ [1, C]
where Sdata'_c represents the data-enhanced PSG signals at multiple moments of the c-th channel, dS'_{c,m} represents the m-th data-enhanced PSG signal sample of the c-th channel, m ∈ [1, T_1], and T_1 is the number of data-enhanced PSG signal samples;
Step 2 constructs each group of PSG signal samples from the same group of data-enhanced PSG signal samples across the channels, specifically:
S'_i = (dS'_{1,i}, dS'_{2,i}, ..., dS'_{C,i})
i ∈ [1, T_1]
where S'_i represents the i-th group of PSG signal samples, T_1 is the number of PSG signal samples, and C is the number of channels;
preferably, the visual Transformer frame-level encoder in step 3 is formed by sequentially cascading a time-frequency transform layer, a time-frequency spectrum partitioning layer, a linear projection layer, a position encoding layer, a multi-head attention layer, a full connection layer and a token connection layer;
the time frequency conversion layer converts the ith group of PSG signal samples S' i Calculating a short-time Fourier transform time-frequency spectrum of the ith group of PSG signal samples through short-time Fourier transform, and expressing the short-time Fourier transform time-frequency spectrum as
F i =(dF 1,i ,dF 2,i ,...dF C,i )
i∈[1,T 1 ]
Wherein, F i Representing the short-time Fourier transform time-frequency spectrum, T, of the ith set of PSG signal samples 1 Representing the number of PSG signal samples, C representing the number of channels, dF c,i A short-time Fourier transform time-frequency spectrum representing the ith set of PSG signal samples of the c channel;
dF_{1,i}, dF_{2,i}, ..., dF_{C,i} are spliced along the frequency axis to obtain the spliced time-frequency spectrum X_{fft,i} of the i-th group of PSG signal samples; X_{fft,i} is log-transformed to obtain the spliced log time-frequency spectrum of the i-th group of PSG signal samples, which is then normalized by a normal-distribution (z-score) method to obtain the normalized time-frequency spectrum X'_{fft,i} of the i-th group of PSG signal samples;
The time-frequency spectrum partitioning layer divides the normalized time-frequency spectrum X'_{fft,i} of the i-th group of PSG signal samples into a sequence of N patches of size p × p, expressed as the partitioned time-frequency spectrum of the i-th group of PSG signal samples:
X_i = (x_{1,i}, x_{2,i}, ..., x_{n,i}, ..., x_{N,i})
n ∈ [1, N]
where x_{n,i} represents the n-th patch in the partitioned time-frequency spectrum of the i-th group of PSG signal samples, and N is the total number of patches;
The linear projection layer converts each patch of the partitioned time-frequency spectrum of the i-th group of PSG signal samples, in order, into the patch vector sequence of the i-th group of PSG signal samples, defined as:
E_i = (E_{i,1}, E_{i,2}, ..., E_{i,N})
where E_{i,n} represents the n-th patch vector of the i-th group of PSG signal samples, and N is the total number of patches;
The position encoding layer adds a randomly initialized position embedding to each patch vector to obtain the encoded feature sequence of the i-th group of PSG signal samples, defined as:
z_{i,n} = E_{i,n} + P_{i,n}
n ∈ [1, N]
where P_{i,n} is the position embedding of the n-th patch of the i-th group of PSG signal samples, N is the total number of patches in the partitioned time-frequency spectrum, and z_{i,n} represents the encoded feature of the n-th patch vector of the i-th group of PSG signal samples;
The Transformer input feature sequence of the i-th group of PSG signal samples is constructed as:
Z_i^0 = (x_cls + P_{i,0}, z_{i,1}, z_{i,2}, ..., z_{i,N})
where P_{i,0} is the position embedding of the [CLS] token of the i-th group of PSG signal samples, x_cls is the learnable [CLS] token prepended to the sequence, and z_{i,n} represents the encoded feature of the n-th patch in the partitioned time-frequency spectrum of the i-th group of PSG signal samples;
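As a rough illustration of the linear projection, [CLS] token, and position encoding steps above, the following numpy sketch uses hypothetical dimensions (N, p, d are illustrative; the patent does not specify them):

```python
import numpy as np

rng = np.random.default_rng(0)

def embed_patches(patches, W_proj, pos_embed, cls_token):
    """Linearly project flattened patches, prepend a [CLS] token,
    and add position embeddings (shapes are illustrative)."""
    E = patches @ W_proj                  # (N, d) patch vectors E_{i,n}
    z = np.vstack([cls_token, E])         # prepend learnable [CLS] token
    return z + pos_embed                  # add position embeddings P_{i,n}

N, p, d = 8, 4, 16                        # illustrative sizes
patches = rng.normal(size=(N, p * p))     # flattened p x p patches
W_proj = rng.normal(size=(p * p, d))      # linear projection weights
pos_embed = rng.normal(size=(N + 1, d))   # one embedding per token incl. [CLS]
cls_token = rng.normal(size=(1, d))

Z0 = embed_patches(patches, W_proj, pos_embed, cls_token)
print(Z0.shape)  # (9, 16): N patches plus the [CLS] token
```

In a trained model, W_proj, pos_embed, and cls_token would be learned parameters rather than random draws.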
The multi-head attention and fully connected layers process Z_i^0 through a multilayer Transformer encoder to obtain the output feature sequence of the i-th group of PSG signal samples:
O_i = (o_{cls,i}, o_{i,1}, o_{i,2}, ..., o_{i,N})
where O_i represents the output feature sequence of the i-th group of PSG signal samples, o_{cls,i} represents the output [CLS] token of the i-th group of PSG signal samples, and o_{i,n} represents the output feature of the n-th patch of the i-th group of PSG signal samples;
Defining the length of a target sleep frame as N_0, the output feature vector sequence of the i-th group of PSG signal samples is constructed as:
D_i = (o_{i,1}, o_{i,2}, ..., o_{i,N_0})
The token connection layer concatenates o_{cls,i} with the mean of D_i to obtain the single-sleep-frame feature of the i-th group of PSG signal samples, defined as:
f_i = Concat(o_{cls,i}, mean(D_i))
where o_{i,n} represents the output feature of the n-th patch of the i-th group of PSG signal samples and Concat denotes splicing;
(f_1, f_2, ..., f_{T_1}) is defined as the feature sequence of single sleep frames;
The bidirectional GRU sequence-level encoder of step 3 converts the feature sequence of single sleep frames (f_1, f_2, ..., f_{T_1}) into the sequence-level feature vector sequence (g_1, g_2, ..., g_{T_1});
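A minimal numpy sketch of the bidirectional GRU pass is given below. The cell follows the standard GRU equations (update gate z, reset gate r, candidate state); dimensions and the random parameters are illustrative, not from the patent:

```python
import numpy as np

def gru_cell(x, h, W, U, b):
    """One GRU step: update gate z, reset gate r, candidate state h~."""
    z = 1 / (1 + np.exp(-(W[0] @ x + U[0] @ h + b[0])))   # update gate
    r = 1 / (1 + np.exp(-(W[1] @ x + U[1] @ h + b[1])))   # reset gate
    h_tilde = np.tanh(W[2] @ x + U[2] @ (r * h) + b[2])   # candidate state
    return (1 - z) * h + z * h_tilde

def bi_gru(seq, params_f, params_b, d_h):
    """Run a forward and a backward GRU over the frame features and
    concatenate their hidden states at each time step."""
    hf, hb = np.zeros(d_h), np.zeros(d_h)
    fwd, bwd = [], []
    for x in seq:                       # forward pass f_1 .. f_T
        hf = gru_cell(x, hf, *params_f)
        fwd.append(hf)
    for x in reversed(seq):             # backward pass f_T .. f_1
        hb = gru_cell(x, hb, *params_b)
        bwd.append(hb)
    return [np.concatenate([f, b]) for f, b in zip(fwd, reversed(bwd))]

rng = np.random.default_rng(1)
d_in, d_h, T = 6, 4, 5                  # illustrative sizes
mk = lambda: (rng.normal(size=(3, d_h, d_in)),
              rng.normal(size=(3, d_h, d_h)),
              np.zeros((3, d_h)))
frames = [rng.normal(size=d_in) for _ in range(T)]   # f_1 .. f_T
g = bi_gru(frames, mk(), mk(), d_h)     # g_1 .. g_T, each 2*d_h dims
print(len(g), g[0].shape)
```

In practice a deep-learning framework's bidirectional GRU layer would replace this hand-rolled loop; the sketch only shows the data flow from (f_1, ..., f_T) to (g_1, ..., g_T).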
Step 3, the softmax layer sequences the sequence-level feature vector sequence
Figure BDA00037944139600000613
Mapping to a corresponding predicted sleep stage probability sequence, the predicted sleep stage probability sequence being defined as:
π i =(π i,1 ,π i,2 ,...,π i,K ) T
wherein, pi i,k Representing the probability of being predicted as sleep stage k for the ith set of PSG signal samples;
The sleep staging network loss function model in step 3 is specifically the cross-entropy loss:
Loss = -Σ_{i=1}^{T_1} y_i^T log(π_i)
where y_i is the one-hot encoded vector of the true sleep stage of the i-th group of PSG signal samples, and π_i is the predicted sleep stage probability sequence;
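A small numpy sketch of the softmax mapping and the cross-entropy loss above (K = 5 stages and the sample logits are illustrative):

```python
import numpy as np

def softmax(v):
    """Map a score vector to a probability sequence pi_i."""
    e = np.exp(v - v.max())     # shift for numerical stability
    return e / e.sum()

def staging_loss(logits, labels, K=5):
    """Cross-entropy between predicted stage probabilities pi_i and
    one-hot true stages y_i, summed over the samples."""
    loss = 0.0
    for v, k in zip(logits, labels):
        pi = softmax(v)                  # pi_i = (pi_{i,1}, ..., pi_{i,K})
        y = np.eye(K)[k]                 # one-hot true sleep stage y_i
        loss -= float(y @ np.log(pi))    # -y_i^T log(pi_i)
    return loss

rng = np.random.default_rng(2)
logits = rng.normal(size=(4, 5))         # 4 samples, K = 5 sleep stages
labels = [0, 2, 1, 4]
print(staging_loss(logits, labels))
```

A confident, correct prediction drives its term toward zero, while a wrong confident prediction is penalized heavily, which is the behavior the training step relies on.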
To address poor robustness to PSG noise and artifacts, limited feature representation capability, and the scarcity of reliable PSG datasets in the sleep staging task, a data enhancement module is designed around the noise and artifacts of the PSG signal, improving the model's robustness to them; a visual Transformer-based encoder is introduced to improve the model's feature representation capability; and transfer learning relieves the dependence on large amounts of PSG data.
Drawings
FIG. 1: overall structure of an embodiment of the invention;
FIG. 2: denoised EEG (upper waveform) and EOG (lower waveform) signals and their time-frequency spectrograms, according to an embodiment of the invention;
FIG. 3: EEG (upper waveform) and EOG (lower waveform) signals and time-frequency spectrograms after signal channel interference, according to an embodiment of the invention;
FIG. 4: EEG (upper waveform) and EOG (lower waveform) signals and time-frequency spectrograms after adding high-frequency noise, according to an embodiment of the invention;
FIG. 5: EEG (upper waveform) and EOG (lower waveform) signals and time-frequency spectrograms after adding low-frequency noise, according to an embodiment of the invention;
FIG. 6: original and masked time-frequency spectra of the EEG (upper waveform) and EOG (lower waveform) after frequency masking, according to an embodiment of the invention;
FIG. 7: schematic of the cross-modal transfer learning implementation of an embodiment of the invention;
FIG. 8: schematic of the bidirectional GRU-based sequence encoder of the invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
In specific implementation, a person skilled in the art can implement the automatic operation process by using a computer software technology, and a system device for implementing the method, such as a computer readable storage medium storing a corresponding computer program according to the technical solution of the present invention and a computer device including the corresponding computer program, should also be within the scope of the present invention.
An automatic sleep staging method based on a visual Transformer according to an embodiment of the present invention is described below with reference to fig. 1 to 8, which includes:
step 1: introducing original PSG signals of a plurality of channels at a plurality of moments, and processing the original PSG signals of each channel at the plurality of moments through a sliding window to obtain a plurality of PSG signal sequences of each channel;
step 1, the original PSG signals of each channel at multiple times are defined as:
data_c = (data_{c,1}, data_{c,2}, ..., data_{c,L})
c ∈ [1, C]
where data_c represents the original PSG signals of the c-th channel at multiple moments, data_{c,n} represents the original PSG signal of the c-th channel at the n-th moment, n ∈ [1, L], L is the number of original moments, and C is the number of channels;
The window coverage of the sliding window processing in step 1 is: (n - (T_0 - 1)/2) to (n + (T_0 - 1)/2);
The window length of the sliding window processing in step 1 is T_0.
The multiple PSG signal sequences of each channel in step 1 specifically include:
Sdata_c = (S_{c,1}, S_{c,2}, ..., S_{c,T_1})
c ∈ [1, C]
where Sdata_c represents the PSG signals in the sliding windows at multiple moments of the c-th channel, and S_{c,i} represents the PSG signal in the sliding window at the i-th moment of the c-th channel, i ∈ [1, T_1]. In this embodiment, T_1 = 21 is the number of PSG signal sequences, T_0 = 3 is the sliding-window length, and C = 2 is the number of channels;
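The sliding-window step can be sketched as follows; the edge handling (repeating the first and last epoch so every epoch gets a full window) is an assumption, since the patent does not specify it:

```python
import numpy as np

def sliding_windows(data_c, T0):
    """Center a window of T0 epochs on each epoch n, covering
    n-(T0-1)/2 .. n+(T0-1)/2; edges are padded by repetition
    (padding scheme is an assumption, not from the patent)."""
    half = (T0 - 1) // 2
    padded = np.concatenate([data_c[:1].repeat(half, 0), data_c,
                             data_c[-1:].repeat(half, 0)])
    return np.stack([padded[n:n + T0] for n in range(len(data_c))])

# illustrative numbers matching the embodiment: window length T0 = 3
L, T0 = 21, 3
data_c = np.arange(L).reshape(L, 1)   # stand-in for one channel's epochs
S_c = sliding_windows(data_c, T0)
print(S_c.shape)  # (21, 3, 1): 21 windows of T0 = 3 epochs each
```

Each row S_c[n] corresponds to the window S_{c,n} around epoch n.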
step 2: obtaining multiple groups of PSG signal samples after data enhancement processing of each channel by carrying out data enhancement processing on the multiple PSG signal sequences of each channel, constructing each group of PSG signal samples by using the same group of data-enhanced PSG signal samples of the multiple channels, and manually marking the real sleep stage of each group of PSG signal samples;
the data enhancement processing in the step 2 specifically comprises the following steps:
signal denoising, signal channel interference, signal additive noise, and signal frequency masking, each applied independently with a certain random probability;
The multiple groups of data-enhanced PSG signal samples of each channel in step 2 are specifically:
Sdata'_c = (dS'_{c,1}, dS'_{c,2}, ..., dS'_{c,T_1})
c ∈ [1, C]
where Sdata'_c represents the data-enhanced PSG signals at multiple moments of the c-th channel, dS'_{c,m} represents the m-th data-enhanced PSG signal sample of the c-th channel, m ∈ [1, T_1], and T_1 is the number of data-enhanced PSG signal samples;
Step 2 constructs each group of PSG signal samples from the same group of data-enhanced PSG signal samples across the channels, specifically:
S'_i = (dS'_{1,i}, dS'_{2,i}, ..., dS'_{C,i})
i ∈ [1, T_1]
where S'_i represents the i-th group of PSG signal samples, T_1 is the number of PSG signal samples, and C is the number of channels;
The signal denoising processing: low-pass and high-pass signal content both carry value in sleep-related studies. This enhancement uses band-pass filtering to reduce noise in the PSG signal: the signal passes through a first-order Butterworth filter, retaining only in-band frequencies. During training, band-pass denoising is active with probability 0.5.
FIG. 2 shows the EEG (upper waveform) and EOG (lower waveform) time-domain signals and time-frequency spectra after signal denoising;
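A sketch of this band-pass denoising step using scipy; the cutoff frequencies (0.5-30 Hz, typical for EEG) are illustrative choices, not values given by the patent:

```python
import numpy as np
from scipy.signal import butter, lfilter

def bandpass_denoise(x, fs, lo=0.5, hi=30.0, p=0.5, rng=None):
    """With probability p, pass the signal through a first-order
    Butterworth band-pass filter, keeping only in-band frequencies.
    Cutoffs lo/hi are illustrative, not specified by the patent."""
    rng = rng or np.random.default_rng()
    if rng.random() >= p:
        return x                    # augmentation inactive this draw
    b, a = butter(1, [lo, hi], btype="bandpass", fs=fs)
    return lfilter(b, a, x)

fs = 100.0
t = np.arange(0, 2, 1 / fs)
# 10 Hz in-band component plus a 45 Hz out-of-band component
x = np.sin(2 * np.pi * 10 * t) + 0.3 * np.sin(2 * np.pi * 45 * t)
y = bandpass_denoise(x, fs, p=1.0)  # force the filter on
print(y.shape)
```

The 45 Hz component is attenuated while the 10 Hz component largely survives, which is the intended denoising effect.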
The signal channel interference processing: since the F3 and F4 electrodes are relatively close to the eyes, eye-movement artifacts are picked up by the frontal leads, and the associated deflections can be seen in the EEG signal; that is, deflections in the EOG leads also appear in the frontal-area leads. Similarly, the EOG channel sometimes picks up signal from the EEG channel. This artifact is simulated by superimposing the EEG and EOG signals at a particular scale. During training, signal interference is active with probability 0.4; the EOG channel receiving EEG signal and the EEG channel receiving EOG signal are each chosen with probability 50%.
FIG. 3 shows the EEG (upper waveform) and EOG (lower waveform) time-domain signals and time-frequency spectra after signal channel interference;
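The channel-interference augmentation can be sketched as a probabilistic cross-channel mix; the mixing scale of 0.2 is an illustrative value, as the patent does not state the superposition scale:

```python
import numpy as np

def channel_interference(eeg, eog, p=0.4, scale=0.2, rng=None):
    """With probability p, superimpose one channel onto the other to
    mimic eye-movement artifacts picked up by frontal EEG leads (and
    EEG activity leaking into the EOG). `scale` is illustrative."""
    rng = rng or np.random.default_rng()
    if rng.random() >= p:
        return eeg, eog                  # augmentation inactive
    if rng.random() < 0.5:               # EEG picks up the EOG deflection
        return eeg + scale * eog, eog
    return eeg, eog + scale * eeg        # EOG picks up EEG activity

rng = np.random.default_rng(3)
eeg = rng.normal(size=100)
eog = rng.normal(size=100)
eeg2, eog2 = channel_interference(eeg, eog, p=1.0,
                                  rng=np.random.default_rng(0))
print(eeg2.shape, eog2.shape)
```

Exactly one of the two channels is perturbed per activation, matching the 50/50 direction choice described above.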
The signal additive noise processing: slow-frequency artifacts and muscle artifacts are simulated by adding high-frequency low-amplitude or low-frequency high-amplitude noise to the EEG and EOG channels. Slow-frequency artifacts are typically due to sweating or breathing-related body motion: sweat changes the electrode potential and dilutes the conductive medium between electrode and skin, creating an artifact resembling a delta wave. Muscle artifacts are typically produced by local muscle activity, with frequencies of 20-200 Hz. These artifacts are simulated by adding independent, identically distributed high-frequency low-amplitude or low-frequency high-amplitude noise to the EEG and EOG channels. During training, additive noise is active with probability 0.5, with high-frequency and low-frequency noise each chosen with probability 50%.
FIGS. 4 and 5 show the EEG (upper waveform) and EOG (lower waveform) time-domain signals and time-frequency spectra after adding high-frequency noise and low-frequency noise, respectively;
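A sketch of the additive-noise augmentation; the corner frequencies (20 Hz high-pass for muscle-like noise, 2 Hz low-pass for sweat/motion-like noise) and the amplitudes are illustrative assumptions:

```python
import numpy as np
from scipy.signal import butter, lfilter

def additive_noise(x, fs, p=0.5, rng=None):
    """With probability p, add either high-frequency low-amplitude noise
    (muscle-artifact-like) or low-frequency high-amplitude noise
    (sweat/motion-artifact-like). Cutoffs and amplitudes are illustrative."""
    rng = rng or np.random.default_rng()
    if rng.random() >= p:
        return x                                    # inactive this draw
    noise = rng.normal(size=x.shape)
    if rng.random() < 0.5:                          # high-freq, low amplitude
        b, a = butter(2, 20.0, btype="highpass", fs=fs)
        return x + 0.1 * lfilter(b, a, noise)
    b, a = butter(2, 2.0, btype="lowpass", fs=fs)   # low-freq, high amplitude
    return x + 1.0 * lfilter(b, a, noise)

fs = 100.0
x = np.zeros(200)                                   # clean stand-in signal
y = additive_noise(x, fs, p=1.0, rng=np.random.default_rng(0))
print(y.shape)
```

Filtering white noise through a high-pass or low-pass filter is one simple way to obtain band-limited noise of the kinds described above.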
The signal frequency masking processing: masking techniques are widely used in audio and video research, but the effectiveness of masking strategies on the spectrum of PSG signals has not yet been explored. In this enhancement module, a set of consecutive frequency channels or time steps is masked using a frequency mask and a time mask: frequency masking is achieved by passing the time signal through a band-stop filter, while time masking is achieved by setting consecutive sampling points to zero.
FIG. 6 shows the original and masked time-frequency spectra of the EEG (upper waveform) and EOG (lower waveform) after frequency masking;
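The two masking operations can be sketched as follows; the band edges, mask position, and mask width are illustrative values, not taken from the patent:

```python
import numpy as np
from scipy.signal import butter, lfilter

def frequency_mask(x, fs, f_lo, f_hi):
    """Mask a band of frequencies by band-stop filtering the time signal."""
    b, a = butter(2, [f_lo, f_hi], btype="bandstop", fs=fs)
    return lfilter(b, a, x)

def time_mask(x, start, width):
    """Mask consecutive sampling points by setting them to zero."""
    y = x.copy()
    y[start:start + width] = 0.0
    return y

fs = 100.0
t = np.arange(0, 2, 1 / fs)
x = np.sin(2 * np.pi * 10 * t)           # a 10 Hz test tone
x_f = frequency_mask(x, fs, 8.0, 12.0)   # notch out the 10 Hz content
x_t = time_mask(x, 50, 20)               # zero 20 consecutive samples
print(x_f.shape, x_t.shape)
```

The band-stop filter suppresses the masked frequency band in the spectrum, while the time mask produces a flat gap in the waveform, mirroring the two masking modes described above.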
Step 3: sequentially cascade a visual Transformer frame-level encoder, a bidirectional GRU sequence-level encoder and a softmax layer to construct a sleep staging network; input each group of PSG signal samples into the network to obtain its predicted sleep stage; initialize the network through cross-modal transfer learning; construct the network's loss function from the true sleep stages of each group of samples; and train with the Adam optimizer to obtain the optimized sleep staging network;
the visual Transformer frame-level encoder is formed by sequentially cascading a time-frequency transform layer, a time-frequency spectrum blocking layer, a linear projection layer, a position encoding layer, a multi-head attention layer, a full connection layer and a token connection layer;
the time-frequency transform layer computes the short-time Fourier transform time-frequency spectrum of the i-th group of PSG signal samples S'_i through short-time Fourier transform, expressed as

F_i = (dF_{1,i}, dF_{2,i}, ..., dF_{C,i}), i ∈ [1, T_1]

wherein F_i represents the short-time Fourier transform time-frequency spectrum of the i-th group of PSG signal samples, T_1 represents the number of PSG signal samples, C represents the number of channels, and dF_{c,i} represents the short-time Fourier transform time-frequency spectrum of the i-th group of PSG signal samples on the c-th channel;
dF_{1,i}, dF_{2,i}, ..., dF_{C,i} are spliced along the frequency axis to obtain the spliced time-frequency spectrum X_{fft,i} of the i-th group of PSG signal samples; a logarithmic transformation is applied to X_{fft,i} to obtain the spliced log time-frequency spectrum of the i-th group of PSG signal samples, which is then normalized to a standard normal distribution (zero mean, unit variance) to obtain the normalized time-frequency spectrum X'_{fft,i} of the i-th group of PSG signal samples;
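The transform-splice-log-normalize pipeline can be sketched as follows; the Hann window, window length, hop size, and eps constants are assumptions, not values from the description:

```python
import numpy as np

def stft_spec(x, win_len=256, hop=128):
    """Minimal magnitude STFT time-frequency spectrum with a Hann window."""
    win = np.hanning(win_len)
    frames = [x[s:s + win_len] * win
              for s in range(0, len(x) - win_len + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=1)).T   # (freq, time)

def channel_spectrogram(channels):
    """Per-channel STFT spectra spliced along the frequency axis,
    log-transformed, then normalized to zero mean / unit variance
    (the normal-distribution normalization described above)."""
    spec = np.concatenate([stft_spec(ch) for ch in channels], axis=0)
    logspec = np.log(spec + 1e-8)        # log transform; eps for stability
    return (logspec - logspec.mean()) / (logspec.std() + 1e-8)
```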
the time-frequency spectrum blocking layer partitions the normalized time-frequency spectrum X'_{fft,i} of the i-th group of PSG signal samples into a sequence of N patches of size p × p, expressed as the blocked time-frequency spectrum of the i-th group of PSG signal samples:

X_i = (x_{1,i}, x_{2,i}, ..., x_{n,i}, ..., x_{N,i}), n ∈ [1, N]

wherein x_{n,i} represents the n-th patch in the blocked time-frequency spectrum of the i-th group of PSG signal samples, and N is the total number of patches in the blocked time-frequency spectrum of the i-th group of PSG signal samples;
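The p × p blocking reduces to a reshape; a sketch, assuming the spectrum dimensions are exact multiples of p:

```python
import numpy as np

def to_patches(spec, p):
    """Split a (F, T) time-frequency spectrum into non-overlapping
    p x p patches, ordered row by row; F and T must be multiples of p."""
    F, T = spec.shape
    assert F % p == 0 and T % p == 0
    return (spec.reshape(F // p, p, T // p, p)
                .transpose(0, 2, 1, 3)     # (F/p, T/p, p, p)
                .reshape(-1, p, p))        # N = (F/p)*(T/p) patches
```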
the linear projection layer sequentially converts each patch in the blocked time-frequency spectrum of the i-th group of PSG signal samples into the patch vector sequence of the i-th group of PSG signal samples through linear projection, specifically defined as:

E_i = (E_{i,1}, E_{i,2}, ..., E_{i,N})

wherein E_{i,n} represents the n-th patch vector of the i-th group of PSG signal samples, and N is the total number of patches in the blocked time-frequency spectrum of the i-th group of PSG signal samples;
the position encoding layer superimposes a randomly initialized position embedding on each patch vector to obtain the encoded feature sequence of the i-th group of PSG signal samples, specifically defined as:

z_{i,n} = E_{i,n} + P_{i,n}, n ∈ [1, N]

wherein P_{i,n} represents the position embedding of the n-th patch of the i-th group of PSG signal samples, N is the total number of patches in the blocked time-frequency spectrum of the i-th group of PSG signal samples, and z_{i,n} represents the encoded feature of the n-th patch vector of the i-th group of PSG signal samples;
the Transformer input feature sequence of the i-th group of PSG signal samples is constructed as follows:

Z_i = (x_{CLS} + P_{i,0}, z_{i,1}, ..., z_{i,N})

wherein P_{i,0} represents the position embedding of the input [CLS] token of the i-th group of PSG signal samples, x_{CLS} is the learnable [CLS] token at the start of the sequence of the i-th group of PSG signal samples, and z_{i,n} represents the encoded feature of the n-th patch in the blocked time-frequency spectrum of the i-th group of PSG signal samples;
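The projection, [CLS] prepending, and position-embedding steps can be sketched as follows; the random arrays stand in for learned parameters:

```python
import numpy as np

def build_transformer_input(patches, W, cls_token, pos_embed):
    """Flatten each p x p patch, project it to d dimensions with W,
    prepend a learnable [CLS] token, and add position embeddings."""
    N = patches.shape[0]
    flat = patches.reshape(N, -1)        # (N, p*p) flattened patches
    E = flat @ W                         # (N, d) patch vectors
    tokens = np.vstack([cls_token, E])   # (N+1, d), [CLS] token first
    return tokens + pos_embed            # position embeddings superimposed

# Placeholder parameters (random stand-ins for learned weights)
rng = np.random.default_rng(0)
p, d, N = 4, 8, 6
patches = rng.standard_normal((N, p, p))
W = rng.standard_normal((p * p, d)) * 0.02
cls_token = rng.standard_normal((1, d))
pos_embed = rng.standard_normal((N + 1, d)) * 0.02
Z = build_transformer_input(patches, W, cls_token, pos_embed)
```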
the multi-head attention layer and the fully connected layer process Z_i through a multi-layer Transformer encoder to obtain the output feature sequence of the i-th group of PSG signal samples:

Z_i^out = (z_{CLS,i}^out, z_{i,1}^out, ..., z_{i,N}^out)

wherein Z_i^out represents the output feature sequence of the i-th group of PSG signal samples, z_{CLS,i}^out represents the output [CLS] token of the i-th group of PSG signal samples, and z_{i,n}^out represents the output feature of the n-th patch of the i-th group of PSG signal samples;
the target sleep frame length is defined as N_0, and the output feature vector sequence of the i-th group of PSG signal samples is constructed from the patch output features, defined as follows:

D_i = (z_{i,1}^out, z_{i,2}^out, ..., z_{i,N}^out)
the token connection layer connects z_{CLS,i}^out with the mean of D_i to obtain the single-sleep-frame feature of the i-th group of PSG signal samples, defined as follows:

f_i = Concat(z_{CLS,i}^out, mean(D_i))

wherein z_{i,n}^out represents the output feature of the n-th patch of the i-th group of PSG signal samples, and Concat represents concatenation;
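The token connection step then reduces to a concatenation; a sketch:

```python
import numpy as np

def frame_feature(cls_out, patch_outs):
    """Single-sleep-frame feature: the output [CLS] token concatenated
    with the mean of the patch output features D_i."""
    return np.concatenate([cls_out, patch_outs.mean(axis=0)])
```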
the cross-modal transfer learning in step 3 initializes the sleep staging network through two operations: channel weight averaging and input length adaptation.
The channel weight averaging: the weights corresponding to the three input channels of the linear projection layer of the pre-trained Transformer are averaged and used as the weight of the linear projection layer of the visual Transformer frame-level encoder.
The input length adaptation: the input shape of the pre-trained Transformer is fixed (224 × 224 or 384 × 384), which differs from a typical PSG time-frequency spectrogram. To resolve the resulting mismatch in position embedding length, the position embeddings are adapted by cropping and bilinear interpolation.
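Both initialization operations can be sketched as follows; the 1-D linear interpolation is a simplified stand-in for the bilinear interpolation applied to 2-D position-embedding grids:

```python
import numpy as np

def average_channel_weights(W_rgb):
    """Average a pretrained patch-projection weight over its 3 RGB
    input channels to obtain a single-channel weight."""
    return W_rgb.mean(axis=1, keepdims=True)   # (d, 3, p, p) -> (d, 1, p, p)

def adapt_pos_embed(pos, new_len):
    """Adapt a (L, d) position embedding to new_len positions by
    cropping when shorter, or linear interpolation when longer."""
    L, d = pos.shape
    if new_len <= L:
        return pos[:new_len]                   # cropping method
    old_x = np.linspace(0.0, 1.0, L)
    new_x = np.linspace(0.0, 1.0, new_len)     # interpolation method
    return np.stack([np.interp(new_x, old_x, pos[:, j]) for j in range(d)],
                    axis=1)
```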
FIG. 7 is a schematic diagram of a cross-modal migration learning implementation;
(f_1, f_2, ..., f_{T_1}) is defined as the feature sequence of single sleep frames;
step 3, the bidirectional GRU sequence-level encoder converts the feature sequence of single sleep frames (f_1, f_2, ..., f_{T_1}) into the sequence-level feature vector sequence (g_1, g_2, ..., g_{T_1});
Fig. 8 is a schematic diagram of a bidirectional GRU based sequence encoder;
step 3, the softmax layer maps the sequence-level feature vector sequence to the corresponding predicted sleep stage probability sequence, defined as:

π_i = (π_{i,1}, π_{i,2}, ..., π_{i,K})^T

wherein π_{i,k} represents the probability that the i-th group of PSG signal samples is predicted as sleep stage k, and K is the number of sleep stages;
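A numerically stable softmax producing the K-stage probability vector might look like:

```python
import numpy as np

def softmax(logits):
    """Map a K-dimensional logit vector to sleep-stage probabilities."""
    z = logits - logits.max()   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()
```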
the sleep staging network loss function model in step 3 is specifically the cross-entropy loss:

L = −Σ_i y_i^T log π_i

wherein y_i represents the one-hot encoded vector of the true sleep stage of the i-th group of PSG signal samples, and π_i is the predicted sleep stage probability sequence;
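Assuming the standard cross-entropy form with a mean reduction over samples (the reduction is not spelled out in the description), the loss can be sketched as:

```python
import numpy as np

def cross_entropy(y_onehot, pi, eps=1e-12):
    """Cross-entropy between one-hot true stages y_i and predicted
    stage probabilities pi_i, averaged over samples."""
    return -np.mean(np.sum(y_onehot * np.log(pi + eps), axis=1))
```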
and step 4: the polysomnography monitoring device is worn on a human body according to sleep medicine standards; PSG signals at multiple times are collected in real time and transmitted to a computer; the computer processes the PSG signals collected in real time at multiple times through the sliding window of step 1 to obtain real-time PSG signal samples, and the real-time PSG signal samples are predicted through the optimized sleep staging network to obtain real-time sleep stages.
It should be understood that parts of the specification not set forth in detail are well within the prior art.
It should be understood that the above description of the preferred embodiments is given for clarity and not for any purpose of limitation, and that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (6)

1. A visual Transformer-based automatic sleep staging method is characterized by comprising the following steps:
step 1: introducing original PSG signals of a plurality of channels at a plurality of moments, and processing the original PSG signals of each channel at the plurality of moments through a sliding window to obtain a plurality of PSG signal sequences of each channel;
step 2: obtaining multiple groups of PSG signal samples after data enhancement processing of each channel by carrying out data enhancement processing on the multiple PSG signal sequences of each channel, constructing each group of PSG signal samples by using the same group of data-enhanced PSG signal samples of the multiple channels, and manually marking the real sleep stage of each group of PSG signal samples;
and step 3: sequentially cascading a visual Transformer frame-level encoder, a bidirectional GRU sequence-level encoder and a softmax layer to construct a sleep staging network, inputting each group of PSG signal samples into the sleep staging network to predict the sleep stage of each group of PSG signal samples, initializing the sleep staging network through cross-modal transfer learning, constructing a sleep staging network loss function model by combining the real sleep stages of each group of PSG signal samples, and training through the Adam (adaptive moment estimation) optimizer to obtain an optimized sleep staging network;
and 4, step 4: PSG signals at multiple moments are collected in real time, real-time PSG signal samples are obtained through sliding window processing in the step 1, and the real-time PSG signal samples are predicted through an optimized sleep staging network to obtain real-time sleep stages.
2. The visual Transformer-based automated sleep staging method according to claim 1, characterized in that:
step 1, the original PSG signals of each channel at multiple times are defined as:

data_c = (data_{c,1}, data_{c,2}, ..., data_{c,L}), c ∈ [1, C]

wherein data_c represents the original PSG signals at multiple times of the c-th channel, data_{c,n} represents the original PSG signal at the n-th time of the c-th channel, n ∈ [1, L], L represents the number of original times, and C represents the number of channels;
the window coverage of the sliding window processing in step 1 is from (n − (T_0 − 1)/2) to (n + (T_0 − 1)/2);
the window length of the sliding window processing in step 1 is T_0;
the multiple PSG signal sequences of each channel in step 1 are specifically:

Sdata_c = (S_{c,1}, S_{c,2}, ..., S_{c,T_1}), c ∈ [1, C]

wherein Sdata_c represents the PSG signals in the sliding windows at multiple times of the c-th channel, S_{c,i} represents the PSG signal in the sliding window at the i-th time of the c-th channel, i ∈ [1, T_1], T_1 represents the number of PSG signal sequences, and C represents the number of channels.
3. The visual Transformer-based automated sleep staging method according to claim 1, characterized in that:
the data enhancement processing in step 2 is specifically as follows:
signal denoising processing, signal channel interference processing, signal additive noise processing, and signal masking frequency processing are each applied with a certain random probability;
step 2, the multiple groups of data-enhanced PSG signal samples of each channel are specifically:

Sdata'_c = (dS'_{c,1}, dS'_{c,2}, ..., dS'_{c,T_1}), c ∈ [1, C]

wherein Sdata'_c represents the data-enhanced PSG signals at multiple times of the c-th channel, dS'_{c,m} represents the m-th group of data-enhanced PSG signal samples of the c-th channel, m ∈ [1, T_1], and T_1 represents the number of PSG signal samples after data enhancement processing;
step 2, each group of PSG signal samples is constructed from the same group of data-enhanced PSG signal samples of the multiple channels, specifically as follows:

S'_i = (dS'_{1,i}, dS'_{2,i}, ..., dS'_{C,i}), i ∈ [1, T_1]

wherein S'_i represents the i-th group of PSG signal samples, T_1 represents the number of PSG signal samples, and C represents the number of channels.
4. The visual Transformer-based automated sleep staging method according to claim 1, characterized in that:
the visual Transformer frame-level encoder is formed by sequentially cascading a time-frequency transform layer, a time-frequency spectrum blocking layer, a linear projection layer, a position encoding layer, a multi-head attention layer, a full connection layer and a token connection layer;
the time-frequency transform layer computes the short-time Fourier transform time-frequency spectrum of the i-th group of PSG signal samples S'_i through short-time Fourier transform, expressed as

F_i = (dF_{1,i}, dF_{2,i}, ..., dF_{C,i}), i ∈ [1, T_1]

wherein F_i represents the short-time Fourier transform time-frequency spectrum of the i-th group of PSG signal samples, T_1 represents the number of PSG signal samples, C represents the number of channels, and dF_{c,i} represents the short-time Fourier transform time-frequency spectrum of the i-th group of PSG signal samples on the c-th channel;

dF_{1,i}, dF_{2,i}, ..., dF_{C,i} are spliced along the frequency axis to obtain the spliced time-frequency spectrum X_{fft,i} of the i-th group of PSG signal samples; a logarithmic transformation is applied to X_{fft,i} to obtain the spliced log time-frequency spectrum of the i-th group of PSG signal samples, which is then normalized to a standard normal distribution to obtain the normalized time-frequency spectrum X'_{fft,i} of the i-th group of PSG signal samples;
the time-frequency spectrum blocking layer partitions the normalized time-frequency spectrum X'_{fft,i} of the i-th group of PSG signal samples into a sequence of N patches of size p × p, expressed as the blocked time-frequency spectrum of the i-th group of PSG signal samples:

X_i = (x_{1,i}, x_{2,i}, ..., x_{n,i}, ..., x_{N,i}), n ∈ [1, N]

wherein x_{n,i} represents the n-th patch in the blocked time-frequency spectrum of the i-th group of PSG signal samples, and N is the total number of patches in the blocked time-frequency spectrum of the i-th group of PSG signal samples;
the linear projection layer sequentially converts each patch in the blocked time-frequency spectrum of the i-th group of PSG signal samples into the patch vector sequence of the i-th group of PSG signal samples through linear projection, specifically defined as:

E_i = (E_{i,1}, E_{i,2}, ..., E_{i,N})

wherein E_{i,n} represents the n-th patch vector of the i-th group of PSG signal samples, and N is the total number of patches in the blocked time-frequency spectrum of the i-th group of PSG signal samples;
the position encoding layer superimposes a randomly initialized position embedding on each patch vector to obtain the encoded feature sequence of the i-th group of PSG signal samples, specifically defined as:

z_{i,n} = E_{i,n} + P_{i,n}, n ∈ [1, N]

wherein P_{i,n} represents the position embedding of the n-th patch of the i-th group of PSG signal samples, N is the total number of patches in the blocked time-frequency spectrum of the i-th group of PSG signal samples, and z_{i,n} represents the encoded feature of the n-th patch vector of the i-th group of PSG signal samples;
the Transformer input feature sequence of the i-th group of PSG signal samples is constructed as follows:

Z_i = (x_{CLS} + P_{i,0}, z_{i,1}, ..., z_{i,N})

wherein P_{i,0} represents the position embedding of the input [CLS] token of the i-th group of PSG signal samples, x_{CLS} is the learnable [CLS] token at the start of the sequence of the i-th group of PSG signal samples, and z_{i,n} represents the encoded feature of the n-th patch in the blocked time-frequency spectrum of the i-th group of PSG signal samples;
the multi-head attention layer and the fully connected layer process Z_i through a multi-layer Transformer encoder to obtain the output feature sequence of the i-th group of PSG signal samples:

Z_i^out = (z_{CLS,i}^out, z_{i,1}^out, ..., z_{i,N}^out)

wherein Z_i^out represents the output feature sequence of the i-th group of PSG signal samples, z_{CLS,i}^out represents the output [CLS] token of the i-th group of PSG signal samples, and z_{i,n}^out represents the output feature of the n-th patch of the i-th group of PSG signal samples;
the target sleep frame length is defined as N_0, and the output feature vector sequence of the i-th group of PSG signal samples is constructed from the patch output features, defined as follows:

D_i = (z_{i,1}^out, z_{i,2}^out, ..., z_{i,N}^out)
the token connection layer connects z_{CLS,i}^out with the mean of D_i to obtain the single-sleep-frame feature of the i-th group of PSG signal samples, defined as follows:

f_i = Concat(z_{CLS,i}^out, mean(D_i))

wherein z_{i,n}^out represents the output feature of the n-th patch of the i-th group of PSG signal samples, and Concat represents concatenation;
(f_1, f_2, ..., f_{T_1}) is defined as the feature sequence of single sleep frames.
5. The visual Transformer-based automated sleep staging method according to claim 1, characterized in that:
step 3, the bidirectional GRU sequence-level encoder converts the feature sequence of single sleep frames (f_1, f_2, ..., f_{T_1}) into the sequence-level feature vector sequence (g_1, g_2, ..., g_{T_1});

step 3, the softmax layer maps the sequence-level feature vector sequence to the corresponding predicted sleep stage probability sequence, defined as:

π_i = (π_{i,1}, π_{i,2}, ..., π_{i,K})^T

wherein π_{i,k} represents the probability that the i-th group of PSG signal samples is predicted as sleep stage k.
6. The visual Transformer-based automated sleep staging method according to claim 1, characterized in that:
the sleep staging network loss function model in step 3 is specifically:

L = −Σ_i y_i^T log π_i

wherein y_i represents the one-hot encoded vector of the true sleep stage of the i-th group of PSG signal samples, and π_i is the predicted sleep stage probability sequence.
CN202210965248.6A 2022-08-12 2022-08-12 Automatic sleep staging method based on visual Transformer Pending CN115374815A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210965248.6A CN115374815A (en) 2022-08-12 2022-08-12 Automatic sleep staging method based on visual Transformer

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210965248.6A CN115374815A (en) 2022-08-12 2022-08-12 Automatic sleep staging method based on visual Transformer

Publications (1)

Publication Number Publication Date
CN115374815A true CN115374815A (en) 2022-11-22

Family

ID=84066240

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210965248.6A Pending CN115374815A (en) 2022-08-12 2022-08-12 Automatic sleep staging method based on visual Transformer

Country Status (1)

Country Link
CN (1) CN115374815A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117316369A (en) * 2023-08-24 2023-12-29 兰州交通大学 Chest image diagnosis report automatic generation method for balancing cross-mode information
CN117316369B (en) * 2023-08-24 2024-05-07 兰州交通大学 Chest image diagnosis report automatic generation method for balancing cross-mode information

Similar Documents

Publication Publication Date Title
CN114376564B (en) Sleep staging method, system, device and medium based on ballistocardiogram signals
CN107736894A (en) A kind of electrocardiosignal Emotion identification method based on deep learning
CN107657868A (en) A kind of teaching tracking accessory system based on brain wave
WO2021114761A1 (en) Lung rale artificial intelligence real-time classification method, system and device of electronic stethoscope, and readable storage medium
CN110946576A (en) Visual evoked potential emotion recognition method based on width learning
CN110600053A (en) Cerebral stroke dysarthria risk prediction method based on ResNet and LSTM network
CN110731778B (en) Method and system for recognizing breathing sound signal based on visualization
CN114190944B (en) Robust emotion recognition method based on electroencephalogram signals
CN114469124A (en) Method for identifying abnormal electrocardiosignals in motion process
Tang et al. ECG de-noising based on empirical mode decomposition
CN115374815A (en) Automatic sleep staging method based on visual Transformer
CN111772669A (en) Elbow joint contraction muscle force estimation method based on adaptive long-time and short-time memory network
CN113609975A (en) Modeling method for tremor detection, hand tremor detection device and method
CN113796889A (en) Auxiliary electronic stethoscope signal discrimination method based on deep learning
CN113576472B (en) Blood oxygen signal segmentation method based on full convolution neural network
CN113974607B (en) Sleep snore detecting system based on pulse neural network
KR20220158462A (en) EMG signal-based recognition information extraction system and EMG signal-based recognition information extraction method using the same
He et al. HMT: An EEG Signal Classification Method Based on CNN Architecture
CN116196015A (en) Electroencephalogram classification model based on rhythm feature fusion convolutional neural network
CN112617761B (en) Sleep stage staging method for self-adaptive focalization generation
CN115270847A (en) Design decision electroencephalogram recognition method based on wavelet packet decomposition and convolutional neural network
CN114569116A (en) Three-channel image and transfer learning-based ballistocardiogram ventricular fibrillation auxiliary diagnosis system
Murthy et al. Design and implementation of hybrid techniques and DA-based reconfigurable FIR filter design for noise removal in EEG signals on FPGA
Liu et al. SDEMG: Score-Based Diffusion Model for Surface Electromyographic Signal Denoising
CN115956925B (en) QRS wave detection method and system based on multistage smooth envelope

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination