CN116364096A - Electroencephalogram signal speech decoding method based on a generative adversarial network - Google Patents

Electroencephalogram signal speech decoding method based on a generative adversarial network

Info

Publication number
CN116364096A
Authority
CN
China
Prior art keywords
voice
decoding
data
mel
speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310220138.1A
Other languages
Chinese (zh)
Other versions
CN116364096B (en)
Inventor
张韶岷
刘腾俊
冉星辰
万子俊
李悦
郑能干
陈卫东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202310220138.1A priority Critical patent/CN116364096B/en
Publication of CN116364096A publication Critical patent/CN116364096A/en
Application granted granted Critical
Publication of CN116364096B publication Critical patent/CN116364096B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/0018Speech coding using phonetic or linguistical decoding of the source; Reconstruction using text-to-speech synthesis
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G10L19/16Vocoder architecture
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/66Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for extracting parameters related to health condition
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02ATECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Abstract

The invention discloses an electroencephalogram (EEG) signal speech decoding method based on a generative adversarial network (GAN), which decodes speech from EEG signals and synthesizes intelligible vowel audio. The method uses the subject's synchronously acquired EEG signals and speech data and, after effective preprocessing, lets the generative adversarial network learn the mapping from the EEG signals to the speech data, so that the over-smoothing problem is effectively alleviated and the audio synthesized from the decoded speech features is more intelligible. The model consists of a generator and a discriminator: the generator performs dimension reduction on the neural features and generates the speech features, while the discriminator judges the authenticity of the speech features. The method achieves high decoding accuracy and good intelligibility when decoding single phonemes. Moreover, compared with existing speech decoding algorithms, it markedly reduces the over-smoothing present in speech features decoded from EEG.

Description

Electroencephalogram signal speech decoding method based on a generative adversarial network
Technical Field
The invention belongs to the technical field of electroencephalogram (EEG) data analysis, and particularly relates to an EEG signal speech decoding method based on a generative adversarial network (GAN).
Background
In recent years, speech decoding technology in the field of brain-computer interfaces has developed rapidly. Current research on speech brain-computer interfaces follows two different technical routes. The first is discrete decoding, which classifies the recorded neural signals into corresponding speech representations, e.g. phonemes or text, within a finite set of classes. The second is continuous decoding, i.e. the neural signals are decoded directly into the acoustic features of speech or the motion trajectories of the vocal organs. Discrete speech-feature decoding is comparatively easy, but phonemes or words of similar categories are hard to distinguish and the classification set is limited. Continuous speech-feature decoding can output arbitrary speech and can synthesize emotional speech through features such as intonation and pauses, but decoding is harder and the accuracy requirements on the decoding results are higher.
Currently, the most common methods for continuous speech-feature decoding are neural networks, including long short-term memory (LSTM) networks, autoencoders and deep neural networks (DNN). Thanks to the rapid development of deep learning, these neural network approaches achieve higher accuracy than traditional machine learning models. However, they still face problems, and over-smoothing of the decoded speech features is one of the important ones: the distribution of the decoded speech features tends to converge near the mean of the training data, so high-frequency information is lost. A generative adversarial network can fit the data distribution better than these neural networks and can therefore effectively alleviate the over-smoothing problem, which is of great value for improving the quality of speech decoding and the intelligibility of the synthesized audio.
Disclosure of Invention
In view of the above drawbacks of the prior art, an object of the present invention is to provide an EEG signal speech decoding method based on a generative adversarial network. Using the subject's synchronously acquired EEG signals and speech data, after effective preprocessing the generative adversarial network learns the mapping from the EEG signals to the speech data, so that the over-smoothing problem is effectively alleviated and the audio synthesized from the decoded speech features is more intelligible.
The electroencephalogram signal speech decoding method based on a generative adversarial network is realized through the following steps:
(1) Preprocessing the acquired raw EEG signals with a time-frequency analysis method, and taking the power spectral densities of different frequency bands of the EEG signals as the neural features;
(2) Preprocessing the speech signals in a public corpus and the subject's speech signal with a speech signal processing tool to obtain the mel-cepstrum features of the public-corpus speech signals and the mel-cepstrum features, fundamental frequency F0 and aperiodic part of the subject's speech signal as the speech features, and then aligning the speech features with the neural features;
(3) Establishing the optimal alignment path between the mel-cepstrum features of the public-corpus speech signals and those of the subject's speech signal with a dynamic time warping algorithm, and inferring the subject's electromagnetic articulography (EMA) data from the EMA data in the public corpus;
(4) Using the neural features aligned in step (2) as input data and the speech features aligned in step (2) together with the subject's EMA data from step (3) as the decoding targets, feeding them into a generative adversarial network model for training, and thereby constructing a generative adversarial network for speech decoding;
(5) Inputting the neural features of the test set into the generative adversarial network for speech decoding, computing the decoded mel-cepstrum features, fundamental frequency F0, aperiodic part and EMA data, and synthesizing them into a speech waveform with a vocoder.
In step (1), the preprocessing specifically comprises:
first filtering and downsampling the raw EEG signal of each channel; then extracting the power spectral density; and finally dividing the extracted power spectral density into several frequency bands of different bandwidths according to the center frequency, yielding the power spectral densities of the different frequency bands of the EEG signal.
More specifically, the raw neural signal is first re-referenced with a common median reference algorithm; the neural signal is then low-pass filtered (to prevent aliasing) and downsampled; next, the time-frequency features of each channel's neural signal are extracted with a multi-taper power spectrum estimation algorithm; finally, the extracted time-frequency features are divided into several frequency bands of different bandwidths according to their center frequency.
The speech signals in step (2) include the speech signals in the public corpus and the subject's speech signal. The subject's speech signal consists of designated vowels or phrases from the public corpus read aloud repeatedly by the subject (lip-reading and silent-reading samples are not included) and is acquired synchronously with the neural data through a neural signal processor. The speech signals in the public corpus and the subject's speech signal are then preprocessed with a speech signal processing tool to obtain the mel-cepstrum features of the public-corpus speech signals and the mel-cepstrum features, fundamental frequency F0 and aperiodic part of the subject's speech signal as the speech features, after which the speech features are aligned with the neural features.
In step (3), the optimal alignment path between the mel-cepstrum features of the public-corpus speech signal and the mel-cepstrum features of the subject's speech signal is established with a dynamic time warping algorithm, and the subject's electromagnetic articulography data are then inferred from the EMA data in the public corpus, as follows:

Let the mel-cepstrum features of the public-corpus speech signal be $X \in \mathbb{R}^{T_X \times d}$ and the mel-cepstrum features of the subject's speech signal be $Y \in \mathbb{R}^{T_Y \times d}$, where $d$ is the feature dimension of the mel-cepstrum features and $T_X$, $T_Y$ are the sequence lengths of $X$ and $Y$, respectively. Denote the alignment path between $X$ and $Y$ produced by the dynamic time warping algorithm as $W = \{(a_k, b_k)\}_{k=1}^{K}$, where $(a_k, b_k)$ indicates that the $a_k$-th frame of $X$ is aligned with the $b_k$-th frame of $Y$. The alignment path between $X$ and $Y$ is obtained by minimizing

$$\sum_{k=1}^{K} \lVert X_{a_k} - Y_{b_k} \rVert,$$

where $K$ is the length of the dynamic time warping path. After $W$ is computed, the alignment index sequence $p$ of $X$ relative to $Y$ is obtained as

$$p(i) = a_{j^{*}}, \quad j^{*} = \min\{\, j : b_j \ge i \,\},$$

where $j$ is a temporary index and the smallest $j$ satisfying $b_j \ge i$ is taken. The sequence $p$ represents the alignment index between the sequences $X$ and $Y$: for each frame $i$ of $Y$ it gives the index of the frame of $X$ aligned with it. The EMA data in the public corpus are interpolated at the alignment index sequence $p$, generating the subject's EMA data.
In step (4), the generative adversarial network model comprises a generator and a discriminator.

The generator $G$ comprises a feature dimension-reduction module and a decoding module, and is used to decode the neural features into speech features and EMA data.

The discriminator is used to judge the authenticity of the generated speech features and EMA data; it comprises a speech discriminator $D_1$ and an EMA data discriminator $D_2$, which judge the speech features and the EMA data, respectively.
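The patent specifies the generator as a feature dimension-reduction module followed by a decoding module, and the discriminators as conditional networks built from stacked convolutional layers (see the beneficial effects below). The following minimal PyTorch sketch illustrates one possible layout under those constraints; the hidden sizes, the GRU trunk of the decoding module and the way the condition is concatenated along the feature axis are illustrative assumptions, not details taken from the patent.

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, n_neural, n_speech, n_ema, hidden=128):
        super().__init__()
        # feature dimension-reduction module
        self.reduce = nn.Sequential(nn.Linear(n_neural, hidden), nn.ReLU())
        # decoding module: shared recurrent trunk with two output heads
        self.trunk = nn.GRU(hidden, hidden, batch_first=True)
        self.to_speech = nn.Linear(hidden, n_speech)   # mel-cepstrum + F0 + aperiodicity
        self.to_ema = nn.Linear(hidden, n_ema)         # articulatory trajectories

    def forward(self, x):                              # x: (batch, frames, n_neural)
        h, _ = self.trunk(self.reduce(x))
        return self.to_speech(h), self.to_ema(h)

class ConditionalDiscriminator(nn.Module):
    """Stacked 1-D convolutions scoring a target sequence given a condition."""
    def __init__(self, n_target, n_cond, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_target + n_cond, hidden, kernel_size=5, padding=2), nn.LeakyReLU(0.2),
            nn.Conv1d(hidden, hidden, kernel_size=5, padding=2), nn.LeakyReLU(0.2),
            nn.Conv1d(hidden, 1, kernel_size=1), nn.Sigmoid())

    def forward(self, target, cond):                   # (batch, frames, dim) each
        z = torch.cat([target, cond], dim=-1).transpose(1, 2)
        return self.net(z).mean(dim=-1)                # one realness score per sample
```

Under these assumptions the speech discriminator would be instantiated as ConditionalDiscriminator(n_speech, n_ema) and the EMA data discriminator as ConditionalDiscriminator(n_ema, n_speech).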
In step (4), the training of the generative adversarial network model is specifically as follows:

Samples are fed into the generator batch by batch to obtain the decoding result predicted by the generator, and the loss $L_G$ between the decoding result and the real target is computed. At the same time, with the real EMA data as the condition, the speech-feature part of the decoding result and the real speech features are respectively input into the speech discriminator to obtain the discrimination scores of the generated and real speech features, and the loss function $L_{D_1}$ between the scores and the labels is computed. With the real speech features as the condition, the EMA part of the decoding result and the real EMA data are respectively input into the EMA data discriminator to compute the loss function $L_{D_2}$. The three loss terms are weighted and summed to obtain the total loss function $L_{total}(\theta)$; the network parameters are updated by back-propagation with the adaptive moment estimation (Adam) optimization algorithm, and the iterations proceed in turn until the loss function $L_{total}(\theta)$ converges. Here $G$ denotes the generator, $D_1$ the speech discriminator, $D_2$ the EMA data discriminator, and $\theta$ the network parameters of the whole generative adversarial network.
In step (4), the speech discriminator $D_1$ and the EMA data discriminator $D_2$ separate the two different targets of the speech decoding task, so that the network discriminates the articulatory movement parameters and the speech features separately. This allows the closer mapping between neural features and EMA data to be exploited as a constraint on and complement to the mapping between neural features and speech features, which improves the learning capacity of the network, restores the high-frequency part of the speech features, provides more speech detail, and improves the speech decoding performance.
Compared with the prior art, the invention has the following beneficial effects. The invention develops a neural-signal speech decoding method based on a generative adversarial network, performs speech decoding on EEG and synthesizes intelligible vowel audio; the algorithm decodes and synthesizes continuous speech features, the decoding results are accurate, and the synthesized audio is intelligible. The model consists of a generator and a discriminator. The generator network consists of a dimension-reduction module and a decoding module, responsible for reducing the dimensionality of the neural features and generating the speech features, respectively; the discriminator network consists of stacked convolutional layers and is responsible for judging the authenticity of the speech features. The generative adversarial network adopted by the invention alleviates the over-smoothing of the speech features and improves the accuracy of the decoded speech features: experiments show that, for a subject whose native language is French, the average correlation coefficient of the decoded speech features is 0.51, the average mel-cepstral distortion is 4.64, and the average short-time objective intelligibility of the synthesized audio is 0.48. The method achieves high decoding accuracy and good intelligibility when decoding single phonemes. Moreover, compared with existing speech decoding algorithms, it markedly reduces the over-smoothing present in speech features decoded from EEG.
Drawings
Fig. 1 is a schematic diagram of the structure of the generative adversarial network.
Fig. 2 is a graph of the correlation coefficients of part of the speech features.
Fig. 3 is a flow chart of the electroencephalogram signal speech decoding method based on a generative adversarial network.
Detailed Description
To describe the present invention in more detail, the technical scheme of the invention is described below with reference to the accompanying drawings and specific embodiments.
As shown in Fig. 3, the electroencephalogram signal speech decoding method based on a generative adversarial network of the present invention comprises the following steps:
(1) Preprocessing the acquired raw EEG signals with a time-frequency analysis method, and extracting the power spectral densities of different frequency bands of the neural signals as the neural features;
(2) Preprocessing the speech signals in a public corpus and the subject's speech signal with a speech signal processing tool to obtain the mel-cepstrum features of the public-corpus speech signals and the mel-cepstrum features, fundamental frequency F0 and aperiodic part of the subject's speech signal as the speech features, and then aligning the speech features with the neural features;
(3) Establishing, for a specific public corpus, the optimal alignment path between the mel-cepstrum features of the public-corpus speech signals and those of the subject's speech signal with a dynamic time warping (DTW) algorithm, and inferring the subject's electromagnetic articulography (EMA) data from the EMA data in the public corpus;
(4) Using the aligned neural features as input data and the aligned speech features together with the subject's EMA data as the decoding targets, feeding them into a generative adversarial network model for training, and thereby constructing a generative adversarial network for speech decoding;
(5) Inputting the neural features of the test set into the trained generative adversarial network model, computing the decoded mel-cepstrum features, fundamental frequency F0, aperiodic part and EMA data, and synthesizing these speech features into a speech waveform with a vocoder.
In the preprocessing of step (1), the raw neural signal is first re-referenced with a common median reference algorithm; the neural signal is then low-pass filtered (to prevent aliasing) and downsampled; next, the time-frequency features of each channel's neural signal are extracted with a multi-taper power spectrum estimation algorithm; finally, the extracted time-frequency features are divided into several frequency bands of different bandwidths according to their center frequency.
The speech signal in step (2) consists of designated vowels or phrases from the public corpus read aloud repeatedly by the subject (lip-reading and silent-reading samples are not included) and is acquired synchronously with the neural data through a neural signal processor. The speech signals in the public corpus and the subject's speech signal are preprocessed with a speech signal processing tool to obtain the mel-cepstrum features of the public-corpus speech signals and the speech features of the subject's speech signal (comprising its mel-cepstrum features, fundamental frequency F0 and aperiodic part), and the speech features are then aligned with the neural features.
In step (3), the subject's EMA data are inferred with the DTW algorithm as follows. Let the mel-cepstrum features of the public-corpus speech signal be $X \in \mathbb{R}^{T_X \times d}$ and the mel-cepstrum features of the acquired subject's speech signal be $Y \in \mathbb{R}^{T_Y \times d}$, where $d$ is the feature dimension of the mel-cepstrum features and $T_X$, $T_Y$ are the sequence lengths of $X$ and $Y$, respectively. Denote the alignment path between $X$ and $Y$ generated by the DTW algorithm as $W = \{(a_k, b_k)\}_{k=1}^{K}$, where $(a_k, b_k)$ indicates that the $a_k$-th frame of $X$ is aligned with the $b_k$-th frame of $Y$. The alignment path is obtained by minimizing

$$\sum_{k=1}^{K} \lVert X_{a_k} - Y_{b_k} \rVert,$$

where $K$ is the length of the DTW path.

After $W$ is computed, the alignment index sequence of $X$ relative to $Y$ is obtained as

$$p(i) = a_{j^{*}}, \quad j^{*} = \min\{\, j : b_j \ge i \,\},$$

where $j$ is a temporary index. In this way, $p$ represents the alignment index between the sequences $X$ and $Y$: each point of the index sequence gives the index of the frame of $X$ aligned with the corresponding frame of $Y$. The EMA data of the public corpus are interpolated at the index sequence $p$ and used as the subject's EMA data.
The generative adversarial network for speech decoding in step (4) comprises a generator and a discriminator. The generator $G$ decodes the neural features into speech features and EMA data, and consists of a feature dimension-reduction module and a decoding module. The discriminator judges the authenticity of the generated speech features and EMA data, and comprises a speech discriminator $D_1$ and an EMA data discriminator $D_2$, which judge the speech features and the EMA data, respectively.
The specific process of training the generative adversarial network is as follows. Samples are fed into the generator batch by batch to obtain the decoding result predicted by the generator, and the loss $L_G$ between the decoding result and the real target is computed. At the same time, with the real EMA as the condition, the speech-feature part of the decoding result and the real speech features are respectively input into the speech discriminator to obtain the discrimination scores of the generated and real speech features, and the loss function $L_{D_1}$ between the scores and the labels is computed. Similarly, with the real speech features as the condition, the EMA part of the decoding result and the real EMA are respectively input into the EMA discriminator to compute the loss function $L_{D_2}$. The three loss terms are weighted and summed to obtain the total loss function $L_{total}(\theta)$; the network parameters are updated by back-propagation with the Adam optimization algorithm, and the iterations proceed in turn until the loss function $L_{total}(\theta)$ converges, where $\theta$ denotes the network parameters of the whole generative adversarial network.
The loss functions are expressed as follows:

$$\hat{y} = G(x),$$

$$L_{G}(\theta) = \lVert y - \hat{y} \rVert_2^2,$$

$$L_{D_1}(\theta) = -\log D_1(y \mid c) - \log\bigl(1 - D_1(\hat{y} \mid c)\bigr),$$

$$L_{D_2}(\theta) = -\log D_2(y \mid c) - \log\bigl(1 - D_2(\hat{y} \mid c)\bigr),$$

$$L_{total}(\theta) = \lambda\, L_{G}(\theta) + L_{D_1}(\theta) + L_{D_2}(\theta),$$

where $G$ denotes the generator, $D$ denotes either discriminator and $c$ is the condition of the current discriminator, $y$ is the real target data, $x$ is the input neural feature, $\hat{y}$ is the decoding result, $D_1$ is the speech discriminator (which takes the speech part of $y$ and $\hat{y}$ with the real EMA as its condition), $D_2$ is the EMA discriminator (which takes the EMA part with the real speech features as its condition), $\theta$ denotes the network parameters of the entire generative adversarial network, and $\lambda$ is the weight of the L2 loss $L_G$.
The invention can perform speech decoding on EEG signals and synthesize intelligible speech audio.
With reference to an example, the electroencephalogram signal speech decoding method based on a generative adversarial network of the invention comprises the following steps: (1) Data acquisition. 97 phrases and 11 vowels are selected from the public corpus BY2014; the experiment requires the subject to read the designated vowels or phrases aloud, and a Blackrock neural signal processor synchronously acquires and records the neural signals and speech data during the experiment. Each session lasts about 30 minutes.
The specific flow of the data acquisition experiment is as follows: first, a short sentence or a sequence of 2-3 vowels is randomly selected from the chosen content, and the subject reads the on-screen content aloud after seeing it; the screen then goes black for a random period of 0 to 1 s; next, a cross cue appears on the screen, at which point the subject must repeat aloud the content read previously; finally, after the screen goes black again for a period within 1 s, the procedure restarts from the first step.
(2) Preprocessing the neural signals and speech data acquired in step (1):
For the neural data, common median reference filtering is first applied to the raw neural signal; the neural signal is then filtered with a low-pass filter with a cut-off frequency of 500 Hz and downsampled from 30 kHz to 2 kHz; finally, a multi-taper power spectrum estimation algorithm extracts time-frequency features from the neural signal with a step size of 10 ms, the extracted features are divided according to center frequency into 21 frequency bands of 10 Hz bandwidth (0-10 Hz, 10-20 Hz, 20-30 Hz, ..., 190-200 Hz, 200-210 Hz), and the time-frequency power within each band is averaged and used as the neural feature of that band.
For the speech data, each sample is first segmented from the recorded continuous data stream: according to the speech waveform, the data from 500 ms before speech onset to 500 ms after speech offset in each trial are taken as one sample, and the neural features are segmented in the same way according to the time labels. Then 25-dimensional mel-cepstrum features and the 1-dimensional fundamental frequency (pitch) are extracted from each speech sample with the SPTK toolkit, and the 2-dimensional aperiodicity is extracted with the WORLD toolkit, yielding 28-dimensional speech features in total, again with a step size of 10 ms.
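A minimal sketch of this speech-feature extraction is given below, assuming the pyworld and pysptk Python bindings of the WORLD and SPTK toolkits rather than the command-line tools; the sampling rate, the all-pass constant alpha and the aperiodicity coding are illustrative assumptions, while the 25-dimensional mel-cepstrum, 1-dimensional F0 and 10 ms frame shift come from the description above.

```python
import numpy as np
import pyworld as pw
import pysptk

def extract_speech_features(wav, fs=16000, frame_period_ms=10.0, mcep_order=24, alpha=0.41):
    wav = wav.astype(np.float64)
    # fundamental frequency (the 1-dimensional "pitch")
    f0, t = pw.dio(wav, fs, frame_period=frame_period_ms)
    f0 = pw.stonemask(wav, f0, t, fs)
    # spectral envelope -> 25-dimensional mel-cepstrum (order 24 plus the energy term)
    sp = pw.cheaptrick(wav, f0, t, fs)
    mcep = pysptk.sp2mc(sp, order=mcep_order, alpha=alpha)
    # aperiodicity -> coded band aperiodicity (2-dimensional in the patent's setup)
    ap = pw.d4c(wav, f0, t, fs)
    bap = pw.code_aperiodicity(ap, fs)
    # concatenate per 10 ms frame: mel-cepstrum + F0 + band aperiodicity
    return np.hstack([mcep, f0[:, None], bap])
```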
(3) The subject's EMA data are inferred with the DTW algorithm. Let the mel-cepstrum features of the speech signals in the BY2014 corpus be $X \in \mathbb{R}^{T_X \times d}$ and the mel-cepstrum features of the subject's speech signal obtained in step (2) be $Y \in \mathbb{R}^{T_Y \times d}$, where $d$ is the feature dimension of the speech data and $T_X$, $T_Y$ are the sequence lengths of $X$ and $Y$, respectively. Denote the alignment path between $X$ and $Y$ generated by the DTW algorithm as $W = \{(a_k, b_k)\}_{k=1}^{K}$, where $(a_k, b_k)$ indicates that the $a_k$-th frame of $X$ is aligned with the $b_k$-th frame of $Y$. The alignment path is obtained by minimizing

$$\sum_{k=1}^{K} \lVert X_{a_k} - Y_{b_k} \rVert,$$

where $K$ is the length of the DTW path.

After $W$ is computed, the alignment index sequence of $X$ relative to $Y$ is obtained as

$$p(i) = a_{j^{*}}, \quad j^{*} = \min\{\, j : b_j \ge i \,\},$$

where $j$ is a temporary index. In this way, $p$ represents the alignment index between the sequences $X$ and $Y$: each point of the index sequence gives the index of the frame of $X$ aligned with the corresponding frame of $Y$. The EMA data in the BY2014 corpus are interpolated at the index sequence $p$ and used as the subject's EMA data.
(4) The neural features, speech features and the subject's EMA data obtained in the preceding steps are used to train the generative adversarial network SpeechGAN, whose structure is shown in Fig. 1. The training process is as follows:
first, neural time-frequency features are used
Figure SMS_126
Input to the generator->
Figure SMS_127
Is decoded by the speech feature->
Figure SMS_128
(including: mel-cepstrum, aperiodic excitation parameters and pitch) and electromagnetic angiography (EMA) data +.>
Figure SMS_129
Figure SMS_130
Then, respectively, the real voice data
Figure SMS_131
And EMA data->
Figure SMS_132
Is input to two condition discriminators (speech discriminator)
Figure SMS_133
And motion discriminator->
Figure SMS_134
Motion discriminator->
Figure SMS_135
I.e. electromagnetic angiography data arbiter):
Figure SMS_136
Figure SMS_137
obtained by
Figure SMS_138
And->
Figure SMS_139
The judgment values of the real voice data and the EMA data are respectively;
then, the generated voice data are respectively processed
Figure SMS_140
And EMA data->
Figure SMS_141
Input into two condition discriminators:
Figure SMS_142
Figure SMS_143
obtained by
Figure SMS_144
And->
Figure SMS_145
Respectively generating judgment values of voice data and EMA data;
calculating a loss function of the discriminator:
Figure SMS_146
Figure SMS_147
after the loss functions of the two discriminators are obtained through the method, the Adam optimizer is utilized to optimize the two discriminators
Figure SMS_148
Is updated. Speech feature to be decoded->
Figure SMS_149
And EMA data->
Figure SMS_150
Again input into the updated arbiter for decision:
Figure SMS_151
Figure SMS_152
wherein, the liquid crystal display device comprises a liquid crystal display device,
Figure SMS_153
and->
Figure SMS_154
Respectively representing the speech discriminator and the EMA discriminator after the update of the present round,/for each round>
Figure SMS_155
And->
Figure SMS_156
Respectively representing discrimination values obtained using the updated discriminators.
Calculating a loss function of the generator:
Figure SMS_157
wherein, the liquid crystal display device comprises a liquid crystal display device,
Figure SMS_158
representative generator->
Figure SMS_159
Weight lost for L2, +.>
Figure SMS_160
And true speech features and EMA, +.>
Figure SMS_161
For the generated speech features and EMA. The generator is obtained by the above method>
Figure SMS_162
After the loss function of the generator, updating the network weight of the generator by using an Adam optimizer to complete a round of training.
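A minimal PyTorch sketch of one such training iteration is given below, assuming the Generator and ConditionalDiscriminator modules sketched earlier, standard binary cross-entropy adversarial losses and an L2 reconstruction term; the weighting lambda_l2, the shared discriminator optimizer and the helper name train_step are illustrative assumptions, not values taken from the patent.

```python
import torch
import torch.nn.functional as F

def train_step(G, D_sp, D_ema, opt_G, opt_D, x, y_sp, y_ema, lambda_l2=10.0):
    # --- generator forward: neural features -> speech features + EMA ---
    y_sp_hat, y_ema_hat = G(x)

    # --- update the two conditional discriminators ---
    opt_D.zero_grad()
    real_sp  = D_sp(y_sp, cond=y_ema)                 # real speech | real EMA
    fake_sp  = D_sp(y_sp_hat.detach(), cond=y_ema)
    real_ema = D_ema(y_ema, cond=y_sp)                # real EMA | real speech
    fake_ema = D_ema(y_ema_hat.detach(), cond=y_sp)
    loss_D1 = F.binary_cross_entropy(real_sp, torch.ones_like(real_sp)) + \
              F.binary_cross_entropy(fake_sp, torch.zeros_like(fake_sp))
    loss_D2 = F.binary_cross_entropy(real_ema, torch.ones_like(real_ema)) + \
              F.binary_cross_entropy(fake_ema, torch.zeros_like(fake_ema))
    (loss_D1 + loss_D2).backward()
    opt_D.step()

    # --- update the generator against the freshly updated discriminators ---
    opt_G.zero_grad()
    adv_sp  = D_sp(y_sp_hat, cond=y_ema)
    adv_ema = D_ema(y_ema_hat, cond=y_sp)
    recon = F.mse_loss(y_sp_hat, y_sp) + F.mse_loss(y_ema_hat, y_ema)
    loss_G = lambda_l2 * recon + \
             F.binary_cross_entropy(adv_sp, torch.ones_like(adv_sp)) + \
             F.binary_cross_entropy(adv_ema, torch.ones_like(adv_ema))
    loss_G.backward()
    opt_G.step()
    return loss_G.item(), loss_D1.item(), loss_D2.item()
```

In this sketch opt_G and opt_D would be Adam optimizers, e.g. torch.optim.Adam(G.parameters()) and torch.optim.Adam(list(D_sp.parameters()) + list(D_ema.parameters())).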
(5) The data set contains 634 samples in total, and 10-fold cross-validation is used to determine the decoding performance of the generative adversarial network. During training, the training data are further divided into a training set and a validation set; the decoded mel-cepstrum features M, fundamental frequency F0, aperiodic part Amp and EMA (including the horizontal and vertical positions of the upper- and lower-lip marker points, among others) are computed, and the optimal model is determined by the accuracy of the decoding results on the validation set. Under 10-fold cross-validation, the Pearson correlation coefficient between the decoding results and the real data averages 0.51 and the mel-cepstral distortion averages 4.64. Fig. 2 shows the correlation coefficients between part of the decoding results and the real data, including: in the EMA, the horizontal position Ux and vertical position Uy of the upper-lip marker point and the horizontal position Dx and vertical position Dy of the lower-lip marker point; the fundamental frequency F0; the aperiodic-part feature AMP1; and the mel-cepstrum feature M0. The correlation coefficient of these features reaches more than 0.8 at best.
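A minimal sketch of the evaluation metrics reported above (per-dimension Pearson correlation and mel-cepstral distortion) is given below; the MCD constant and the exclusion of the 0-th cepstral coefficient are common conventions assumed here, not values specified by the patent.

```python
import numpy as np

def mean_correlation(pred, ref):
    """pred, ref: (frames, dims). Average Pearson correlation over feature dimensions."""
    return np.mean([np.corrcoef(pred[:, k], ref[:, k])[0, 1] for k in range(ref.shape[1])])

def mel_cepstral_distortion(mc_pred, mc_ref):
    """mc_*: (frames, order+1) mel-cepstra; returns the frame-averaged MCD in dB."""
    diff = mc_pred[:, 1:] - mc_ref[:, 1:]   # conventionally skip c0 (the energy term)
    return np.mean(10.0 / np.log(10.0) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1)))
```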
(6) The decoded mel-cepstrum features, F0 and aperiodicity obtained in step (5) are input into the WORLD vocoder to synthesize the speech waveform. The average short-time objective intelligibility of the synthesized speech is 0.48. Comparing the mel spectra of the real samples, of samples generated by a conventional GAN and of samples generated by the invention (SpeechGAN), the mel spectra generated by the invention are closer to the real samples than those generated by the conventional GAN; for both vowels and sentences, SpeechGAN characterizes the mel spectrum better, especially its high-frequency part, so the over-smoothing problem is effectively avoided, the speech synthesized from the decoding results is more faithful, and the decoding accuracy of the method and the intelligibility of the synthesized audio are better.
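A minimal sketch of this synthesis step is given below, assuming the pyworld binding of the WORLD vocoder and the feature layout of the extraction sketch above; the FFT size and the alpha value are illustrative assumptions.

```python
import numpy as np
import pyworld as pw
import pysptk

def synthesize(mcep, f0, bap, fs=16000, frame_period_ms=10.0, alpha=0.41, fft_size=1024):
    # mel-cepstrum -> spectral envelope, coded aperiodicity -> full aperiodicity
    sp = pysptk.mc2sp(np.ascontiguousarray(mcep), alpha=alpha, fftlen=fft_size)
    ap = pw.decode_aperiodicity(np.ascontiguousarray(bap), fs, fft_size)
    return pw.synthesize(np.ascontiguousarray(f0).astype(np.float64),
                         sp.astype(np.float64), ap, fs, frame_period=frame_period_ms)
```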
In summary, the invention develops a neural-signal speech decoding method based on a generative adversarial network; the algorithm decodes and synthesizes continuous speech features, the decoding results are accurate, and the synthesized audio is intelligible. Meanwhile, the generative adversarial network adopted by the invention alleviates the over-smoothing of the speech features.

Claims (7)

1. An electroencephalogram signal speech decoding method based on a generative adversarial network, characterized by comprising the following steps:
(1) Preprocessing the acquired raw electroencephalogram signals with a time-frequency analysis method, and taking the power spectral densities of different frequency bands of the electroencephalogram signals as the neural features;
(2) Preprocessing the speech signals in a public corpus and the subject's speech signal with a speech signal processing tool to obtain the mel-cepstrum features of the public-corpus speech signals and the mel-cepstrum features, fundamental frequency F0 and aperiodic part of the subject's speech signal as the speech features, and then aligning the speech features with the neural features;
(3) Establishing the optimal alignment path between the mel-cepstrum features of the public-corpus speech signals and those of the subject's speech signal with a dynamic time warping algorithm, and inferring the subject's electromagnetic articulography (EMA) data from the EMA data in the public corpus;
(4) Using the neural features aligned in step (2) as input data and the speech features aligned in step (2) together with the subject's EMA data from step (3) as the decoding targets, feeding them into a generative adversarial network model for training, and thereby constructing a generative adversarial network for speech decoding;
(5) Inputting the neural features of the test set into the generative adversarial network for speech decoding, computing the decoded mel-cepstrum features, fundamental frequency F0, aperiodic part and EMA data, and synthesizing them into a speech waveform with a vocoder.
2. The electroencephalogram signal speech decoding method according to claim 1, characterized in that in step (1) the preprocessing specifically comprises:
first filtering and downsampling the raw electroencephalogram signal of each channel; then extracting the power spectral density; and finally dividing the extracted power spectral density into several frequency bands of different bandwidths according to the center frequency, yielding the power spectral densities of the different frequency bands of the electroencephalogram signal.
3. The electroencephalogram signal speech decoding method according to claim 1, characterized in that in step (3) the optimal alignment path between the mel-cepstrum features of the public-corpus speech signal and the mel-cepstrum features of the subject's speech signal is established with a dynamic time warping algorithm, and the subject's electromagnetic articulography data are then inferred from the EMA data in the public corpus, as follows:

Let the mel-cepstrum features of the public-corpus speech signal be $X \in \mathbb{R}^{T_X \times d}$ and the mel-cepstrum features of the subject's speech signal be $Y \in \mathbb{R}^{T_Y \times d}$, where $d$ is the feature dimension of the mel-cepstrum features and $T_X$, $T_Y$ are the sequence lengths of $X$ and $Y$, respectively. Denote the alignment path between $X$ and $Y$ generated by the dynamic time warping algorithm as $W = \{(a_k, b_k)\}_{k=1}^{K}$, where $(a_k, b_k)$ indicates that the $a_k$-th frame of $X$ is aligned with the $b_k$-th frame of $Y$. The alignment path is obtained by minimizing

$$\sum_{k=1}^{K} \lVert X_{a_k} - Y_{b_k} \rVert,$$

where $K$ is the length of the dynamic time warping path. After the path is computed, the alignment index sequence $p$ of $X$ relative to $Y$ is obtained as

$$p(i) = a_{j^{*}}, \quad j^{*} = \min\{\, j : b_j \ge i \,\},$$

where $j$ is a temporary index and $p$ represents the alignment index between the sequences $X$ and $Y$. The EMA data in the public corpus are interpolated at the alignment index sequence $p$, generating the subject's EMA data.
4. The electroencephalogram signal speech decoding method according to claim 1, characterized in that in step (4) the generative adversarial network model comprises a generator and a discriminator.
5. The electroencephalogram signal speech decoding method according to claim 4, characterized in that in step (4) the generator is used to decode the neural features into speech features and electromagnetic articulography data, and comprises a feature dimension-reduction module and a decoding module.
6. The electroencephalogram signal speech decoding method according to claim 4, characterized in that in step (4) the discriminator is used to judge the authenticity of the generated speech features and electromagnetic articulography data, and comprises a speech discriminator and an EMA data discriminator for judging the speech features and the EMA data, respectively.
7. The electroencephalogram signal speech decoding method according to claim 1, characterized in that in step (4) the training of the generative adversarial network model is specifically as follows:

Samples are fed into the generator batch by batch to obtain the decoding result output by the generator, and the loss function $L_G$ between the decoding result and the real target is computed; with the real electromagnetic articulography data as the condition, the speech-feature part of the decoding result and the real speech features are respectively input into the speech discriminator to obtain the discrimination scores of the generated and real speech features, and the loss function $L_{D_1}$ between the scores and the labels is computed; with the real speech features as the condition, the EMA part of the decoding result and the real EMA data are respectively input into the EMA data discriminator to compute the loss function $L_{D_2}$; the three loss functions are weighted and summed to obtain the loss function $L_{total}(\theta)$, the network parameters are updated by back-propagation with the adaptive moment estimation optimization algorithm, and the iterations proceed in turn until the loss function $L_{total}(\theta)$ converges, wherein $G$ denotes the generator, $D_1$ is the speech discriminator, $D_2$ is the EMA data discriminator, and $\theta$ denotes the network parameters of the whole generative adversarial network.
CN202310220138.1A 2023-03-09 2023-03-09 Electroencephalogram signal speech decoding method based on a generative adversarial network Active CN116364096B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310220138.1A CN116364096B (en) Electroencephalogram signal speech decoding method based on a generative adversarial network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310220138.1A CN116364096B (en) Electroencephalogram signal speech decoding method based on a generative adversarial network

Publications (2)

Publication Number Publication Date
CN116364096A true CN116364096A (en) 2023-06-30
CN116364096B CN116364096B (en) 2023-11-28

Family

ID=86933972

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310220138.1A Active CN116364096B (en) Electroencephalogram signal speech decoding method based on a generative adversarial network

Country Status (1)

Country Link
CN (1) CN116364096B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117130490A (en) * 2023-10-26 2023-11-28 天津大学 Brain-computer interface control system, control method and implementation method thereof

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112001306A (en) * 2020-08-21 2020-11-27 西安交通大学 Electroencephalogram signal decoding method for generating neural network based on deep convolution countermeasure
CN113609988A (en) * 2021-08-06 2021-11-05 太原科技大学 End-to-end electroencephalogram signal decoding method for auditory induction
AU2021104767A4 (en) * 2021-07-31 2022-04-28 Kumar G S, Shashi Method for classification of human emotions based on selected scalp region eeg patterns by a neural network
CN115620751A (en) * 2022-10-14 2023-01-17 山西大学 Electroencephalogram signal prediction method based on speaker voice induction

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112001306A (en) * 2020-08-21 2020-11-27 西安交通大学 Electroencephalogram signal decoding method for generating neural network based on deep convolution countermeasure
AU2021104767A4 (en) * 2021-07-31 2022-04-28 Kumar G S, Shashi Method for classification of human emotions based on selected scalp region eeg patterns by a neural network
CN113609988A (en) * 2021-08-06 2021-11-05 太原科技大学 End-to-end electroencephalogram signal decoding method for auditory induction
CN115620751A (en) * 2022-10-14 2023-01-17 山西大学 Electroencephalogram signal prediction method based on speaker voice induction

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117130490A (en) * 2023-10-26 2023-11-28 天津大学 Brain-computer interface control system, control method and implementation method thereof
CN117130490B (en) * 2023-10-26 2024-01-26 天津大学 Brain-computer interface control system, control method and implementation method thereof

Also Published As

Publication number Publication date
CN116364096B (en) 2023-11-28


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant