CN116364096A - Electroencephalogram signal speech decoding method based on a generative adversarial network - Google Patents

Electroencephalogram signal speech decoding method based on a generative adversarial network

Info

Publication number
CN116364096A
Authority
CN
China
Prior art keywords
voice
decoding
data
mel
speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310220138.1A
Other languages
Chinese (zh)
Other versions
CN116364096B (en)
Inventor
张韶岷
刘腾俊
冉星辰
万子俊
李悦
郑能干
陈卫东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202310220138.1A priority Critical patent/CN116364096B/en
Publication of CN116364096A publication Critical patent/CN116364096A/en
Application granted granted Critical
Publication of CN116364096B publication Critical patent/CN116364096B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/0018Speech coding using phonetic or linguistical decoding of the source; Reconstruction using text-to-speech synthesis
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G10L19/16Vocoder architecture
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/66Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for extracting parameters related to health condition
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02ATECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Abstract

The invention discloses an electroencephalogram (EEG) signal speech decoding method based on a generative adversarial network (GAN), which decodes speech from EEG signals and synthesizes intelligible vowel audio. The method uses the subject's synchronously acquired EEG signals and speech data and, after effective preprocessing, lets the generative adversarial network learn the mapping from the EEG signals to the speech data, so that the over-smoothing problem is effectively alleviated and the audio synthesized from the decoded speech features is more intelligible. The model consists of a generator and a discriminator: the generator performs dimension reduction on the neural features and generates the speech features, while the discriminator judges the authenticity of the speech features. The method achieves high decoding accuracy and good intelligibility when decoding single phonemes. Moreover, compared with existing speech decoding algorithms, it markedly reduces the over-smoothing present in speech features decoded from EEG.

Description

Electroencephalogram signal speech decoding method based on a generative adversarial network
Technical Field
The invention belongs to the technical field of electroencephalogram (EEG) data analysis, and particularly relates to an EEG signal speech decoding method based on a generative adversarial network (GAN).
Background
In recent years, speech decoding technology in the field of brain-computer interfaces has developed rapidly. Current research on speech brain-computer interfaces follows two different technical routes. The first is discrete decoding, which classifies the recorded neural signals into corresponding speech representations, e.g. phonemes or text, within a finite set of classes. The second is continuous decoding, i.e. the neural signals are decoded directly into the acoustic features of speech or the motion trajectories of the vocal organs. Discrete speech-feature decoding is comparatively easy, but phonemes or words of similar categories are hard to distinguish and the classification set is limited. Continuous speech-feature decoding can output arbitrary speech and can synthesize emotional speech through features such as intonation and pauses, but decoding is harder and the accuracy requirements on the decoding results are higher.
Currently, the most common methods for continuous speech-feature decoding are neural networks, including long short-term memory (LSTM) networks, autoencoders and deep neural networks (DNN). Thanks to the rapid development of deep learning, these neural network approaches achieve higher accuracy than traditional machine learning models. However, they still face problems, and over-smoothing of the decoded speech features is one of the important ones: the distribution of the decoded speech features tends to converge near the mean of the training data, so high-frequency information is lost. A generative adversarial network can fit the data distribution better than these neural networks and can therefore effectively alleviate the over-smoothing problem, which is of great value for improving the quality of speech decoding and the intelligibility of the synthesized audio.
Disclosure of Invention
In view of the above drawbacks of the prior art, an object of the present invention is to provide an EEG signal speech decoding method based on a generative adversarial network. Using the subject's synchronously acquired EEG signals and speech data, after effective preprocessing the generative adversarial network learns the mapping from the EEG signals to the speech data, so that the over-smoothing problem is effectively alleviated and the audio synthesized from the decoded speech features is more intelligible.
The electroencephalogram signal speech decoding method based on a generative adversarial network is realized through the following steps:
(1) Preprocessing the acquired raw EEG signals with a time-frequency analysis method, and taking the power spectral densities of different frequency bands of the EEG signals as the neural features;
(2) Preprocessing the speech signals in a public corpus and the subject's speech signal with a speech signal processing tool to obtain the mel-cepstrum features of the public-corpus speech signals and the mel-cepstrum features, fundamental frequency F0 and aperiodic part of the subject's speech signal as the speech features, and then aligning the speech features with the neural features;
(3) Establishing the optimal alignment path between the mel-cepstrum features of the public-corpus speech signals and those of the subject's speech signal with a dynamic time warping algorithm, and inferring the subject's electromagnetic articulography (EMA) data from the EMA data in the public corpus;
(4) Using the neural features aligned in step (2) as input data and the speech features aligned in step (2) together with the subject's EMA data from step (3) as the decoding targets, feeding them into a generative adversarial network model for training, and thereby constructing a generative adversarial network for speech decoding;
(5) Inputting the neural features of the test set into the generative adversarial network for speech decoding, computing the decoded mel-cepstrum features, fundamental frequency F0, aperiodic part and EMA data, and synthesizing them into a speech waveform with a vocoder.
In step (1), the preprocessing specifically comprises:
first filtering and downsampling the raw EEG signal of each channel; then extracting the power spectral density; and finally dividing the extracted power spectral density into several frequency bands of different bandwidths according to the center frequency, yielding the power spectral densities of the different frequency bands of the EEG signal.
More specifically, the raw neural signal is first re-referenced with a common median reference algorithm; the neural signal is then low-pass filtered (to prevent aliasing) and downsampled; next, the time-frequency features of each channel's neural signal are extracted with a multi-taper power spectrum estimation algorithm; finally, the extracted time-frequency features are divided into several frequency bands of different bandwidths according to their center frequency.
The speech signals in step (2) include the speech signals in the public corpus and the subject's speech signal. The subject's speech signal consists of designated vowels or phrases from the public corpus read aloud repeatedly by the subject (lip-reading and silent-reading samples are not included) and is acquired synchronously with the neural data through a neural signal processor. The speech signals in the public corpus and the subject's speech signal are then preprocessed with a speech signal processing tool to obtain the mel-cepstrum features of the public-corpus speech signals and the mel-cepstrum features, fundamental frequency F0 and aperiodic part of the subject's speech signal as the speech features, after which the speech features are aligned with the neural features.
In step (3), the optimal alignment path between the mel-cepstrum features of the public-corpus speech signal and the mel-cepstrum features of the subject's speech signal is established with a dynamic time warping algorithm, and the subject's electromagnetic articulography data are then inferred from the EMA data in the public corpus, as follows:

Let the mel-cepstrum features of the public-corpus speech signal be $X \in \mathbb{R}^{T_X \times d}$ and the mel-cepstrum features of the subject's speech signal be $Y \in \mathbb{R}^{T_Y \times d}$, where $d$ is the feature dimension of the mel-cepstrum features and $T_X$, $T_Y$ are the sequence lengths of $X$ and $Y$, respectively. Denote the alignment path between $X$ and $Y$ produced by the dynamic time warping algorithm as $W = \{(a_k, b_k)\}_{k=1}^{K}$, where $(a_k, b_k)$ indicates that the $a_k$-th frame of $X$ is aligned with the $b_k$-th frame of $Y$. The alignment path between $X$ and $Y$ is obtained by minimizing

$$\sum_{k=1}^{K} \lVert X_{a_k} - Y_{b_k} \rVert,$$

where $K$ is the length of the dynamic time warping path. After $W$ is computed, the alignment index sequence $p$ of $X$ relative to $Y$ is obtained as

$$p(i) = a_{j^{*}}, \quad j^{*} = \min\{\, j : b_j \ge i \,\},$$

where $j$ is a temporary index and the smallest $j$ satisfying $b_j \ge i$ is taken. The sequence $p$ represents the alignment index between the sequences $X$ and $Y$: for each frame $i$ of $Y$ it gives the index of the frame of $X$ aligned with it. The EMA data in the public corpus are interpolated at the alignment index sequence $p$, generating the subject's EMA data.
In step (4), the generative adversarial network model comprises a generator and a discriminator.

The generator $G$ comprises a feature dimension-reduction module and a decoding module, and is used to decode the neural features into speech features and EMA data.

The discriminator is used to judge the authenticity of the generated speech features and EMA data; it comprises a speech discriminator $D_1$ and an EMA data discriminator $D_2$, which judge the speech features and the EMA data, respectively.
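The patent specifies the generator as a feature dimension-reduction module followed by a decoding module, and the discriminators as conditional networks built from stacked convolutional layers (see the beneficial effects below). The following minimal PyTorch sketch illustrates one possible layout under those constraints; the hidden sizes, the GRU trunk of the decoding module and the way the condition is concatenated along the feature axis are illustrative assumptions, not details taken from the patent.

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, n_neural, n_speech, n_ema, hidden=128):
        super().__init__()
        # feature dimension-reduction module
        self.reduce = nn.Sequential(nn.Linear(n_neural, hidden), nn.ReLU())
        # decoding module: shared recurrent trunk with two output heads
        self.trunk = nn.GRU(hidden, hidden, batch_first=True)
        self.to_speech = nn.Linear(hidden, n_speech)   # mel-cepstrum + F0 + aperiodicity
        self.to_ema = nn.Linear(hidden, n_ema)         # articulatory trajectories

    def forward(self, x):                              # x: (batch, frames, n_neural)
        h, _ = self.trunk(self.reduce(x))
        return self.to_speech(h), self.to_ema(h)

class ConditionalDiscriminator(nn.Module):
    """Stacked 1-D convolutions scoring a target sequence given a condition."""
    def __init__(self, n_target, n_cond, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_target + n_cond, hidden, kernel_size=5, padding=2), nn.LeakyReLU(0.2),
            nn.Conv1d(hidden, hidden, kernel_size=5, padding=2), nn.LeakyReLU(0.2),
            nn.Conv1d(hidden, 1, kernel_size=1), nn.Sigmoid())

    def forward(self, target, cond):                   # (batch, frames, dim) each
        z = torch.cat([target, cond], dim=-1).transpose(1, 2)
        return self.net(z).mean(dim=-1)                # one realness score per sample
```

Under these assumptions the speech discriminator would be instantiated as ConditionalDiscriminator(n_speech, n_ema) and the EMA data discriminator as ConditionalDiscriminator(n_ema, n_speech).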
In step (4), the training of the generative adversarial network model is specifically as follows:

Samples are fed into the generator batch by batch to obtain the decoding result predicted by the generator, and the loss $L_G$ between the decoding result and the real target is computed. At the same time, with the real EMA data as the condition, the speech-feature part of the decoding result and the real speech features are respectively input into the speech discriminator to obtain the discrimination scores of the generated and real speech features, and the loss function $L_{D_1}$ between the scores and the labels is computed. With the real speech features as the condition, the EMA part of the decoding result and the real EMA data are respectively input into the EMA data discriminator to compute the loss function $L_{D_2}$. The three loss terms are weighted and summed to obtain the total loss function $L_{total}(\theta)$; the network parameters are updated by back-propagation with the adaptive moment estimation (Adam) optimization algorithm, and the iterations proceed in turn until the loss function $L_{total}(\theta)$ converges. Here $G$ denotes the generator, $D_1$ the speech discriminator, $D_2$ the EMA data discriminator, and $\theta$ the network parameters of the whole generative adversarial network.
In step (4), the speech discriminator $D_1$ and the EMA data discriminator $D_2$ separate the two different targets of the speech decoding task, so that the network discriminates the articulatory movement parameters and the speech features separately. This allows the closer mapping between neural features and EMA data to be exploited as a constraint on and complement to the mapping between neural features and speech features, which improves the learning capacity of the network, restores the high-frequency part of the speech features, provides more speech detail, and improves the speech decoding performance.
Compared with the prior art, the invention has the following beneficial effects. The invention develops a neural-signal speech decoding method based on a generative adversarial network, performs speech decoding on EEG and synthesizes intelligible vowel audio; the algorithm decodes and synthesizes continuous speech features, the decoding results are accurate, and the synthesized audio is intelligible. The model consists of a generator and a discriminator. The generator network consists of a dimension-reduction module and a decoding module, responsible for reducing the dimensionality of the neural features and generating the speech features, respectively; the discriminator network consists of stacked convolutional layers and is responsible for judging the authenticity of the speech features. The generative adversarial network adopted by the invention alleviates the over-smoothing of the speech features and improves the accuracy of the decoded speech features: experiments show that, for a subject whose native language is French, the average correlation coefficient of the decoded speech features is 0.51, the average mel-cepstral distortion is 4.64, and the average short-time objective intelligibility of the synthesized audio is 0.48. The method achieves high decoding accuracy and good intelligibility when decoding single phonemes. Moreover, compared with existing speech decoding algorithms, it markedly reduces the over-smoothing present in speech features decoded from EEG.
Drawings
Fig. 1 is a schematic diagram of the structure of the generative adversarial network.
Fig. 2 is a graph of the correlation coefficients of part of the speech features.
Fig. 3 is a flow chart of the electroencephalogram signal speech decoding method based on a generative adversarial network.
Detailed Description
To describe the present invention in more detail, the technical scheme of the invention is described below with reference to the accompanying drawings and specific embodiments.
As shown in Fig. 3, the electroencephalogram signal speech decoding method based on a generative adversarial network of the present invention comprises the following steps:
(1) Preprocessing the acquired raw EEG signals with a time-frequency analysis method, and extracting the power spectral densities of different frequency bands of the neural signals as the neural features;
(2) Preprocessing the speech signals in a public corpus and the subject's speech signal with a speech signal processing tool to obtain the mel-cepstrum features of the public-corpus speech signals and the mel-cepstrum features, fundamental frequency F0 and aperiodic part of the subject's speech signal as the speech features, and then aligning the speech features with the neural features;
(3) Establishing, for a specific public corpus, the optimal alignment path between the mel-cepstrum features of the public-corpus speech signals and those of the subject's speech signal with a dynamic time warping (DTW) algorithm, and inferring the subject's electromagnetic articulography (EMA) data from the EMA data in the public corpus;
(4) Using the aligned neural features as input data and the aligned speech features together with the subject's EMA data as the decoding targets, feeding them into a generative adversarial network model for training, and thereby constructing a generative adversarial network for speech decoding;
(5) Inputting the neural features of the test set into the trained generative adversarial network model, computing the decoded mel-cepstrum features, fundamental frequency F0, aperiodic part and EMA data, and synthesizing these speech features into a speech waveform with a vocoder.
In the preprocessing of step (1), the raw neural signal is first re-referenced with a common median reference algorithm; the neural signal is then low-pass filtered (to prevent aliasing) and downsampled; next, the time-frequency features of each channel's neural signal are extracted with a multi-taper power spectrum estimation algorithm; finally, the extracted time-frequency features are divided into several frequency bands of different bandwidths according to their center frequency.
The speech signal in step (2) consists of designated vowels or phrases from the public corpus read aloud repeatedly by the subject (lip-reading and silent-reading samples are not included) and is acquired synchronously with the neural data through a neural signal processor. The speech signals in the public corpus and the subject's speech signal are preprocessed with a speech signal processing tool to obtain the mel-cepstrum features of the public-corpus speech signals and the speech features of the subject's speech signal (comprising its mel-cepstrum features, fundamental frequency F0 and aperiodic part), and the speech features are then aligned with the neural features.
In step (3), the subject's EMA data are inferred with the DTW algorithm as follows. Let the mel-cepstrum features of the public-corpus speech signal be $X \in \mathbb{R}^{T_X \times d}$ and the mel-cepstrum features of the acquired subject's speech signal be $Y \in \mathbb{R}^{T_Y \times d}$, where $d$ is the feature dimension of the mel-cepstrum features and $T_X$, $T_Y$ are the sequence lengths of $X$ and $Y$, respectively. Denote the alignment path between $X$ and $Y$ generated by the DTW algorithm as $W = \{(a_k, b_k)\}_{k=1}^{K}$, where $(a_k, b_k)$ indicates that the $a_k$-th frame of $X$ is aligned with the $b_k$-th frame of $Y$. The alignment path is obtained by minimizing

$$\sum_{k=1}^{K} \lVert X_{a_k} - Y_{b_k} \rVert,$$

where $K$ is the length of the DTW path.

After $W$ is computed, the alignment index sequence of $X$ relative to $Y$ is obtained as

$$p(i) = a_{j^{*}}, \quad j^{*} = \min\{\, j : b_j \ge i \,\},$$

where $j$ is a temporary index. In this way, $p$ represents the alignment index between the sequences $X$ and $Y$: each point of the index sequence gives the index of the frame of $X$ aligned with the corresponding frame of $Y$. The EMA data of the public corpus are interpolated at the index sequence $p$ and used as the subject's EMA data.
The generative adversarial network for speech decoding in step (4) comprises a generator and a discriminator. The generator $G$ decodes the neural features into speech features and EMA data, and consists of a feature dimension-reduction module and a decoding module. The discriminator judges the authenticity of the generated speech features and EMA data, and comprises a speech discriminator $D_1$ and an EMA data discriminator $D_2$, which judge the speech features and the EMA data, respectively.
The specific process of training the generative adversarial network is as follows. Samples are fed into the generator batch by batch to obtain the decoding result predicted by the generator, and the loss $L_G$ between the decoding result and the real target is computed. At the same time, with the real EMA as the condition, the speech-feature part of the decoding result and the real speech features are respectively input into the speech discriminator to obtain the discrimination scores of the generated and real speech features, and the loss function $L_{D_1}$ between the scores and the labels is computed. Similarly, with the real speech features as the condition, the EMA part of the decoding result and the real EMA are respectively input into the EMA discriminator to compute the loss function $L_{D_2}$. The three loss terms are weighted and summed to obtain the total loss function $L_{total}(\theta)$; the network parameters are updated by back-propagation with the Adam optimization algorithm, and the iterations proceed in turn until the loss function $L_{total}(\theta)$ converges, where $\theta$ denotes the network parameters of the whole generative adversarial network.
The loss functions are expressed as follows:

$$\hat{y} = G(x),$$

$$L_{G}(\theta) = \lVert y - \hat{y} \rVert_2^2,$$

$$L_{D_1}(\theta) = -\log D_1(y \mid c) - \log\bigl(1 - D_1(\hat{y} \mid c)\bigr),$$

$$L_{D_2}(\theta) = -\log D_2(y \mid c) - \log\bigl(1 - D_2(\hat{y} \mid c)\bigr),$$

$$L_{total}(\theta) = \lambda\, L_{G}(\theta) + L_{D_1}(\theta) + L_{D_2}(\theta),$$

where $G$ denotes the generator, $D$ denotes either discriminator and $c$ is the condition of the current discriminator, $y$ is the real target data, $x$ is the input neural feature, $\hat{y}$ is the decoding result, $D_1$ is the speech discriminator (which takes the speech part of $y$ and $\hat{y}$ with the real EMA as its condition), $D_2$ is the EMA discriminator (which takes the EMA part with the real speech features as its condition), $\theta$ denotes the network parameters of the entire generative adversarial network, and $\lambda$ is the weight of the L2 loss $L_G$.
The invention can perform speech decoding on EEG signals and synthesize intelligible speech audio.
With reference to an example, the electroencephalogram signal speech decoding method based on a generative adversarial network of the invention comprises the following steps: (1) Data acquisition. 97 phrases and 11 vowels are selected from the public corpus BY2014; the experiment requires the subject to read the designated vowels or phrases aloud, and a Blackrock neural signal processor synchronously acquires and records the neural signals and speech data during the experiment. Each session lasts about 30 minutes.
The specific flow of the data acquisition experiment is as follows: first, a short sentence or a sequence of 2-3 vowels is randomly selected from the chosen content, and the subject reads the on-screen content aloud after seeing it; the screen then goes black for a random period of 0 to 1 s; next, a cross cue appears on the screen, at which point the subject must repeat aloud the content read previously; finally, after the screen goes black again for a period within 1 s, the procedure restarts from the first step.
(2) Preprocessing the neural signals and speech data acquired in step (1):
For the neural data, common median reference filtering is first applied to the raw neural signal; the neural signal is then filtered with a low-pass filter with a cut-off frequency of 500 Hz and downsampled from 30 kHz to 2 kHz; finally, a multi-taper power spectrum estimation algorithm extracts time-frequency features from the neural signal with a step size of 10 ms, the extracted features are divided according to center frequency into 21 frequency bands of 10 Hz bandwidth (0-10 Hz, 10-20 Hz, 20-30 Hz, ..., 190-200 Hz, 200-210 Hz), and the time-frequency power within each band is averaged and used as the neural feature of that band.
For the speech data, each sample is first segmented from the recorded continuous data stream: according to the speech waveform, the data from 500 ms before speech onset to 500 ms after speech offset in each trial are taken as one sample, and the neural features are segmented in the same way according to the time labels. Then 25-dimensional mel-cepstrum features and the 1-dimensional fundamental frequency (pitch) are extracted from each speech sample with the SPTK toolkit, and the 2-dimensional aperiodicity is extracted with the WORLD toolkit, yielding 28-dimensional speech features in total, again with a step size of 10 ms.
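A minimal sketch of this speech-feature extraction is given below, assuming the pyworld and pysptk Python bindings of the WORLD and SPTK toolkits rather than the command-line tools; the sampling rate, the all-pass constant alpha and the aperiodicity coding are illustrative assumptions, while the 25-dimensional mel-cepstrum, 1-dimensional F0 and 10 ms frame shift come from the description above.

```python
import numpy as np
import pyworld as pw
import pysptk

def extract_speech_features(wav, fs=16000, frame_period_ms=10.0, mcep_order=24, alpha=0.41):
    wav = wav.astype(np.float64)
    # fundamental frequency (the 1-dimensional "pitch")
    f0, t = pw.dio(wav, fs, frame_period=frame_period_ms)
    f0 = pw.stonemask(wav, f0, t, fs)
    # spectral envelope -> 25-dimensional mel-cepstrum (order 24 plus the energy term)
    sp = pw.cheaptrick(wav, f0, t, fs)
    mcep = pysptk.sp2mc(sp, order=mcep_order, alpha=alpha)
    # aperiodicity -> coded band aperiodicity (2-dimensional in the patent's setup)
    ap = pw.d4c(wav, f0, t, fs)
    bap = pw.code_aperiodicity(ap, fs)
    # concatenate per 10 ms frame: mel-cepstrum + F0 + band aperiodicity
    return np.hstack([mcep, f0[:, None], bap])
```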
(3) The subject's EMA data are inferred with the DTW algorithm. Let the mel-cepstrum features of the speech signals in the BY2014 corpus be $X \in \mathbb{R}^{T_X \times d}$ and the mel-cepstrum features of the subject's speech signal obtained in step (2) be $Y \in \mathbb{R}^{T_Y \times d}$, where $d$ is the feature dimension of the speech data and $T_X$, $T_Y$ are the sequence lengths of $X$ and $Y$, respectively. Denote the alignment path between $X$ and $Y$ generated by the DTW algorithm as $W = \{(a_k, b_k)\}_{k=1}^{K}$, where $(a_k, b_k)$ indicates that the $a_k$-th frame of $X$ is aligned with the $b_k$-th frame of $Y$. The alignment path is obtained by minimizing

$$\sum_{k=1}^{K} \lVert X_{a_k} - Y_{b_k} \rVert,$$

where $K$ is the length of the DTW path.

After $W$ is computed, the alignment index sequence of $X$ relative to $Y$ is obtained as

$$p(i) = a_{j^{*}}, \quad j^{*} = \min\{\, j : b_j \ge i \,\},$$

where $j$ is a temporary index. In this way, $p$ represents the alignment index between the sequences $X$ and $Y$: each point of the index sequence gives the index of the frame of $X$ aligned with the corresponding frame of $Y$. The EMA data in the BY2014 corpus are interpolated at the index sequence $p$ and used as the subject's EMA data.
(4) The neural features, speech features and the subject's EMA data obtained in the preceding steps are used to train the generative adversarial network SpeechGAN, whose structure is shown in Fig. 1. The training process is as follows:
first, neural time-frequency features are used
Figure SMS_126
Input to the generator->
Figure SMS_127
Is decoded by the speech feature->
Figure SMS_128
(including: mel-cepstrum, aperiodic excitation parameters and pitch) and electromagnetic angiography (EMA) data +.>
Figure SMS_129
Figure SMS_130
Then, respectively, the real voice data
Figure SMS_131
And EMA data->
Figure SMS_132
Is input to two condition discriminators (speech discriminator)
Figure SMS_133
And motion discriminator->
Figure SMS_134
Motion discriminator->
Figure SMS_135
I.e. electromagnetic angiography data arbiter):
Figure SMS_136
Figure SMS_137
obtained by
Figure SMS_138
And->
Figure SMS_139
The judgment values of the real voice data and the EMA data are respectively;
then, the generated voice data are respectively processed
Figure SMS_140
And EMA data->
Figure SMS_141
Input into two condition discriminators:
Figure SMS_142
Figure SMS_143
obtained by
Figure SMS_144
And->
Figure SMS_145
Respectively generating judgment values of voice data and EMA data;
calculating a loss function of the discriminator:
Figure SMS_146
Figure SMS_147
after the loss functions of the two discriminators are obtained through the method, the Adam optimizer is utilized to optimize the two discriminators
Figure SMS_148
Is updated. Speech feature to be decoded->
Figure SMS_149
And EMA data->
Figure SMS_150
Again input into the updated arbiter for decision:
Figure SMS_151
Figure SMS_152
wherein, the liquid crystal display device comprises a liquid crystal display device,
Figure SMS_153
and->
Figure SMS_154
Respectively representing the speech discriminator and the EMA discriminator after the update of the present round,/for each round>
Figure SMS_155
And->
Figure SMS_156
Respectively representing discrimination values obtained using the updated discriminators.
Calculating a loss function of the generator:
Figure SMS_157
wherein, the liquid crystal display device comprises a liquid crystal display device,
Figure SMS_158
representative generator->
Figure SMS_159
Weight lost for L2, +.>
Figure SMS_160
And true speech features and EMA, +.>
Figure SMS_161
For the generated speech features and EMA. The generator is obtained by the above method>
Figure SMS_162
After the loss function of the generator, updating the network weight of the generator by using an Adam optimizer to complete a round of training.
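A minimal PyTorch sketch of one such training iteration is given below, assuming the Generator and ConditionalDiscriminator modules sketched earlier, standard binary cross-entropy adversarial losses and an L2 reconstruction term; the weighting lambda_l2, the shared discriminator optimizer and the helper name train_step are illustrative assumptions, not values taken from the patent.

```python
import torch
import torch.nn.functional as F

def train_step(G, D_sp, D_ema, opt_G, opt_D, x, y_sp, y_ema, lambda_l2=10.0):
    # --- generator forward: neural features -> speech features + EMA ---
    y_sp_hat, y_ema_hat = G(x)

    # --- update the two conditional discriminators ---
    opt_D.zero_grad()
    real_sp  = D_sp(y_sp, cond=y_ema)                 # real speech | real EMA
    fake_sp  = D_sp(y_sp_hat.detach(), cond=y_ema)
    real_ema = D_ema(y_ema, cond=y_sp)                # real EMA | real speech
    fake_ema = D_ema(y_ema_hat.detach(), cond=y_sp)
    loss_D1 = F.binary_cross_entropy(real_sp, torch.ones_like(real_sp)) + \
              F.binary_cross_entropy(fake_sp, torch.zeros_like(fake_sp))
    loss_D2 = F.binary_cross_entropy(real_ema, torch.ones_like(real_ema)) + \
              F.binary_cross_entropy(fake_ema, torch.zeros_like(fake_ema))
    (loss_D1 + loss_D2).backward()
    opt_D.step()

    # --- update the generator against the freshly updated discriminators ---
    opt_G.zero_grad()
    adv_sp  = D_sp(y_sp_hat, cond=y_ema)
    adv_ema = D_ema(y_ema_hat, cond=y_sp)
    recon = F.mse_loss(y_sp_hat, y_sp) + F.mse_loss(y_ema_hat, y_ema)
    loss_G = lambda_l2 * recon + \
             F.binary_cross_entropy(adv_sp, torch.ones_like(adv_sp)) + \
             F.binary_cross_entropy(adv_ema, torch.ones_like(adv_ema))
    loss_G.backward()
    opt_G.step()
    return loss_G.item(), loss_D1.item(), loss_D2.item()
```

In this sketch opt_G and opt_D would be Adam optimizers, e.g. torch.optim.Adam(G.parameters()) and torch.optim.Adam(list(D_sp.parameters()) + list(D_ema.parameters())).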
(5) The data set contains 634 samples in total, and 10-fold cross-validation is used to determine the decoding performance of the generative adversarial network. During training, the training data are further divided into a training set and a validation set; the decoded mel-cepstrum features M, fundamental frequency F0, aperiodic part Amp and EMA (including the horizontal and vertical positions of the upper- and lower-lip marker points, among others) are computed, and the optimal model is determined by the accuracy of the decoding results on the validation set. Under 10-fold cross-validation, the Pearson correlation coefficient between the decoding results and the real data averages 0.51 and the mel-cepstral distortion averages 4.64. Fig. 2 shows the correlation coefficients between part of the decoding results and the real data, including: in the EMA, the horizontal position Ux and vertical position Uy of the upper-lip marker point and the horizontal position Dx and vertical position Dy of the lower-lip marker point; the fundamental frequency F0; the aperiodic-part feature AMP1; and the mel-cepstrum feature M0. The correlation coefficient of these features reaches more than 0.8 at best.
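A minimal sketch of the evaluation metrics reported above (per-dimension Pearson correlation and mel-cepstral distortion) is given below; the MCD constant and the exclusion of the 0-th cepstral coefficient are common conventions assumed here, not values specified by the patent.

```python
import numpy as np

def mean_correlation(pred, ref):
    """pred, ref: (frames, dims). Average Pearson correlation over feature dimensions."""
    return np.mean([np.corrcoef(pred[:, k], ref[:, k])[0, 1] for k in range(ref.shape[1])])

def mel_cepstral_distortion(mc_pred, mc_ref):
    """mc_*: (frames, order+1) mel-cepstra; returns the frame-averaged MCD in dB."""
    diff = mc_pred[:, 1:] - mc_ref[:, 1:]   # conventionally skip c0 (the energy term)
    return np.mean(10.0 / np.log(10.0) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1)))
```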
(6) The decoded mel-cepstrum features, F0 and aperiodicity obtained in step (5) are input into the WORLD vocoder to synthesize the speech waveform. The average short-time objective intelligibility of the synthesized speech is 0.48. Comparing the mel spectra of the real samples, of samples generated by a conventional GAN and of samples generated by the invention (SpeechGAN), the mel spectra generated by the invention are closer to the real samples than those generated by the conventional GAN; for both vowels and sentences, SpeechGAN characterizes the mel spectrum better, especially its high-frequency part, so the over-smoothing problem is effectively avoided, the speech synthesized from the decoding results is more faithful, and the decoding accuracy of the method and the intelligibility of the synthesized audio are better.
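A minimal sketch of this synthesis step is given below, assuming the pyworld binding of the WORLD vocoder and the feature layout of the extraction sketch above; the FFT size and the alpha value are illustrative assumptions.

```python
import numpy as np
import pyworld as pw
import pysptk

def synthesize(mcep, f0, bap, fs=16000, frame_period_ms=10.0, alpha=0.41, fft_size=1024):
    # mel-cepstrum -> spectral envelope, coded aperiodicity -> full aperiodicity
    sp = pysptk.mc2sp(np.ascontiguousarray(mcep), alpha=alpha, fftlen=fft_size)
    ap = pw.decode_aperiodicity(np.ascontiguousarray(bap), fs, fft_size)
    return pw.synthesize(np.ascontiguousarray(f0).astype(np.float64),
                         sp.astype(np.float64), ap, fs, frame_period=frame_period_ms)
```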
In summary, the invention develops a neural-signal speech decoding method based on a generative adversarial network; the algorithm decodes and synthesizes continuous speech features, the decoding results are accurate, and the synthesized audio is intelligible. Meanwhile, the generative adversarial network adopted by the invention alleviates the over-smoothing of the speech features.

Claims (7)

1. An electroencephalogram signal speech decoding method based on a generative adversarial network, characterized by comprising the following steps:
(1) Preprocessing the acquired raw electroencephalogram signals with a time-frequency analysis method, and taking the power spectral densities of different frequency bands of the electroencephalogram signals as the neural features;
(2) Preprocessing the speech signals in a public corpus and the subject's speech signal with a speech signal processing tool to obtain the mel-cepstrum features of the public-corpus speech signals and the mel-cepstrum features, fundamental frequency F0 and aperiodic part of the subject's speech signal as the speech features, and then aligning the speech features with the neural features;
(3) Establishing the optimal alignment path between the mel-cepstrum features of the public-corpus speech signals and those of the subject's speech signal with a dynamic time warping algorithm, and inferring the subject's electromagnetic articulography (EMA) data from the EMA data in the public corpus;
(4) Using the neural features aligned in step (2) as input data and the speech features aligned in step (2) together with the subject's EMA data from step (3) as the decoding targets, feeding them into a generative adversarial network model for training, and thereby constructing a generative adversarial network for speech decoding;
(5) Inputting the neural features of the test set into the generative adversarial network for speech decoding, computing the decoded mel-cepstrum features, fundamental frequency F0, aperiodic part and EMA data, and synthesizing them into a speech waveform with a vocoder.
2. The electroencephalogram signal speech decoding method according to claim 1, characterized in that in step (1) the preprocessing specifically comprises:
first filtering and downsampling the raw electroencephalogram signal of each channel; then extracting the power spectral density; and finally dividing the extracted power spectral density into several frequency bands of different bandwidths according to the center frequency, yielding the power spectral densities of the different frequency bands of the electroencephalogram signal.
3. The electroencephalogram signal speech decoding method according to claim 1, characterized in that in step (3) the optimal alignment path between the mel-cepstrum features of the public-corpus speech signal and the mel-cepstrum features of the subject's speech signal is established with a dynamic time warping algorithm, and the subject's electromagnetic articulography data are then inferred from the EMA data in the public corpus, as follows:

Let the mel-cepstrum features of the public-corpus speech signal be $X \in \mathbb{R}^{T_X \times d}$ and the mel-cepstrum features of the subject's speech signal be $Y \in \mathbb{R}^{T_Y \times d}$, where $d$ is the feature dimension of the mel-cepstrum features and $T_X$, $T_Y$ are the sequence lengths of $X$ and $Y$, respectively. Denote the alignment path between $X$ and $Y$ generated by the dynamic time warping algorithm as $W = \{(a_k, b_k)\}_{k=1}^{K}$, where $(a_k, b_k)$ indicates that the $a_k$-th frame of $X$ is aligned with the $b_k$-th frame of $Y$. The alignment path is obtained by minimizing

$$\sum_{k=1}^{K} \lVert X_{a_k} - Y_{b_k} \rVert,$$

where $K$ is the length of the dynamic time warping path. After the path is computed, the alignment index sequence $p$ of $X$ relative to $Y$ is obtained as

$$p(i) = a_{j^{*}}, \quad j^{*} = \min\{\, j : b_j \ge i \,\},$$

where $j$ is a temporary index and $p$ represents the alignment index between the sequences $X$ and $Y$. The EMA data in the public corpus are interpolated at the alignment index sequence $p$, generating the subject's EMA data.
4. The electroencephalogram signal speech decoding method according to claim 1, characterized in that in step (4) the generative adversarial network model comprises a generator and a discriminator.
5. The electroencephalogram signal speech decoding method according to claim 4, characterized in that in step (4) the generator is used to decode the neural features into speech features and electromagnetic articulography data, and comprises a feature dimension-reduction module and a decoding module.
6. The electroencephalogram signal speech decoding method according to claim 4, characterized in that in step (4) the discriminator is used to judge the authenticity of the generated speech features and electromagnetic articulography data, and comprises a speech discriminator and an EMA data discriminator for judging the speech features and the EMA data, respectively.
7. The electroencephalogram signal speech decoding method according to claim 1, characterized in that in step (4) the training of the generative adversarial network model is specifically as follows:

Samples are fed into the generator batch by batch to obtain the decoding result output by the generator, and the loss function $L_G$ between the decoding result and the real target is computed; with the real electromagnetic articulography data as the condition, the speech-feature part of the decoding result and the real speech features are respectively input into the speech discriminator to obtain the discrimination scores of the generated and real speech features, and the loss function $L_{D_1}$ between the scores and the labels is computed; with the real speech features as the condition, the EMA part of the decoding result and the real EMA data are respectively input into the EMA data discriminator to compute the loss function $L_{D_2}$; the three loss functions are weighted and summed to obtain the loss function $L_{total}(\theta)$, the network parameters are updated by back-propagation with the adaptive moment estimation optimization algorithm, and the iterations proceed in turn until the loss function $L_{total}(\theta)$ converges, wherein $G$ denotes the generator, $D_1$ is the speech discriminator, $D_2$ is the EMA data discriminator, and $\theta$ denotes the network parameters of the whole generative adversarial network.
CN202310220138.1A 2023-03-09 2023-03-09 Electroencephalogram signal speech decoding method based on a generative adversarial network Active CN116364096B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310220138.1A CN116364096B (en) Electroencephalogram signal speech decoding method based on a generative adversarial network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310220138.1A CN116364096B (en) Electroencephalogram signal speech decoding method based on a generative adversarial network

Publications (2)

Publication Number Publication Date
CN116364096A true CN116364096A (en) 2023-06-30
CN116364096B CN116364096B (en) 2023-11-28

Family

ID=86933972

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310220138.1A Active CN116364096B (en) Electroencephalogram signal speech decoding method based on a generative adversarial network

Country Status (1)

Country Link
CN (1) CN116364096B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117130490A (en) * 2023-10-26 2023-11-28 天津大学 Brain-computer interface control system, control method and implementation method thereof

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112001306A (en) * 2020-08-21 2020-11-27 西安交通大学 Electroencephalogram signal decoding method for generating neural network based on deep convolution countermeasure
CN113609988A (en) * 2021-08-06 2021-11-05 太原科技大学 End-to-end electroencephalogram signal decoding method for auditory induction
AU2021104767A4 (en) * 2021-07-31 2022-04-28 Kumar G S, Shashi Method for classification of human emotions based on selected scalp region eeg patterns by a neural network
CN115620751A (en) * 2022-10-14 2023-01-17 山西大学 Electroencephalogram signal prediction method based on speaker voice induction

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112001306A (en) * 2020-08-21 2020-11-27 西安交通大学 Electroencephalogram signal decoding method for generating neural network based on deep convolution countermeasure
AU2021104767A4 (en) * 2021-07-31 2022-04-28 Kumar G S, Shashi Method for classification of human emotions based on selected scalp region eeg patterns by a neural network
CN113609988A (en) * 2021-08-06 2021-11-05 太原科技大学 End-to-end electroencephalogram signal decoding method for auditory induction
CN115620751A (en) * 2022-10-14 2023-01-17 山西大学 Electroencephalogram signal prediction method based on speaker voice induction

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117130490A (en) * 2023-10-26 2023-11-28 天津大学 Brain-computer interface control system, control method and implementation method thereof
CN117130490B (en) * 2023-10-26 2024-01-26 天津大学 Brain-computer interface control system, control method and implementation method thereof

Also Published As

Publication number Publication date
CN116364096B (en) 2023-11-28


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant