CN116364096A - Electroencephalogram signal voice decoding method based on generation countermeasure network - Google Patents
Electroencephalogram signal voice decoding method based on generation countermeasure network
- Publication number
- CN116364096A CN116364096A CN202310220138.1A CN202310220138A CN116364096A CN 116364096 A CN116364096 A CN 116364096A CN 202310220138 A CN202310220138 A CN 202310220138A CN 116364096 A CN116364096 A CN 116364096A
- Authority
- CN
- China
- Prior art keywords
- voice
- decoding
- data
- mel
- speech
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
- 238000000034 method Methods 0.000 title claims abstract description 40
- 210000005036 nerve Anatomy 0.000 claims abstract description 21
- 238000007781 pre-processing Methods 0.000 claims abstract description 12
- 238000012545 processing Methods 0.000 claims abstract description 7
- 230000002194 synthesizing effect Effects 0.000 claims abstract description 3
- 230000001537 neural effect Effects 0.000 claims description 28
- 230000006870 function Effects 0.000 claims description 18
- 238000012549 training Methods 0.000 claims description 17
- 238000002601 radiography Methods 0.000 claims description 10
- 238000001228 spectrum Methods 0.000 claims description 9
- 230000003595 spectral effect Effects 0.000 claims description 7
- 238000001914 filtration Methods 0.000 claims description 6
- 238000002583 angiography Methods 0.000 claims description 4
- 238000004458 analytical method Methods 0.000 claims description 3
- 238000005457 optimization Methods 0.000 claims description 3
- 230000000737 periodic effect Effects 0.000 claims description 3
- 238000012360 testing method Methods 0.000 claims description 3
- 230000003044 adaptive effect Effects 0.000 claims description 2
- 210000004556 brain Anatomy 0.000 abstract description 8
- 238000013507 mapping Methods 0.000 abstract description 4
- 238000009499 grossing Methods 0.000 abstract description 3
- 238000013528 artificial neural network Methods 0.000 description 5
- 238000002474 experimental method Methods 0.000 description 5
- 230000015572 biosynthetic process Effects 0.000 description 3
- 230000000694 effects Effects 0.000 description 3
- 238000003786 synthesis reaction Methods 0.000 description 3
- 230000003042 antagnostic effect Effects 0.000 description 2
- 238000004364 calculation method Methods 0.000 description 2
- 238000002790 cross-validation Methods 0.000 description 2
- 238000012795 verification Methods 0.000 description 2
- 230000008485 antagonism Effects 0.000 description 1
- 238000013459 approach Methods 0.000 description 1
- 230000009286 beneficial effect Effects 0.000 description 1
- 238000012512 characterization method Methods 0.000 description 1
- 238000006243 chemical reaction Methods 0.000 description 1
- 238000007405 data analysis Methods 0.000 description 1
- 238000013135 deep learning Methods 0.000 description 1
- 230000007547 defect Effects 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- 230000018109 developmental process Effects 0.000 description 1
- 238000010586 diagram Methods 0.000 description 1
- 230000005611 electricity Effects 0.000 description 1
- 230000008451 emotion Effects 0.000 description 1
- 238000005516 engineering process Methods 0.000 description 1
- 230000005284 excitation Effects 0.000 description 1
- 238000010801 machine learning Methods 0.000 description 1
- 210000000056 organ Anatomy 0.000 description 1
- 238000011160 research Methods 0.000 description 1
- 230000011218 segmentation Effects 0.000 description 1
- 230000006403 short-term memory Effects 0.000 description 1
- 239000013589 supplement Substances 0.000 description 1
- 230000001755 vocal effect Effects 0.000 description 1
Images
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/0018—Speech coding using phonetic or linguistical decoding of the source; Reconstruction using text-to-speech synthesis
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/04—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
- G10L19/16—Vocoder architecture
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/66—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for extracting parameters related to health condition
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Abstract
The invention discloses an electroencephalogram (EEG) signal speech decoding method based on a generative adversarial network, which performs speech decoding on EEG signals and synthesizes intelligible vowel audio. The method uses synchronously acquired EEG signals and speech data from a subject; after effective preprocessing, a generative adversarial network learns the mapping from the EEG signals to the speech data, which effectively alleviates the over-smoothing problem, so that the audio synthesized from the decoded speech features has better intelligibility. The network consists of a generator and a discriminator. The generator is responsible for reducing the dimensionality of the neural features and generating the speech features; the discriminator is responsible for judging the authenticity of the speech features. The invention is characterized by high decoding accuracy and strong intelligibility when decoding single phonemes. In addition, compared with existing speech decoding algorithms, the invention significantly alleviates the over-smoothing present in speech features decoded from EEG signals.
Description
Technical Field
The invention belongs to the technical field of electroencephalogram data analysis, and particularly relates to an electroencephalogram signal speech decoding method based on a generative adversarial network (GAN).
Background
In recent years, speech decoding technology in the field of brain-computer interfaces has developed rapidly. Current research on speech brain-computer interfaces can be divided into two technical routes. The first is discrete decoding, which classifies recorded neural signals into corresponding phonetic representations, e.g., phonemes or text, within a finite set of classes; the second is continuous decoding, in which neural signals are decoded directly into acoustic features or articulator motion trajectories corresponding to speech. Discrete speech feature decoding is relatively easy, but phonemes or words from similar classes are hard to distinguish and the classification set is limited. Continuous speech feature decoding can output arbitrary speech and, through features such as intonation and pauses, can synthesize emotional speech, but decoding is more difficult and places higher accuracy requirements on the decoding results.
Currently, the most common methods for continuous speech feature decoding are neural networks, including Long Short-Term Memory (LSTM) networks, autoencoders, and deep neural networks (DNN). Thanks to the rapid development of deep learning, these neural network approaches achieve higher accuracy than traditional machine learning models. However, they still face problems, an important one being over-smoothing of the decoded speech features. The cause of this problem is that the distribution of the decoded speech features tends to converge near the mean of the training data, resulting in the loss of high-frequency information. Compared with these networks, a generative adversarial network can better fit the data distribution and effectively alleviate the over-smoothing problem, which is of great value for improving the quality of speech decoding and the intelligibility of the synthesized audio.
Disclosure of Invention
In view of the above-mentioned drawbacks of the prior art, an object of the present invention is to provide an electroencephalogram signal speech decoding method based on a generative adversarial network, which uses synchronously acquired EEG signals and speech data from a subject; after effective preprocessing, the generative adversarial network learns the mapping between the EEG signals and the speech data, effectively alleviating the over-smoothing problem so that the audio synthesized from the decoded speech features has better intelligibility.
An electroencephalogram signal speech decoding method based on a generative adversarial network is realized through the following steps:
(1) Preprocessing the acquired raw EEG signals using time-frequency analysis, and taking the power spectral densities of different frequency bands of the EEG signals as neural features;
(2) Preprocessing the speech signals in a public corpus and the subject's speech signals with a speech signal processing tool to obtain the mel-cepstrum features of the public-corpus speech signals, and the mel-cepstrum features, fundamental frequency F0 and aperiodic components of the subject's speech signals as speech features, and then aligning the speech features with the neural features;
(3) Establishing the optimal alignment path between the mel-cepstrum features of the public-corpus speech signals and the mel-cepstrum features of the subject's speech signals with a dynamic time warping algorithm, and inferring the subject's electromagnetic articulography (EMA) data from the EMA data in the public corpus;
(4) Taking the neural features aligned in step (2) as input data, taking the speech features aligned in step (2) and the subject's EMA data obtained in step (3) jointly as decoding targets, and feeding them into a generative adversarial network model for training, thereby constructing a generative adversarial network for speech decoding;
(5) Feeding the neural features of the test set into the generative adversarial network for speech decoding, computing the decoded mel-cepstrum features, fundamental frequency F0, aperiodic components and EMA data, and synthesizing them into a speech waveform with a vocoder.
In step (1), the preprocessing specifically includes:
first, filtering and downsampling the raw EEG signal of each channel; then extracting the power spectral density; finally, dividing the extracted power spectral density into several frequency bands of different bandwidths according to center frequency, obtaining the power spectral densities of the different frequency bands of the EEG signal.
Specifically, the raw neural signal is first filtered with a common median reference algorithm; the neural signal is then low-pass filtered (to prevent aliasing) and downsampled; the time-frequency features of each channel's neural signal are then extracted with a multitaper power spectrum estimation algorithm; finally, the extracted time-frequency features are divided into several frequency bands of different bandwidths according to center frequency.
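For illustration, the sketch below shows one way this referencing, anti-alias filtering and downsampling step could be implemented in Python with NumPy/SciPy. The sampling rates and the 500 Hz cut-off follow the embodiment described later; the filter order and the use of `scipy.signal` are assumptions, not the patented implementation.

```python
# A minimal sketch of the neural-signal preprocessing step, assuming
# 30 kHz raw data, a 500 Hz anti-aliasing cut-off and a 2 kHz target rate.
import numpy as np
from scipy.signal import butter, filtfilt

def preprocess_neural(raw, fs_in=30000, fs_out=2000, cutoff=500.0):
    """raw: (n_channels, n_samples) array of neural recordings."""
    # Common median reference: subtract the per-sample median across channels.
    referenced = raw - np.median(raw, axis=0, keepdims=True)

    # Low-pass filter to prevent aliasing before downsampling.
    b, a = butter(4, cutoff / (fs_in / 2), btype="low")
    filtered = filtfilt(b, a, referenced, axis=1)

    # Downsample to the target rate.
    factor = fs_in // fs_out
    return filtered[:, ::factor]
```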
The speech signals in step (2) include the speech signals in the public corpus and the subject's speech signals. The subject's speech signals are recordings of the subject reading aloud designated vowels or phrases from the public corpus several times (lip-reading and silent-reading samples are excluded), acquired synchronously with the neural data through a neural signal processor. The speech signals in the public corpus and the subject's speech signals are then preprocessed with a speech signal processing tool to obtain the mel-cepstrum features of the public-corpus speech signals, and the mel-cepstrum features, fundamental frequency F0 and aperiodic components of the subject's speech signals, which serve as speech features; the speech features are then aligned with the neural features.
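The embodiment later names the SPTK and WORLD toolkits for this feature extraction; a hedged sketch using their Python bindings (pysptk and pyworld) is given below. The frame period matches the 10 ms step of the embodiment, while the cepstral order and the warping coefficient alpha are assumed values.

```python
# A sketch of the speech-feature extraction step (mel-cepstrum, F0,
# aperiodicity) using pyworld and pysptk; order and alpha are assumptions.
import numpy as np
import pyworld as pw
import pysptk

def extract_speech_features(wav, fs, frame_period=10.0, order=24, alpha=0.42):
    """wav: 1-D waveform. Returns mel-cepstrum, F0 and coded aperiodicity."""
    wav = wav.astype(np.float64)
    f0, sp, ap = pw.wav2world(wav, fs, frame_period=frame_period)
    mcep = pysptk.sp2mc(sp, order=order, alpha=alpha)  # (T, order+1) mel-cepstrum
    bap = pw.code_aperiodicity(ap, fs)                 # band-coded aperiodicity
    return mcep, f0, bap
```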
In step (3), the optimal alignment path between the mel-cepstrum features of the public-corpus speech signals and the mel-cepstrum features of the subject's speech signals is established with a dynamic time warping algorithm, and the subject's EMA data are then inferred from the EMA data in the public corpus, as follows:
Denote the mel-cepstrum features of the public-corpus speech signal as $X \in \mathbb{R}^{d \times N}$ and the mel-cepstrum features of the subject's speech signal as $Y \in \mathbb{R}^{d \times M}$, where $d$ is the feature dimension of the mel-cepstrum and $N$ and $M$ are the sequence lengths of $X$ and $Y$, respectively. The dynamic time warping algorithm generates an alignment path $p = \{(p_X(k), p_Y(k))\}_{k=1}^{K}$ between $X$ and $Y$; specifically, $(p_X(k), p_Y(k))$ indicates that the $p_X(k)$-th point of $X$ is paired with the $p_Y(k)$-th point of $Y$, and the path is obtained by minimizing
$$\sum_{k=1}^{K} \left\| X_{p_X(k)} - Y_{p_Y(k)} \right\|_2 ,$$
where $K$ is the length of the dynamic time warping path. The alignment index sequence $q$ of $X$ relative to $Y$ is then obtained as
$$q(i) = p_X(j), \qquad j = \min \{\, j \mid p_Y(j) \ge i \,\},$$
where $j$ is a temporary index and the smallest $j$ satisfying $p_Y(j) \ge i$ is taken; each point of $q$ represents the index of $X$ when the sequences $X$ and $Y$ are aligned. The EMA data in the public corpus are interpolated via the alignment index sequence $q$ to generate the subject's EMA data.
In step (4), the generative adversarial network model comprises a generator and a discriminator;
the generator consists of a feature dimension-reduction module and a decoding module, and is used to decode the neural features into speech features and EMA data;
the discriminator is used to judge the authenticity of the generated speech features and EMA data, and comprises a speech discriminator D1 and an EMA data discriminator D2, which judge the speech features and the EMA data, respectively.
In step (4), the training data are fed into the generative adversarial network model for training, specifically as follows:
samples are fed batch by batch into the generator to obtain the decoding results predicted by the generator, and the loss L_G between the decoding results and the real targets is computed;
meanwhile, with the real EMA data as the condition, the speech-feature part of the decoding result and the real speech features are fed separately into the speech discriminator to obtain discrimination scores for the generated and real speech features, and the loss function L_D1 between the scores and the labels is computed;
with the real speech features as the condition, the EMA part of the decoding result and the real EMA data are fed separately into the EMA data discriminator to compute the loss function L_D2; the three loss terms are weighted and summed to obtain the total loss function L(θ), the network parameters are updated by back-propagation with the adaptive moment estimation (Adam) optimization algorithm, and iterations continue until the loss function L(θ) converges, where G denotes the generator, D1 is the speech discriminator, D2 is the EMA data discriminator, and θ denotes the network parameters of the entire generative adversarial network.
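The sketch below illustrates one training iteration of this scheme using the classes defined in the previous sketch. The specific loss forms (MSE reconstruction, binary cross-entropy adversarial terms) and the weights `w_rec`/`w_adv` are assumptions; the patent only specifies that the three loss terms are weighted, summed and optimized with Adam.

```python
# Hedged sketch of one training step with the weighted three-part loss.
import torch
import torch.nn.functional as F

def train_step(G, D_speech, D_ema, opt_g, opt_d, neural, speech_real, ema_real,
               w_rec=1.0, w_adv=0.1):
    n = neural.size(0)
    speech_fake, ema_fake = G(neural)

    # Discriminator update: real scores toward 1, generated scores toward 0.
    d_loss = (F.binary_cross_entropy_with_logits(D_speech(speech_real, ema_real), torch.ones(n)) +
              F.binary_cross_entropy_with_logits(D_speech(speech_fake.detach(), ema_real), torch.zeros(n)) +
              F.binary_cross_entropy_with_logits(D_ema(ema_real, speech_real), torch.ones(n)) +
              F.binary_cross_entropy_with_logits(D_ema(ema_fake.detach(), speech_real), torch.zeros(n)))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator update: reconstruction loss plus adversarial losses, weighted sum.
    speech_fake, ema_fake = G(neural)
    g_loss = (w_rec * (F.mse_loss(speech_fake, speech_real) + F.mse_loss(ema_fake, ema_real)) +
              w_adv * (F.binary_cross_entropy_with_logits(D_speech(speech_fake, ema_real), torch.ones(n)) +
                       F.binary_cross_entropy_with_logits(D_ema(ema_fake, speech_real), torch.ones(n))))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```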
In step (4), using a speech discriminator D1 and an EMA data discriminator D2 separates the two different targets of the speech decoding task, so that the network's discrimination of the motion parameters and of the speech features is carried out separately. This effectively exploits the closer mapping between the neural features and the EMA data to constrain and supplement the mapping between the neural features and the speech features, improving the learning capability of the network, supplementing the high-frequency part of the speech features, providing more detailed speech characteristics, and improving the speech decoding performance.
Compared with the prior art, the invention has the following beneficial effects: the invention develops a neural-signal speech decoding method based on a generative adversarial network, performs speech decoding on EEG signals, and synthesizes intelligible vowel audio; the decoding and synthesis of continuous speech features are realized algorithmically, the decoding results have high accuracy, and the synthesized audio has good intelligibility. The method consists of a generator and a discriminator. The generator network consists of a dimension-reduction module and a decoding module, responsible respectively for reducing the dimensionality of the neural features and generating the speech features; the discriminator network consists of stacked convolutional layers and is responsible for judging the authenticity of the speech features. The generative adversarial network algorithm adopted by the invention alleviates the over-smoothing of the speech features and improves the accuracy of the decoded speech features. Experiments show that, on a subject whose native language is French, the correlation coefficient of the decoded speech features averages 0.51, the mel-cepstral distortion averages 4.64, and the short-time objective intelligibility of the synthesized audio averages 0.48. The invention is characterized by high decoding accuracy and strong intelligibility when decoding single phonemes. In addition, compared with existing speech decoding algorithms, the invention significantly alleviates the over-smoothing present in speech features decoded from EEG signals.
Drawings
Fig. 1 is a schematic diagram of the structure of the generative adversarial network.
Fig. 2 is a graph of correlation coefficients for a portion of speech features.
Fig. 3 is a flow chart of an electroencephalogram signal voice decoding method based on a generation countermeasure network.
Detailed Description
In order to describe the present invention more specifically, the technical scheme of the present invention is described in detail below with reference to the accompanying drawings and specific embodiments.
As shown in fig. 3, the electroencephalogram signal speech decoding method based on a generative adversarial network of the present invention comprises the following steps:
(1) Preprocessing the acquired raw EEG signals using time-frequency analysis, and extracting the power spectral densities of different frequency bands of the neural signals as neural features;
(2) Preprocessing the speech signals in a public corpus and the subject's speech signals with a speech signal processing tool to obtain the mel-cepstrum features of the public-corpus speech signals, and the mel-cepstrum features, fundamental frequency F0 and aperiodic components of the subject's speech signals as speech features, and then aligning the speech features with the neural features;
(3) Establishing, for a specific public corpus, the optimal alignment path between the mel-cepstrum features of the public-corpus speech signals and the mel-cepstrum features of the subject's speech signals with a dynamic time warping (DTW) algorithm, and inferring the subject's EMA data from the EMA data in the public corpus;
(4) Taking the aligned neural features as input data, taking the aligned speech features and the subject's EMA data jointly as decoding targets, and feeding them into a generative adversarial network model for training, thereby constructing a generative adversarial network for speech decoding;
(5) Feeding the neural features of the test set into the trained generative adversarial network model, computing the decoded mel-cepstrum features, fundamental frequency F0, aperiodic components and EMA data, and then synthesizing these speech features into a speech waveform with a vocoder.
In the preprocessing of step (1), the raw neural signal is first filtered with a common median reference algorithm; the neural signal is then low-pass filtered (to prevent aliasing) and downsampled; the time-frequency features of each channel's neural signal are then extracted with a multitaper power spectrum estimation algorithm; finally, the extracted time-frequency features are divided into several frequency bands of different bandwidths according to center frequency.
The subject's speech signals in step (2) are recordings of the subject reading aloud designated vowels or phrases from the public corpus several times (lip-reading and silent-reading samples are excluded), acquired synchronously with the neural data through a neural signal processor. The speech signals in the public corpus and the subject's speech signals are preprocessed with a speech signal processing tool to obtain the mel-cepstrum features of the public-corpus speech signals and the speech features of the subject's speech signals (comprising the mel-cepstrum features, fundamental frequency F0 and aperiodic components), and the speech features are then aligned with the neural features.
In step (3), the subject's EMA data are inferred with the DTW algorithm. Denote the mel-cepstrum features of the public-corpus speech signal as $X \in \mathbb{R}^{d \times N}$ and the mel-cepstrum features of the acquired subject speech signal as $Y \in \mathbb{R}^{d \times M}$, where $d$ is the feature dimension of the mel-cepstrum and $N$ and $M$ are the sequence lengths of $X$ and $Y$, respectively. The DTW algorithm generates an alignment path $p = \{(p_X(k), p_Y(k))\}_{k=1}^{K}$ between $X$ and $Y$, in which $(p_X(k), p_Y(k))$ indicates that the $p_X(k)$-th point of $X$ is paired with the $p_Y(k)$-th point of $Y$; the path is obtained by minimizing
$$\sum_{k=1}^{K} \left\| X_{p_X(k)} - Y_{p_Y(k)} \right\|_2 .$$
The alignment index sequence is $q(i) = p_X(j)$, where $j$ is a temporary index and the smallest $j$ satisfying $p_Y(j) \ge i$ is taken; each point of $q$ represents the index of $X$ when the sequences $X$ and $Y$ are aligned. The EMA data of the public corpus are interpolated via this index sequence and taken as the subject's EMA data.
The generative adversarial network for speech decoding in step (4) comprises a generator and a discriminator. The generator G is used to decode the neural features into speech features and EMA data, and consists of a feature dimension-reduction module and a decoding module; the discriminator is used to judge the authenticity of the generated speech features and EMA data, and comprises a speech discriminator D1 and an EMA data discriminator D2, which judge the speech features and the EMA data, respectively.
The specific procedure for training the generative adversarial network is as follows: samples are fed batch by batch into the generator to obtain the decoding results predicted by the generator, and the loss $L_G$ between the decoding results and the real targets is computed; meanwhile, with the real EMA data as the condition, the speech-feature part of the decoding result and the real speech features are fed separately into the speech discriminator to obtain discrimination scores for the generated and real speech features, and the loss function $L_{D_1}$ between the scores and the labels is computed; similarly, with the real speech features as the condition, the EMA part of the decoding result and the real EMA data are fed separately into the EMA discriminator to compute the loss function $L_{D_2}$. The three partial loss functions are weighted and summed to obtain the total loss function $L(\theta)$; the network parameters are updated by back-propagation with the Adam optimization algorithm, iterating until the loss function $L(\theta)$ converges. Here $G$ denotes the generator, $D$ denotes either discriminator, $c$ is the condition of the current discriminator, $y$ is the real target data, $x$ is the input neural feature, $\hat{y} = G(x)$ is the decoding result, $D_1$ is the speech discriminator, $D_2$ is the EMA discriminator, $\theta$ denotes the network parameters of the entire generative adversarial network, and $\lambda$ is the weight of the loss $L_G$.
The invention can perform speech decoding on EEG signals and synthesize intelligible speech audio.
The electroencephalogram signal speech decoding method based on a generative adversarial network of the invention is described below with reference to an example, comprising the following steps: (1) Data acquisition. 97 phrases and 11 vowels are selected from the public corpus BY2014; the experiment requires subjects to read aloud the designated vowels or phrases, and a Blackrock neural signal processor is used to synchronously acquire and record the neural signals and speech data during the experiment. Each experiment lasts about 30 minutes.
The specific flow of the data acquisition experiment is as follows: first, a phrase or a sequence of 2-3 vowels is randomly selected from the chosen content, and the subject reads the screen content aloud after seeing it; the screen then goes black for a random period of 0 to 1 s; a cross cue then appears on the screen, at which point the subject must repeat the previous content aloud; finally, after the screen goes black again for a period within 1 s, the procedure restarts from the first step.
(2) The neural signals and speech data acquired in step (1) are preprocessed:
For the neural data, common median reference filtering is first applied to the raw neural signal; the neural signal is then filtered with a low-pass filter with a cut-off frequency of 500 Hz; the neural signal is then downsampled from 30 kHz to 2 kHz; finally, a multitaper power spectrum estimation algorithm is used to extract time-frequency features from the neural signal with a step size of 10 ms, the extracted features are divided according to center frequency into 21 frequency bands of 10 Hz bandwidth (0-10 Hz, 10-20 Hz, 20-30 Hz, ..., 190-200 Hz, 200-210 Hz), and the spectral values within each frequency band are averaged to serve as the neural feature of that band.
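A sketch of the band-averaged time-frequency feature extraction for one channel is given below. The 10 ms step and the 21 bands of 10 Hz follow the description above; the frame length, taper count and DPSS parameters are assumed values, and SciPy's DPSS windows stand in for whatever multitaper implementation the inventors used.

```python
# Sketch: multitaper power spectra in 10-ms steps, averaged within 10-Hz bands.
import numpy as np
from scipy.signal.windows import dpss

def neural_band_features(sig, fs=2000, step_s=0.010, win_s=0.200, n_tapers=3,
                         band_edges=np.arange(0, 220, 10)):
    """sig: 1-D neural signal of one channel. Returns (n_frames, n_bands)."""
    win, step = int(win_s * fs), int(step_s * fs)
    tapers = dpss(win, NW=2.5, Kmax=n_tapers)          # Slepian windows
    freqs = np.fft.rfftfreq(win, 1.0 / fs)
    frames = []
    for start in range(0, len(sig) - win + 1, step):
        seg = sig[start:start + win]
        # Average the power spectra obtained with each taper (multitaper estimate).
        psd = np.mean(np.abs(np.fft.rfft(tapers * seg, axis=1)) ** 2, axis=0)
        # Average the power within each 10-Hz band.
        feats = [psd[(freqs >= lo) & (freqs < hi)].mean()
                 for lo, hi in zip(band_edges[:-1], band_edges[1:])]
        frames.append(feats)
    return np.array(frames)
```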
For the speech data, each sample is first segmented from the recorded continuous data stream. Based on the speech waveform, the data from 500 ms before speech onset to 500 ms after speech offset in each trial is taken as one sample. While segmenting the speech data, the neural features must be segmented in the same way according to the time labels. Then, 25-dimensional mel-cepstrum features and a 1-dimensional fundamental frequency (pitch) are extracted from each speech sample with the SPTK toolkit; a 2-dimensional aperiodicity is then extracted with the WORLD toolkit. In total, 28-dimensional speech features are obtained, again with a step size of 10 ms.
(3) The subject's EMA data are inferred with the DTW algorithm. Denote the mel-cepstrum features of the speech signal in the BY2014 corpus as $X \in \mathbb{R}^{d \times N}$ and the mel-cepstrum features of the subject's speech signal obtained in step (2) as $Y \in \mathbb{R}^{d \times M}$, where $d$ is the feature dimension of the speech data and $N$ and $M$ are the sequence lengths of $X$ and $Y$, respectively. The DTW algorithm generates an alignment path $p = \{(p_X(k), p_Y(k))\}_{k=1}^{K}$ between $X$ and $Y$, in which $(p_X(k), p_Y(k))$ indicates that the $p_X(k)$-th point of $X$ is paired with the $p_Y(k)$-th point of $Y$; the path is obtained by minimizing
$$\sum_{k=1}^{K} \left\| X_{p_X(k)} - Y_{p_Y(k)} \right\|_2 .$$
The alignment index sequence is $q(i) = p_X(j)$, where $j$ is a temporary index and the smallest $j$ satisfying $p_Y(j) \ge i$ is taken; each point of $q$ gives the index of $X$ when the sequences $X$ and $Y$ are aligned. The EMA data in the BY2014 corpus are interpolated via this index sequence and taken as the subject's EMA data.
(4) The neural features, speech features and the subject's EMA data obtained above are used to train the generative adversarial network SpeechGAN, whose structure is shown in fig. 1. The training process is as follows:
First, the neural time-frequency features $x$ are fed into the generator $G$ and decoded into the speech features $\hat{s}$ (comprising the mel-cepstrum, the aperiodic excitation parameters and the pitch) and the electromagnetic articulography (EMA) data $\hat{e}$:
$$\hat{s},\ \hat{e} = G(x).$$
Then, the real speech data $s$ and the real EMA data $e$ are fed into the two conditional discriminators (the speech discriminator $D_1$ and the motion discriminator $D_2$, i.e. the EMA data discriminator), each conditioned on the other modality:
$$D_1(s \mid e), \qquad D_2(e \mid s).$$
Next, the generated speech data $\hat{s}$ and EMA data $\hat{e}$ are fed into the two conditional discriminators in the same way:
$$D_1(\hat{s} \mid e), \qquad D_2(\hat{e} \mid s).$$
The loss functions of the two discriminators are computed from the discrimination scores of the real and generated data. After the loss functions of the two discriminators are obtained in this way, the network parameters of the two discriminators are updated with the Adam optimizer. The decoded speech features $\hat{s}$ and EMA data $\hat{e}$ are then fed again into the updated discriminators for judgment:
$$\hat{r}_s = D_1'(\hat{s} \mid e), \qquad \hat{r}_e = D_2'(\hat{e} \mid s),$$
where $D_1'$ and $D_2'$ denote the speech discriminator and the EMA discriminator after this round's update, and $\hat{r}_s$ and $\hat{r}_e$ denote the discrimination values obtained with the updated discriminators.
The loss function of the generator is then computed from these discrimination values and the reconstruction error, where $G$ denotes the generator, $\lambda$ is the weight of the L2 loss, $s$ and $e$ are the real speech features and EMA data, and $\hat{s}$ and $\hat{e}$ are the generated speech features and EMA data. After the generator's loss function is obtained, the network weights of the generator are updated with the Adam optimizer, completing one round of training.
(5) The dataset contains 634 samples in total, and 10-fold cross-validation is used to determine the decoding performance of the generative adversarial network. During training, the training data is further divided into a training set and a validation set; the decoded mel-cepstrum features M, fundamental frequency F0, aperiodic components Amp and EMA data (including the horizontal and vertical positions of the upper- and lower-lip marker points, etc.) are computed, and the optimal model is determined by the accuracy of the decoding results on the validation set. Under 10-fold cross-validation, the Pearson correlation coefficient between the decoding results and the real data is computed; the average correlation coefficient is 0.51 and the average mel-cepstral distortion is 4.64. Fig. 2 shows correlation coefficients between part of the decoding results and the real data, including: in the EMA data, the horizontal position Ux and vertical position Uy of the upper-lip marker point and the horizontal position Dx and vertical position Dy of the lower-lip marker point; the fundamental frequency F0; the aperiodic component feature AMP1; and the mel-cepstrum feature M0. The correlation coefficients of these features reach up to 0.8 or more.
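The sketch below illustrates the two evaluation metrics reported above, per-dimension Pearson correlation and mel-cepstral distortion, between decoded and reference features. The standard MCD scaling constant and the exclusion of the 0-th cepstral coefficient are assumptions; the patent does not state the exact formula it used.

```python
# Sketch of the evaluation metrics: Pearson correlation per feature dimension
# and mel-cepstral distortion (MCD) in dB.
import numpy as np

def pearson_per_dim(decoded, reference):
    """decoded, reference: (T, d) arrays. Returns one correlation per dimension."""
    return np.array([np.corrcoef(decoded[:, i], reference[:, i])[0, 1]
                     for i in range(decoded.shape[1])])

def mel_cepstral_distortion(mcep_decoded, mcep_reference):
    """MCD over the cepstral coefficients, excluding the 0-th (energy) term."""
    diff = mcep_decoded[:, 1:] - mcep_reference[:, 1:]
    return np.mean(10.0 / np.log(10.0) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1)))
```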
(6) The decoded mel-cepstrum features, F0 and aperiodicity obtained in step (5) are input into the WORLD vocoder to synthesize the speech waveform. The average short-time objective intelligibility of the synthesized speech is 0.48. Comparing the mel spectrograms of the real samples, the samples generated by a conventional GAN and the samples generated by the invention (SpeechGAN), the mel spectrograms generated by the invention are closer to the real samples than those generated by the conventional GAN; SpeechGAN characterizes the mel spectrogram better, especially its high-frequency part, for both vowels and sentences, so the over-smoothing problem is effectively avoided, the speech synthesized from the decoding results has higher fidelity, the decoding accuracy of the method is better, and the synthesized audio is more intelligible.
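A minimal synthesis sketch using the WORLD vocoder through pyworld (with pysptk converting the mel-cepstrum back to a spectral envelope) is shown below. The FFT size, warping coefficient and frame period are assumed values that should mirror those used during feature extraction.

```python
# Sketch: synthesize a waveform from decoded mel-cepstrum, F0 and aperiodicity
# with the WORLD vocoder (pyworld); parameters are assumptions.
import numpy as np
import pyworld as pw
import pysptk

def synthesize(mcep, f0, bap, fs, alpha=0.42, fft_size=1024, frame_period=10.0):
    """mcep: (T, order+1); f0: (T,); bap: (T, n_bands) coded aperiodicity."""
    sp = pysptk.mc2sp(mcep.astype(np.float64), alpha=alpha, fftlen=fft_size)
    ap = pw.decode_aperiodicity(np.ascontiguousarray(bap, dtype=np.float64), fs, fft_size)
    return pw.synthesize(f0.astype(np.float64), sp, ap, fs, frame_period)
```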
The invention develops a neural-signal speech decoding method based on a generative adversarial network, realizing the decoding and synthesis of continuous speech features algorithmically; the decoding results have high accuracy and the synthesized audio has good intelligibility. At the same time, the generative adversarial network algorithm adopted by the invention alleviates the over-smoothing of the speech features.
Claims (7)
1. An electroencephalogram signal speech decoding method based on a generative adversarial network, characterized by comprising the following steps:
(1) Preprocessing the acquired raw EEG signals using time-frequency analysis, and taking the power spectral densities of different frequency bands of the EEG signals as neural features;
(2) Preprocessing the speech signals in a public corpus and the subject's speech signals with a speech signal processing tool to obtain the mel-cepstrum features of the public-corpus speech signals, and the mel-cepstrum features, fundamental frequency F0 and aperiodic components of the subject's speech signals as speech features, and then aligning the speech features with the neural features;
(3) Establishing the optimal alignment path between the mel-cepstrum features of the public-corpus speech signals and the mel-cepstrum features of the subject's speech signals with a dynamic time warping algorithm, and inferring the subject's electromagnetic articulography (EMA) data from the EMA data in the public corpus;
(4) Taking the neural features aligned in step (2) as input data, taking the speech features aligned in step (2) and the subject's EMA data obtained in step (3) jointly as decoding targets, and feeding them into a generative adversarial network model for training, thereby constructing a generative adversarial network for speech decoding;
(5) Feeding the neural features of the test set into the generative adversarial network for speech decoding, computing the decoded mel-cepstrum features, fundamental frequency F0, aperiodic components and EMA data, and synthesizing them into a speech waveform with a vocoder.
2. The electroencephalogram signal speech decoding method according to claim 1, characterized in that: in step (1), the preprocessing specifically comprises:
first, filtering and downsampling the raw EEG signal of each channel; then extracting the power spectral density; finally, dividing the extracted power spectral density into several frequency bands of different bandwidths according to center frequency, obtaining the power spectral densities of the different frequency bands of the EEG signal.
3. The electroencephalogram signal speech decoding method according to claim 1, characterized in that: in step (3), the optimal alignment path between the mel-cepstrum features of the public-corpus speech signals and the mel-cepstrum features of the subject's speech signals is established with a dynamic time warping algorithm, and the subject's electromagnetic articulography data are then inferred from the electromagnetic articulography data in the public corpus, as follows:
denote the mel-cepstrum features of the public-corpus speech signal as $X \in \mathbb{R}^{d \times N}$ and the mel-cepstrum features of the subject's speech signal as $Y \in \mathbb{R}^{d \times M}$, where $d$ is the feature dimension of the mel-cepstrum and $N$ and $M$ are the sequence lengths of $X$ and $Y$, respectively; the dynamic time warping algorithm generates an alignment path $p = \{(p_X(k), p_Y(k))\}_{k=1}^{K}$ between $X$ and $Y$, in which $(p_X(k), p_Y(k))$ indicates that the $p_X(k)$-th point of $X$ is paired with the $p_Y(k)$-th point of $Y$, and the path is obtained by minimizing
$$\sum_{k=1}^{K} \left\| X_{p_X(k)} - Y_{p_Y(k)} \right\|_2 ,$$
where $K$ denotes the length of the path of the dynamic time warping algorithm; the alignment index sequence of $X$ relative to $Y$ is then obtained as
$$q(i) = p_X(j), \qquad j = \min \{\, j \mid p_Y(j) \ge i \,\}.$$
4. The electroencephalogram signal speech decoding method according to claim 1, characterized in that: in step (4), the generative adversarial network model comprises a generator and a discriminator.
5. The electroencephalogram signal speech decoding method according to claim 4, characterized in that: in step (4), the generator is used to decode the neural features into speech features and electromagnetic articulography data, and comprises a feature dimension-reduction module and a decoding module.
6. The electroencephalogram signal speech decoding method according to claim 4, characterized in that: in step (4), the discriminator is used to judge the authenticity of the generated speech features and electromagnetic articulography data, and comprises a speech discriminator and an electromagnetic articulography data discriminator, which judge the speech features and the electromagnetic articulography data, respectively.
7. The electroencephalogram signal speech decoding method according to claim 1, characterized in that: in step (4), the training data are fed into the generative adversarial network model for training, specifically as follows:
samples are fed batch by batch into the generator to obtain the decoding results output by the generator, and the loss function between the decoding results and the real targets is computed;
with the real electromagnetic articulography data as the condition, the speech-feature part of the decoding result and the real speech features are fed separately into the speech discriminator to obtain discrimination scores for the generated and real speech features, and the loss function between the scores and the labels is computed;
with the real speech features as the condition, the electromagnetic articulography part of the decoding result and the real electromagnetic articulography data are fed separately into the electromagnetic articulography data discriminator to compute the loss function;
the three loss functions are weighted and summed to obtain the total loss function L(θ); the network parameters are updated by back-propagation with the adaptive moment estimation optimization algorithm, iterating until the loss function L(θ) converges, where G represents the generator, D1 is the speech discriminator, D2 is the electromagnetic articulography data discriminator, and θ represents the network parameters of the entire generative adversarial network.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310220138.1A CN116364096B (en) | 2023-03-09 | 2023-03-09 | Electroencephalogram signal voice decoding method based on generation countermeasure network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310220138.1A CN116364096B (en) | 2023-03-09 | 2023-03-09 | Electroencephalogram signal voice decoding method based on generation countermeasure network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116364096A true CN116364096A (en) | 2023-06-30 |
CN116364096B CN116364096B (en) | 2023-11-28 |
Family
ID=86933972
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310220138.1A Active CN116364096B (en) | 2023-03-09 | 2023-03-09 | Electroencephalogram signal voice decoding method based on generation countermeasure network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116364096B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117130490A (en) * | 2023-10-26 | 2023-11-28 | 天津大学 | Brain-computer interface control system, control method and implementation method thereof |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112001306A (en) * | 2020-08-21 | 2020-11-27 | 西安交通大学 | Electroencephalogram signal decoding method for generating neural network based on deep convolution countermeasure |
CN113609988A (en) * | 2021-08-06 | 2021-11-05 | 太原科技大学 | End-to-end electroencephalogram signal decoding method for auditory induction |
AU2021104767A4 (en) * | 2021-07-31 | 2022-04-28 | Kumar G S, Shashi | Method for classification of human emotions based on selected scalp region eeg patterns by a neural network |
CN115620751A (en) * | 2022-10-14 | 2023-01-17 | 山西大学 | Electroencephalogram signal prediction method based on speaker voice induction |
-
2023
- 2023-03-09 CN CN202310220138.1A patent/CN116364096B/en active Active
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112001306A (en) * | 2020-08-21 | 2020-11-27 | 西安交通大学 | Electroencephalogram signal decoding method for generating neural network based on deep convolution countermeasure |
AU2021104767A4 (en) * | 2021-07-31 | 2022-04-28 | Kumar G S, Shashi | Method for classification of human emotions based on selected scalp region eeg patterns by a neural network |
CN113609988A (en) * | 2021-08-06 | 2021-11-05 | 太原科技大学 | End-to-end electroencephalogram signal decoding method for auditory induction |
CN115620751A (en) * | 2022-10-14 | 2023-01-17 | 山西大学 | Electroencephalogram signal prediction method based on speaker voice induction |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117130490A (en) * | 2023-10-26 | 2023-11-28 | 天津大学 | Brain-computer interface control system, control method and implementation method thereof |
CN117130490B (en) * | 2023-10-26 | 2024-01-26 | 天津大学 | Brain-computer interface control system, control method and implementation method thereof |
Also Published As
Publication number | Publication date |
---|---|
CN116364096B (en) | 2023-11-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Ménard et al. | Auditory normalization of French vowels synthesized by an articulatory model simulating growth from birth to adulthood | |
Takaki et al. | A deep auto-encoder based low-dimensional feature extraction from FFT spectral envelopes for statistical parametric speech synthesis | |
Wang et al. | Spoken Content and Voice Factorization for Few-Shot Speaker Adaptation. | |
CN116364096B (en) | Electroencephalogram signal voice decoding method based on generation countermeasure network | |
Yadav et al. | Prosodic mapping using neural networks for emotion conversion in Hindi language | |
Shah et al. | Nonparallel emotional voice conversion for unseen speaker-emotion pairs using dual domain adversarial network & virtual domain pairing | |
Wu et al. | Deep Speech Synthesis from MRI-Based Articulatory Representations | |
Haque et al. | Modification of energy spectra, epoch parameters and prosody for emotion conversion in speech | |
Padmini et al. | Age-Based Automatic Voice Conversion Using Blood Relation for Voice Impaired. | |
Takaki et al. | Multiple feed-forward deep neural networks for statistical parametric speech synthesis | |
CN116092473A (en) | Prosody annotation model, training method of prosody prediction model and related equipment | |
Narendra et al. | Parameterization of excitation signal for improving the quality of HMM-based speech synthesis system | |
CN116092471A (en) | Multi-style personalized Tibetan language speech synthesis model oriented to low-resource condition | |
CN114913844A (en) | Broadcast language identification method for pitch normalization reconstruction | |
Krug et al. | Articulatory synthesis for data augmentation in phoneme recognition | |
CN113436607A (en) | Fast voice cloning method | |
Mansouri et al. | Human Laughter Generation using Hybrid Generative Models. | |
Ilyes et al. | Statistical parametric speech synthesis for Arabic language using ANN | |
Chandra et al. | Towards The Development Of Accent Conversion Model For (L1) Bengali Speaker Using Cycle Consistent Adversarial Network (Cyclegan) | |
Toutios et al. | Contribution to statistical acoustic-to-EMA mapping | |
Louw | Neural speech synthesis for resource-scarce languages | |
Ng | Survey of data-driven approaches to Speech Synthesis | |
Uslu et al. | Turkish regional dialect recognition using acoustic features of voiced segments | |
Wang et al. | Non-parallel Accent Transfer based on Fine-grained Controllable Accent Modelling | |
Wang et al. | Artificial Intelligence and Machine Learning Application in NPP MCR Speech Monitoring System |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |