CN116364096B - Electroencephalogram signal speech decoding method based on a generative adversarial network - Google Patents

Electroencephalogram signal speech decoding method based on a generative adversarial network

Info

Publication number
CN116364096B
Authority
CN
China
Prior art keywords
voice
decoding
data
mel
speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310220138.1A
Other languages
Chinese (zh)
Other versions
CN116364096A (en)
Inventor
张韶岷
刘腾俊
冉星辰
万子俊
李悦
郑能干
陈卫东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202310220138.1A priority Critical patent/CN116364096B/en
Publication of CN116364096A publication Critical patent/CN116364096A/en
Application granted granted Critical
Publication of CN116364096B publication Critical patent/CN116364096B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00: Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/0018: Speech coding using phonetic or linguistical decoding of the source; Reconstruction using text-to-speech synthesis
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00: Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04: Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G10L19/16: Vocoder architecture
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/66: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for extracting parameters related to health condition
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A: TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00: Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10: Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Epidemiology (AREA)
  • General Health & Medical Sciences (AREA)
  • Public Health (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention discloses an electroencephalogram (EEG) signal speech decoding method based on a generative adversarial network, which decodes speech from EEG signals and synthesizes intelligible vowel audio. Using synchronously acquired EEG signals and speech data from a subject, after effective preprocessing, a generative adversarial network learns the mapping from the EEG signals to the speech data, which effectively alleviates the over-smoothing problem, so that audio synthesized from the decoded speech features is more intelligible. The model consists of a generator and a discriminator. The generator reduces the dimensionality of the neural features and generates the speech features; the discriminator judges whether the speech features are real or generated. The method achieves high decoding accuracy and good intelligibility when decoding single phonemes. In addition, compared with existing speech decoding algorithms, it markedly alleviates the over-smoothing present in speech features decoded from EEG.

Description

Electroencephalogram signal speech decoding method based on a generative adversarial network
Technical Field
The invention belongs to the technical field of electroencephalogram (EEG) data analysis, and particularly relates to an EEG signal speech decoding method based on a generative adversarial network (Generative Adversarial Network, GAN).
Background
In recent years, speech decoding technology in the field of brain-computer interfaces has developed rapidly. Current research on speech brain-computer interfaces follows two technical routes. The first is discrete decoding, which classifies recorded neural signals into corresponding speech representations, e.g., phonemes or text, within a finite set of classes; the second is continuous decoding, i.e., directly decoding the neural signals into acoustic features or vocal-organ motion trajectories corresponding to the speech. Discrete speech decoding is comparatively easy, but phonemes or words from similar classes are hard to distinguish and the class set is limited; continuous speech feature decoding can output arbitrary speech and can synthesize emotional speech through features such as intonation and pauses, but decoding is harder and the required accuracy of the decoding result is higher.
Currently, the most common methods for continuous speech feature decoding are neural networks, including Long Short-Term Memory (LSTM) networks, autoencoders, deep neural networks (Deep Neural Network, DNN), etc. Thanks to the rapid development of deep learning, these neural network approaches achieve higher accuracy than traditional machine learning models. However, they still face problems, and over-smoothing of the decoded speech features is one of the important ones. The cause is that the distribution of the decoded speech features tends to converge near the mean of the training data, so high-frequency information is lost. Compared with these networks, a generative adversarial network can better fit the data distribution and effectively alleviate the over-smoothing problem, which is of great value for improving the quality of speech decoding and the intelligibility of the synthesized audio.
Disclosure of Invention
In view of the above drawbacks of the prior art, an object of the present invention is to provide an EEG signal speech decoding method based on a generative adversarial network. Using synchronously acquired EEG signals and speech data from a subject, after effective preprocessing, the generative adversarial network learns the mapping between the EEG signals and the speech data, which effectively alleviates the over-smoothing problem, so that audio synthesized from the decoded speech features is more intelligible.
An EEG signal speech decoding method based on a generative adversarial network is realized through the following steps:
(1) Preprocessing the acquired raw EEG signals with a time-frequency analysis method, and taking the power spectral densities of different frequency bands of the EEG signals as neural features;
(2) Preprocessing the speech signals of a public corpus and the subject's speech signals with a speech signal processing tool to obtain the mel-cepstral features of the public-corpus speech signals, and the mel-cepstral features, fundamental frequency F0 and aperiodic component of the subject's speech signals, as the speech features, and then aligning the speech features with the neural features;
(3) Establishing the optimal alignment path between the mel-cepstral features of the public-corpus speech signals and the mel-cepstral features of the subject's speech signals with a dynamic time warping algorithm, and inferring the subject's electromagnetic articulography (EMA) data from the EMA data in the public corpus;
(4) Taking the neural features aligned in step (2) as input data, taking the speech features aligned in step (2) and the subject's EMA data from step (3) together as decoding targets, and inputting them into a generative adversarial network model for training, so as to construct a generative adversarial network for speech decoding;
(5) Inputting the neural features of the test set into the generative adversarial network for speech decoding, computing the decoded mel-cepstral features, fundamental frequency F0, aperiodic component and EMA data, and synthesizing them into a speech waveform with a vocoder.
In step (1), the preprocessing specifically includes:
firstly, filtering and downsampling the raw EEG signal of each channel; then extracting the power spectral density; and finally, dividing the extracted power spectral density into several frequency bands with different bandwidths according to the center frequency to obtain the power spectral densities of the different frequency bands of the EEG signal.
More specifically, the raw neural signals are first filtered with a common median reference algorithm; the neural signals are then low-pass filtered (to prevent aliasing) and downsampled; the time-frequency features of each channel's neural signal are then extracted with a multitaper (multi-window) power spectrum estimation algorithm; finally, the extracted time-frequency features are divided into several frequency bands with different bandwidths according to the center frequency.
The speech signals in step (2) include the speech signals in the public corpus and the subject's speech signals. The subject's speech signals are recordings of the subject reading aloud, several times, specified vowels or phrases from the public corpus (lip-reading and silent-reading samples are not included), acquired synchronously with the neural data through the neural signal processor. The speech signals of the public corpus and the subject's speech signals are then preprocessed with a speech signal processing tool to obtain the mel-cepstral features of the public-corpus speech signals, and the mel-cepstral features, fundamental frequency F0 and aperiodic component of the subject's speech signals, as the speech features, after which the speech features and the neural features are aligned.
In step (3), the optimal alignment path between the mel-cepstral features of the public-corpus speech signals and the mel-cepstral features of the subject's speech signals is established through a dynamic time warping algorithm, and the subject's EMA data are then inferred from the EMA data in the public corpus, as follows:
Denote the mel-cepstral features of the public-corpus speech signal as M^{pub} ∈ R^{d×T1} and the mel-cepstral features of the subject's speech signal as M^{sub} ∈ R^{d×T2}, where d is the feature dimension of the mel-cepstrum and T1 and T2 are the sequence lengths of M^{pub} and M^{sub}, respectively. The dynamic time warping algorithm generates a warping path p = {(p^{pub}(l), p^{sub}(l))}, l = 1, …, L, between M^{pub} and M^{sub}, where (p^{pub}(l), p^{sub}(l)) means that the p^{pub}(l)-th frame of M^{pub} is aligned with the p^{sub}(l)-th frame of M^{sub}. The path is obtained by minimizing the accumulated distance Σ_{l=1}^{L} ‖ M^{pub}_{p^{pub}(l)} − M^{sub}_{p^{sub}(l)} ‖,
where L denotes the length of the path of the dynamic time warping algorithm. After p is computed, the alignment index sequence a of M^{sub} relative to M^{pub} is obtained as
a(i) = p^{pub}(j*), where j is a temporary index and j* is the minimum j satisfying p^{sub}(j) ≥ i; a(i) is thus the frame index in M^{pub} aligned with frame i of M^{sub}. The EMA data in the public corpus are interpolated at the alignment index sequence a to generate the subject's EMA data.
In step (4), the generative adversarial network model includes a generator and a discriminator.
The generator G is used for decoding the neural features into speech features and EMA data, and comprises a feature dimension-reduction module and a decoding module.
The discriminator is used for judging the authenticity of the generated speech features and EMA data, and comprises a speech discriminator D1 and an EMA discriminator D2, which judge the speech features and the EMA data, respectively.
In step (4), the training data are input into the generative adversarial network model for training, specifically as follows:
Samples are input into the generator batch by batch to obtain the decoding result predicted by the generator, and the loss L_G between the decoding result and the real target is calculated.
Meanwhile, with the real EMA data as the condition, the speech-feature part of the decoding result and the real speech features are respectively input into the speech discriminator to obtain discrimination scores for the generated and real speech features, and the loss function L_D1 between the scores and their labels is calculated.
With the real speech features as the condition, the EMA part of the decoding result and the real EMA data are respectively input into the EMA discriminator to calculate the loss function L_D2. The three loss terms are weighted and summed to obtain the total loss function L_total(G, D1, D2; θ); the network parameters are updated by back-propagation with the adaptive moment estimation (Adam) optimization algorithm, iterating until L_total converges, where G denotes the generator, D1 the speech discriminator, D2 the EMA discriminator, and θ the network parameters of the whole generative adversarial network.
In step (4), using a speech discriminator D1 and an EMA discriminator D2 separates the two different targets of the speech decoding task, so that the network discriminates the motion parameters and the speech features separately. This makes effective use of the closer mapping between neural features and EMA data to constrain and complement the mapping between neural features and speech features, improves the learning capacity of the network, supplements the high-frequency part of the speech features, provides more speech detail, and improves the speech decoding effect.
Compared with the prior art, the invention has the following beneficial effects. The invention provides a neural-signal speech decoding method based on a generative adversarial network that decodes speech from EEG and synthesizes intelligible vowel audio; the algorithm decodes and synthesizes continuous speech features, the decoding results are more accurate, and the synthesized audio is more intelligible. The model consists of a generator and a discriminator. The generator network consists of a dimension-reduction module and a decoding module, responsible respectively for reducing the dimensionality of the neural features and for generating the speech features; the discriminator network consists of stacked convolutional layers and is responsible for judging the authenticity of the speech features. The generative adversarial network adopted by the invention alleviates the over-smoothing of the speech features and improves the accuracy of the decoded speech features. Experiments show that, for a subject whose native language is French, the correlation coefficient of the decoded speech features averages 0.51, the mel-cepstral distortion averages 4.64, and the short-time objective intelligibility of the synthesized audio averages 0.48. The method achieves high decoding accuracy and good intelligibility when decoding single phonemes. In addition, compared with existing speech decoding algorithms, it markedly alleviates the over-smoothing present in speech features decoded from EEG.
Drawings
Fig. 1 is a schematic diagram of the structure of the generative adversarial network.
Fig. 2 is a graph of correlation coefficients for a portion of speech features.
Fig. 3 is a flow chart of the electroencephalogram signal speech decoding method based on a generative adversarial network.
Detailed Description
In order to describe the present invention more particularly, the technical scheme of the present invention is described in detail below with reference to the accompanying drawings and specific embodiments.
As shown in fig. 3, the EEG signal speech decoding method based on a generative adversarial network of the present invention comprises the following steps:
(1) Preprocessing the acquired raw EEG signals with a time-frequency analysis method, and extracting the power spectral densities of different frequency bands of the neural signals as neural features;
(2) Preprocessing the speech signals of the public corpus and the subject's speech signals with a speech signal processing tool to obtain the mel-cepstral features of the public-corpus speech signals, and the mel-cepstral features, fundamental frequency F0 and aperiodic component of the subject's speech signals, as the speech features, and then aligning the speech features with the neural features;
(3) Using a specific public corpus, establishing the optimal alignment path between the mel-cepstral features of the public-corpus speech signals and the mel-cepstral features of the subject's speech signals through a dynamic time warping (Dynamic Time Warping, DTW) algorithm, and inferring the subject's electromagnetic articulography (EMA) data from the EMA data in the public corpus;
(4) Taking the aligned neural features as input data, taking the aligned speech features and the subject's EMA data together as decoding targets, and inputting them into a generative adversarial network model for training, so as to construct a generative adversarial network for speech decoding;
(5) Inputting the neural features of the test set into the trained generative adversarial network model, computing the decoded mel-cepstral features, fundamental frequency F0, aperiodic component and EMA data, and then synthesizing the speech features into a speech waveform with a vocoder.
In the preprocessing of step (1), the raw neural signals are first filtered with a common median reference algorithm; the neural signals are then low-pass filtered (to prevent aliasing) and downsampled; the time-frequency features of each channel's neural signal are then extracted with a multitaper (multi-window) power spectrum estimation algorithm; finally, the extracted time-frequency features are divided into several frequency bands with different bandwidths according to the center frequency.
The subject's speech signals in step (2) are recordings of the subject reading aloud, several times, specified vowels or phrases from the public corpus (lip-reading and silent-reading samples are not included), acquired synchronously with the neural data through the neural signal processor. The speech signals of the public corpus and the subject's speech signals are preprocessed with a speech signal processing tool to obtain the mel-cepstral features of the public-corpus speech signals and the speech features of the subject's speech signals (comprising the mel-cepstral features, fundamental frequency F0 and aperiodic component), and then the speech features and the neural features are aligned.
In step (3), the subject's EMA data are inferred with the DTW algorithm as follows. Denote the mel-cepstral features of the public-corpus speech signal as M^{pub} ∈ R^{d×T1} and the mel-cepstral features of the acquired subject speech signal as M^{sub} ∈ R^{d×T2}, where d is the feature dimension of the mel-cepstrum and T1 and T2 are the sequence lengths of M^{pub} and M^{sub}, respectively. The DTW algorithm generates a warping path p = {(p^{pub}(l), p^{sub}(l))}, l = 1, …, L, where (p^{pub}(l), p^{sub}(l)) means that the p^{pub}(l)-th frame of M^{pub} is aligned with the p^{sub}(l)-th frame of M^{sub}. The path is obtained by minimizing the accumulated distance
Σ_{l=1}^{L} ‖ M^{pub}_{p^{pub}(l)} − M^{sub}_{p^{sub}(l)} ‖,
where L denotes the length of the DTW path.
After p is computed, the alignment index sequence of M^{sub} relative to M^{pub} is obtained as a(i) = p^{pub}(j*), where j is a temporary index and j* is the minimum j satisfying p^{sub}(j) ≥ i; each point a(i) of the index sequence gives the frame in M^{pub} aligned with frame i of M^{sub}. The EMA data of the public corpus are interpolated at this index sequence and used as the subject's EMA data.
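A minimal NumPy sketch of this alignment step is shown below, assuming mel-cepstral matrices of shape (frames, dims). The O(T1·T2) dynamic program and the nearest-index lookup (rather than a smoother interpolation) are simplifications for illustration, not the patent's exact implementation.

```python
import numpy as np

def dtw_path(M_pub, M_sub):
    """Plain DTW between two mel-cepstral sequences (frames x dims).
    Returns the warping path as two index arrays (p_pub, p_sub)."""
    T1, T2 = len(M_pub), len(M_sub)
    dist = np.linalg.norm(M_pub[:, None, :] - M_sub[None, :, :], axis=-1)
    acc = np.full((T1 + 1, T2 + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, T1 + 1):            # accumulated cost matrix
        for j in range(1, T2 + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(acc[i - 1, j],
                                                 acc[i, j - 1],
                                                 acc[i - 1, j - 1])
    i, j, path = T1, T2, []               # backtrack the optimal path
    while i > 1 or j > 1:
        path.append((i - 1, j - 1))
        step = np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    path.append((0, 0))
    p_pub, p_sub = np.array(path[::-1]).T
    return p_pub, p_sub

def infer_subject_ema(ema_pub, p_pub, p_sub, n_sub_frames):
    """For each subject frame i, take the first path point whose subject
    index reaches i and copy the corpus EMA frame aligned with it."""
    align = np.array([p_pub[np.searchsorted(p_sub, i)]
                      for i in range(n_sub_frames)])
    return ema_pub[align]
```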
The generative adversarial network for speech decoding in step (4) includes a generator and a discriminator. The generator G is used for decoding the neural features into speech features and EMA data, and consists of a feature dimension-reduction module and a decoding module; the discriminator is used for judging the authenticity of the generated speech features and EMA data, and comprises a speech discriminator D1 and an EMA discriminator D2, which judge the speech features and the EMA data, respectively.
The specific training procedure of the generative adversarial network is as follows: samples are input into the generator batch by batch to obtain the decoding result predicted by the generator, and the loss L_G between the decoding result and the real target is calculated; meanwhile, with the real EMA data as the condition, the speech-feature part of the decoding result and the real speech features are respectively input into the speech discriminator to obtain discrimination scores for the generated and real speech features, and the loss function L_D1 between the scores and their labels is calculated; similarly, with the real speech features as the condition, the EMA part of the decoding result and the real EMA data are respectively input into the EMA discriminator to calculate the loss function L_D2. The three loss terms are weighted and summed to obtain the total loss function L_total(G, D1, D2; θ), and the network parameters are updated by back-propagation with the Adam optimization algorithm, iterating until L_total converges.
In these loss functions, G denotes the generator, D denotes either discriminator and c the condition term of the current discriminator, y denotes the real target data, x the input neural features, G(x) the decoding result, D1 the speech discriminator, D2 the EMA discriminator, θ the network parameters of the whole generative adversarial network, and λ the weight of the L2 reconstruction term.
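Under the variables defined above, a standard conditional-GAN objective with an added L2 reconstruction term takes the following form; this particular form and weighting are an assumption for illustration, not the patent's verbatim equations:

```latex
% Assumed standard conditional adversarial losses with an added L2 term.
\mathcal{L}_{D_1} = -\,\mathbb{E}\big[\log D_1(y_{\mathrm{sp}} \mid y_{\mathrm{ema}})\big]
                    -\,\mathbb{E}\big[\log\big(1 - D_1(\hat{y}_{\mathrm{sp}} \mid y_{\mathrm{ema}})\big)\big],
\qquad
\mathcal{L}_{D_2} = -\,\mathbb{E}\big[\log D_2(y_{\mathrm{ema}} \mid y_{\mathrm{sp}})\big]
                    -\,\mathbb{E}\big[\log\big(1 - D_2(\hat{y}_{\mathrm{ema}} \mid y_{\mathrm{sp}})\big)\big],
\\[6pt]
\mathcal{L}_{G} = \mathbb{E}\big[\log\big(1 - D_1(\hat{y}_{\mathrm{sp}} \mid y_{\mathrm{ema}})\big)\big]
                + \mathbb{E}\big[\log\big(1 - D_2(\hat{y}_{\mathrm{ema}} \mid y_{\mathrm{sp}})\big)\big]
                + \lambda\,\lVert y - G(x) \rVert_2^2,
\qquad (\hat{y}_{\mathrm{sp}}, \hat{y}_{\mathrm{ema}}) = G(x).
```

The weighted sum of these three terms plays the role of L_total(G, D1, D2; θ) above, with L_D1 and L_D2 driving the two discriminators and the adversarial-plus-L2 term driving the generator.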
The invention can decode speech from EEG signals and synthesize intelligible speech audio.
With reference to an example, the EEG signal speech decoding method based on a generative adversarial network of the invention comprises the following steps: (1) Data acquisition. 97 phrases and 11 vowels were selected from the public corpus BY2014; the experiment required the subjects to read aloud the specified vowels or phrases, and a Blackrock neural signal processor was used to synchronously acquire and record the neural signals and speech data during the experiment. Each experiment lasted about 30 minutes.
The specific flow of the data acquisition experiment is as follows: first, a short sentence or a sequence of 2-3 vowels is randomly selected from the chosen material, and the subject reads the on-screen content aloud after seeing it; then the screen goes black for a random period of 0 to 1 s; next, a cross prompt appears on the screen, at which point the subject must repeat aloud the content just read; finally, after the screen goes black again for a period within 1 s, the procedure restarts from the first step.
(2) Preprocessing of the neural signals and speech data acquired in step (1):
for neural data, first common mode median reference filtering is used on the original neural signal; then filtering the neural signal using a low pass filter with a cut-off frequency of 500 Hz; then downsampling the nerve signal from 30kHz to 2kHz; finally, a multi-window power spectrum estimation algorithm is used for extracting time-frequency characteristics from the nerve signals, the step length is 10ms, the extracted characteristics are divided into 21 frequency bands (0-10 Hz, 10-20Hz, 20-30Hz, …, 190-200Hz and 200-210 Hz) with the bandwidth of 10Hz according to the central frequency, and all time spectrums in each frequency band are added evenly to be used as the nerve characteristics of the frequency band.
For the speech data, each sample is first segmented from the recorded continuous data stream: according to the speech waveform, the data from 500 ms before speaking to 500 ms after speaking in each trial is taken as one sample, and the neural features are segmented in the same way according to the time labels. Then 25-dimensional mel-cepstral features and a 1-dimensional fundamental frequency (pitch) are extracted from each speech sample with the SPTK toolkit, and 2-dimensional aperiodicity is extracted with the WORLD toolkit, giving 28-dimensional speech features in total, again with a step size of 10 ms.
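The same 28-dimensional parameterization can be approximated with the pyworld and pysptk Python bindings, as sketched below; the patent itself uses the SPTK and WORLD toolkits directly, and the sampling rate and all-pass constant alpha here are assumptions (22.05 kHz happens to yield two coded aperiodicity bands in WORLD).

```python
import numpy as np
import pyworld as pw
import pysptk

def speech_features(wav, fs=22_050, frame_period=10.0, order=24, alpha=0.455):
    """25-dim mel-cepstrum + 1-dim F0 + coded aperiodicity, 10 ms frames.
    `wav` is a mono waveform; returns an array of shape (frames, ~28)."""
    x = np.ascontiguousarray(wav, dtype=np.float64)
    f0, t = pw.harvest(x, fs, frame_period=frame_period)   # fundamental frequency
    sp = pw.cheaptrick(x, f0, t, fs)                        # spectral envelope
    ap = pw.d4c(x, f0, t, fs)                               # aperiodicity
    mcep = pysptk.sp2mc(sp, order=order, alpha=alpha)       # (frames, order + 1)
    bap = pw.code_aperiodicity(ap, fs)                      # coded aperiodicity bands
    return np.hstack([mcep, f0[:, None], bap])
```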
(3) The subject's EMA data are inferred with the DTW algorithm. Denote the mel-cepstral features of the speech signals in the BY2014 corpus as M^{pub} ∈ R^{d×T1} and the mel-cepstral features of the subject's speech signals obtained in step (2) as M^{sub} ∈ R^{d×T2}, where d is the feature dimension of the speech data and T1 and T2 are the sequence lengths of M^{pub} and M^{sub}, respectively. The DTW algorithm generates a warping path p = {(p^{pub}(l), p^{sub}(l))}, l = 1, …, L, where (p^{pub}(l), p^{sub}(l)) means that the p^{pub}(l)-th frame of M^{pub} is aligned with the p^{sub}(l)-th frame of M^{sub}. The path is obtained by minimizing the accumulated distance
Σ_{l=1}^{L} ‖ M^{pub}_{p^{pub}(l)} − M^{sub}_{p^{sub}(l)} ‖,
where L denotes the length of the DTW path.
After p is computed, the alignment index sequence of M^{sub} relative to M^{pub} is obtained as a(i) = p^{pub}(j*), where j is a temporary index and j* is the minimum j satisfying p^{sub}(j) ≥ i; each point of the index sequence gives the frame in M^{pub} aligned with the corresponding frame of M^{sub}. The EMA data in the BY2014 corpus are interpolated at this index sequence and used as the subject's EMA data.
(4) The neural features, the speech features and the subject's EMA data obtained in the above steps are used to train the generative adversarial network SpeechGAN, whose structure is shown in fig. 1. The training process is as follows:
first, neural time-frequency features are usedInput to the generator->Is decoded by the speech feature->(including: mel-cepstrum, aperiodic excitation parameters and pitch) and electromagnetic angiography (EMA) data +.>
Then, respectively, the real voice dataAnd EMA data->Is input to two condition discriminators (speech discriminator)And motion discriminator->Motion discriminator->I.e. electromagnetic angiography data arbiter):
obtained byAnd->The judgment values of the real voice data and the EMA data are respectively;
then, the generated voice data are respectively processedAnd EMA data->Input into two condition discriminators:
obtained byAnd->Respectively generating judgment values of voice data and EMA data;
calculating a loss function of the discriminator:
after the loss functions of the two discriminators are obtained through the method, the Adam optimizer is utilized to optimize the two discriminatorsIs updated. Speech feature to be decoded->And EMA data->Again input into the updated arbiter for decision:
wherein,and->Speech discriminators respectively representing updated cyclesAnd EMA discriminator,>and->Respectively representing discrimination values obtained using the updated discriminators.
Calculating a loss function of the generator:
wherein,representative generator->Weight lost for L2, +.>And true speech features and EMA, +.>For the generated speech features and EMA. The generator is obtained by the above method>After the loss function of the generator, updating the network weight of the generator by using an Adam optimizer to complete a round of training.
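One training iteration of this procedure might look like the following PyTorch sketch, reusing the Generator and ConditionalDiscriminator classes sketched earlier. The non-saturating binary cross-entropy adversarial loss and the L2 weight of 10 are illustrative assumptions; the patent's exact loss equations are not reproduced here.

```python
import torch
import torch.nn.functional as F

def train_step(G, D1, D2, opt_g, opt_d, x, y_sp, y_ema, l2_weight=10.0):
    """One SpeechGAN-style iteration: update the two conditional
    discriminators, then the generator (adversarial + weighted L2 loss)."""
    # ---- discriminator update -------------------------------------------
    with torch.no_grad():
        g_sp, g_ema = G(x)                       # generated speech / EMA
    real_scores = [D1(y_sp, y_ema), D2(y_ema, y_sp)]
    fake_scores = [D1(g_sp, y_ema), D2(g_ema, y_sp)]
    loss_d = sum(F.binary_cross_entropy_with_logits(s, torch.ones_like(s))
                 for s in real_scores) + \
             sum(F.binary_cross_entropy_with_logits(s, torch.zeros_like(s))
                 for s in fake_scores)
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # ---- generator update (discriminators are now updated) ---------------
    g_sp, g_ema = G(x)
    adv = F.binary_cross_entropy_with_logits(D1(g_sp, y_ema),
                                             torch.ones_like(real_scores[0])) + \
          F.binary_cross_entropy_with_logits(D2(g_ema, y_sp),
                                             torch.ones_like(real_scores[1]))
    rec = F.mse_loss(g_sp, y_sp) + F.mse_loss(g_ema, y_ema)
    loss_g = adv + l2_weight * rec
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()

# opt_d jointly optimizes D1 and D2, e.g.
# opt_d = torch.optim.Adam(list(D1.parameters()) + list(D2.parameters()), lr=1e-4)
# opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
```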
(5) The data set contains 634 samples in total, and 10-fold cross-validation is used to determine the decoding performance of the generative adversarial network. During training, the training data are further divided into a training set and a validation set; the decoded mel-cepstral features M, fundamental frequency F0, aperiodic component Amp and EMA (comprising, among others, the horizontal and vertical positions of the upper- and lower-lip marker points) are computed, and the optimal model is selected according to the accuracy of the decoding results on the validation set. Under 10-fold cross-validation, the Pearson correlation coefficient between the decoding results and the real data averages 0.51, and the mel-cepstral distortion averages 4.64. Fig. 2 shows correlation coefficients between part of the decoding results and the real data, including, in the EMA, the horizontal position Ux and vertical position Uy of the upper-lip marker point and the horizontal position Dx and vertical position Dy of the lower-lip marker point, as well as the fundamental frequency F0, the aperiodic feature AMP1 and the mel-cepstral feature M0; the correlation coefficient of these features reaches more than 0.8 at the highest.
(6) The decoded mel-cepstral features, F0 and aperiodicity obtained in step (5) are input into the WORLD vocoder to synthesize speech waveforms. The average short-time objective intelligibility of the synthesized speech is 0.48. Comparing the mel spectra of real samples, samples generated by a conventional GAN and samples generated by the invention (SpeechGAN), the mel spectra generated by the invention are closer to the real samples than those generated by the conventional GAN; for both vowels and sentences, SpeechGAN characterizes the mel spectrum better, especially in the high-frequency part, so the over-smoothing problem is effectively avoided, the speech synthesized from the decoding results is more faithful, the decoding accuracy of the method is better, and the synthesized audio is more intelligible.
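As a sketch of this last step with the pyworld and pysptk bindings (the sampling rate, frame period and alpha are the same assumptions as in the feature-extraction sketch above):

```python
import numpy as np
import pyworld as pw
import pysptk

def synthesize(mcep, f0, bap, fs=22_050, frame_period=10.0, alpha=0.455):
    """Resynthesize a waveform from decoded mel-cepstrum, F0 and coded
    aperiodicity with the WORLD vocoder."""
    fft_size = pw.get_cheaptrick_fft_size(fs)
    sp = pysptk.mc2sp(np.ascontiguousarray(mcep, np.float64), alpha, fft_size)
    ap = pw.decode_aperiodicity(np.ascontiguousarray(bap, np.float64), fs, fft_size)
    f0 = np.ascontiguousarray(f0, np.float64).ravel()
    return pw.synthesize(f0, sp, ap, fs, frame_period)
```

The short-time objective intelligibility reported above can then be computed against the reference recordings, for example with the pystoi package.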
The invention provides a neural-signal speech decoding method based on a generative adversarial network; the algorithm decodes and synthesizes continuous speech features, the decoding results are more accurate, and the synthesized audio is more intelligible. At the same time, the generative adversarial network adopted by the invention alleviates the over-smoothing of the speech features.

Claims (7)

1. An electroencephalogram signal speech decoding method based on a generative adversarial network, characterized by comprising the following steps:
(1) Preprocessing the acquired raw electroencephalogram signals with a time-frequency analysis method, and taking the power spectral densities of different frequency bands of the electroencephalogram signals as neural features;
(2) Preprocessing the speech signals of a public corpus and the subject's speech signals with a speech signal processing tool to obtain the mel-cepstral features of the public-corpus speech signals, and the mel-cepstral features, fundamental frequency F0 and aperiodic component of the subject's speech signals, as the speech features, and then aligning the speech features with the neural features;
(3) Establishing the optimal alignment path between the mel-cepstral features of the public-corpus speech signals and the mel-cepstral features of the subject's speech signals through a dynamic time warping algorithm, and inferring the subject's electromagnetic articulography data from the electromagnetic articulography data in the public corpus;
(4) Taking the neural features aligned in step (2) as input data, taking the speech features aligned in step (2) and the subject's electromagnetic articulography data from step (3) together as decoding targets, inputting them into a generative adversarial network model for training, and constructing a generative adversarial network for speech decoding;
(5) Inputting the neural features of the test set into the generative adversarial network for speech decoding, computing the decoded mel-cepstral features, fundamental frequency F0, aperiodic component and electromagnetic articulography data, and synthesizing them into a speech waveform with a vocoder.
2. The electroencephalogram signal speech decoding method according to claim 1, characterized in that: in step (1), the preprocessing specifically includes:
firstly, filtering and downsampling the raw electroencephalogram signal of each channel; then extracting the power spectral density; and finally, dividing the extracted power spectral density into several frequency bands with different bandwidths according to the center frequency to obtain the power spectral densities of the different frequency bands of the electroencephalogram signal.
3. The electroencephalogram signal speech decoding method according to claim 1, characterized in that: in step (3), the optimal alignment path between the mel-cepstral features of the public-corpus speech signals and the mel-cepstral features of the subject's speech signals is established through a dynamic time warping algorithm, and the subject's electromagnetic articulography data are then inferred from the electromagnetic articulography data in the public corpus, specifically as follows:
the mel-cepstral features of the public-corpus speech signal are denoted M^{pub} ∈ R^{d×T1} and the mel-cepstral features of the subject's speech signal are denoted M^{sub} ∈ R^{d×T2}, where d is the feature dimension of the mel-cepstrum and T1 and T2 are the sequence lengths of M^{pub} and M^{sub}, respectively; the dynamic time warping algorithm generates a warping path p = {(p^{pub}(l), p^{sub}(l))}, l = 1, …, L, between M^{pub} and M^{sub}, where (p^{pub}(l), p^{sub}(l)) means that the p^{pub}(l)-th frame of M^{pub} is aligned with the p^{sub}(l)-th frame of M^{sub}, and the path is obtained by minimizing the accumulated distance Σ_{l=1}^{L} ‖ M^{pub}_{p^{pub}(l)} − M^{sub}_{p^{sub}(l)} ‖, where L denotes the length of the path of the dynamic time warping algorithm;
after p is computed, the alignment index sequence a of M^{sub} relative to M^{pub} is obtained as a(i) = p^{pub}(j*), where j is a temporary index and j* is the minimum j satisfying p^{sub}(j) ≥ i; a(i) gives the frame in M^{pub} aligned with frame i of M^{sub}, and the electromagnetic articulography data in the public corpus are interpolated at the alignment index sequence a to generate the subject's electromagnetic articulography data.
4. The electroencephalogram signal speech decoding method according to claim 1, characterized in that: in step (4), the generative adversarial network model includes a generator and a discriminator.
5. The electroencephalogram signal speech decoding method according to claim 4, characterized in that: in step (4), the generator is used for decoding the neural features into speech features and electromagnetic articulography data, and comprises a feature dimension-reduction module and a decoding module.
6. The electroencephalogram signal speech decoding method according to claim 4, characterized in that: in step (4), the discriminator is used for judging the authenticity of the generated speech features and electromagnetic articulography data, and comprises a speech discriminator and an electromagnetic articulography data discriminator, which judge the speech features and the electromagnetic articulography data, respectively.
7. The electroencephalogram signal speech decoding method according to claim 1, characterized in that: in step (4), the training data are input into the generative adversarial network model for training, specifically comprising the following steps:
inputting samples into a generator one by one according to batches to obtain a decoding result output by the generator, and calculating the decoding result and realityLoss function between targets
The real electromagnetic joint contrast data is used as a condition, the voice characteristic part and the real voice characteristic in the decoding result are respectively input into a voice discriminator to obtain the judgment score of the generated voice characteristic and the real voice characteristic, and the loss function between the judgment score and the label is calculated
The real voice characteristic is used as a condition, electromagnetic joint contrast data and real electromagnetic joint contrast data in the decoding result are respectively input into an electromagnetic joint contrast data discriminator to calculate and obtain a loss function
The three loss functions are weighted and summed to obtain the loss functionThe network parameters are updated by back propagation of an adaptive motion estimation optimization algorithm, and the network parameters are iterated in sequence until a loss function is +.>Convergence, wherein G represents the generator, D1 is the speech discriminator, D2 is the electromagnetic joint contrast data discriminator, and θ represents the network parameters of the whole generated countermeasure network.
CN202310220138.1A 2023-03-09 2023-03-09 Electroencephalogram signal speech decoding method based on a generative adversarial network Active CN116364096B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310220138.1A CN116364096B (en) Electroencephalogram signal speech decoding method based on a generative adversarial network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310220138.1A CN116364096B (en) Electroencephalogram signal speech decoding method based on a generative adversarial network

Publications (2)

Publication Number Publication Date
CN116364096A CN116364096A (en) 2023-06-30
CN116364096B (en) 2023-11-28

Family

ID=86933972

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310220138.1A Active CN116364096B (en) Electroencephalogram signal speech decoding method based on a generative adversarial network

Country Status (1)

Country Link
CN (1) CN116364096B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117130490B (en) * 2023-10-26 2024-01-26 天津大学 Brain-computer interface control system, control method and implementation method thereof

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112001306A (en) * 2020-08-21 2020-11-27 西安交通大学 Electroencephalogram signal decoding method for generating neural network based on deep convolution countermeasure
CN113609988A (en) * 2021-08-06 2021-11-05 太原科技大学 End-to-end electroencephalogram signal decoding method for auditory induction
AU2021104767A4 (en) * 2021-07-31 2022-04-28 Kumar G S, Shashi Method for classification of human emotions based on selected scalp region eeg patterns by a neural network
CN115620751A (en) * 2022-10-14 2023-01-17 山西大学 Electroencephalogram signal prediction method based on speaker voice induction

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112001306A (en) * 2020-08-21 2020-11-27 西安交通大学 Electroencephalogram signal decoding method for generating neural network based on deep convolution countermeasure
AU2021104767A4 (en) * 2021-07-31 2022-04-28 Kumar G S, Shashi Method for classification of human emotions based on selected scalp region eeg patterns by a neural network
CN113609988A (en) * 2021-08-06 2021-11-05 太原科技大学 End-to-end electroencephalogram signal decoding method for auditory induction
CN115620751A (en) * 2022-10-14 2023-01-17 山西大学 Electroencephalogram signal prediction method based on speaker voice induction

Also Published As

Publication number Publication date
CN116364096A (en) 2023-06-30

Similar Documents

Publication Publication Date Title
CN109767778B (en) Bi-L STM and WaveNet fused voice conversion method
Wang et al. Spoken Content and Voice Factorization for Few-Shot Speaker Adaptation.
CN102473416A (en) Voice quality conversion device, method therefor, vowel information generating device, and voice quality conversion system
CN116364096B (en) Electroencephalogram signal voice decoding method based on generation countermeasure network
Wu et al. Deep speech synthesis from articulatory representations
Yadav et al. Prosodic mapping using neural networks for emotion conversion in Hindi language
Shah et al. Nonparallel emotional voice conversion for unseen speaker-emotion pairs using dual domain adversarial network & virtual domain pairing
CN114842878A (en) Speech emotion recognition method based on neural network
Wu et al. Deep Speech Synthesis from MRI-Based Articulatory Representations
Haque et al. Modification of energy spectra, epoch parameters and prosody for emotion conversion in speech
Saheer et al. Combining vocal tract length normalization with hierarchical linear transformations
Padmini et al. Age-Based Automatic Voice Conversion Using Blood Relation for Voice Impaired.
Alrehaili et al. Arabic Speech Dialect Classification using Deep Learning
Narendra et al. Parameterization of excitation signal for improving the quality of HMM-based speech synthesis system
CN114913844A (en) Broadcast language identification method for pitch normalization reconstruction
Krug et al. Articulatory synthesis for data augmentation in phoneme recognition
Chit et al. Myanmar continuous speech recognition system using convolutional neural network
CN113436607A (en) Fast voice cloning method
Erro et al. On combining statistical methods and frequency warping for high-quality voice conversion
Wang et al. Analysis of Chinese interrogative intonation and its synthesis in HMM-Based synthesis system
Mansouri et al. Human Laughter Generation using Hybrid Generative Models.
Chandra et al. Towards The Development Of Accent Conversion Model For (L1) Bengali Speaker Using Cycle Consistent Adversarial Network (Cyclegan)
Ilyes et al. Statistical parametric speech synthesis for Arabic language using ANN
Louw Neural speech synthesis for resource-scarce languages
Bae et al. Enhancement of Pitch Controllability using Timbre-Preserving Pitch Augmentation in FastPitch

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant