CN114550701A - Deep neural network-based Chinese electronic larynx voice conversion device and method - Google Patents

Deep neural network-based Chinese electronic larynx voice conversion device and method

Info

Publication number
CN114550701A
Authority
CN
China
Prior art keywords
voice
speech
module
electronic larynx
acoustic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210180441.9A
Other languages
Chinese (zh)
Inventor
李明
史尧
杨耀根
张昊哲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Duke Kunshan University
Original Assignee
Duke Kunshan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Duke Kunshan University filed Critical Duke Kunshan University
Priority to CN202210180441.9A priority Critical patent/CN114550701A/en
Publication of CN114550701A publication Critical patent/CN114550701A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/14 Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G10L19/16 Vocoder architecture
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Abstract

The invention discloses a Chinese electronic larynx voice conversion device and method based on a deep neural network. The device comprises: a speech recognition module for extracting a derived linguistic feature sequence from electronic larynx speech as the electronic larynx speech bottleneck-layer features; a bottleneck feature mapping module for estimating the corresponding normal-speech bottleneck-layer features from the bottleneck-layer features extracted from the electronic larynx speech, recovering the tone and other linguistic information missing from the electronic larynx speech; an acoustic feature reconstruction module for synthesizing the estimated normal bottleneck-layer features into an acoustic feature sequence corresponding to a specific voice; and a vocoder module that uses the Mel spectrum of speech as the acoustic feature, extracts this feature, and reconstructs the speech signal from it. The invention uses speech conversion technology to improve or enhance the intelligibility and naturalness of electronic larynx speech from the perspective of software post-processing.

Description

Deep neural network-based Chinese electronic larynx voice conversion device and method
Technical Field
The invention relates to the field of intelligent voice processing, in particular to a Chinese electronic larynx voice conversion device and method based on a deep neural network.
Background
The ability to speak is one of the foundations of participation in social activities, yet many patients lose their larynx to total laryngectomy performed for laryngeal diseases such as laryngeal cancer. Because the entire vocal folds are removed, such patients lose the ability to produce voice autonomously and must rely on electronic larynx (EL) products to provide, in place of the vocal folds, the vibration needed for sound production. When the patient tries to speak, the electronic larynx device is held against the neck near the laryngeal prominence, below the mandible, and the mechanical vibration produced by the device combines with the patient's voluntary oral movements to produce speech.
However, existing electronic larynx products still have two obvious shortcomings. First, unlike the naturally varying vocal fold vibration of healthy speech production, the vibration produced by an electronic larynx device is monotonous, so speech produced with its assistance lacks natural tone information; as a result, the intelligibility and naturalness of electronic larynx speech are significantly lower than those of normal speech. Second, electronic larynx products work by mechanical vibration, and part of the energy is radiated directly as noise without passing through the adjustment of the human vocal tract; the interference of this mechanical noise further degrades the naturalness of electronic larynx speech. In sum, because their speech quality falls significantly below what is generally expected, existing electronic larynx products do not completely solve the speech communication impairment of patients without a larynx.
Speech conversion techniques are a class of intelligent speech processing techniques aimed at changing particular characteristics of speech. Previous research has demonstrated the feasibility of enhancing electronic larynx speech, that is, improving its intelligibility and naturalness, by modeling "electronic larynx - normal speech" pairs with speech conversion techniques.
Early techniques of this kind analyzed and synthesized a set of speech features decomposed by Digital Signal Processing (DSP) techniques, such as the fundamental frequency (F0), the Spectral Envelope, the aperiodic feature (AP), and their derivatives; the feature mapping between electronic larynx and normal speech was then modeled with traditional statistical machine learning methods, including the Gaussian Mixture Model (GMM) and Non-Negative Matrix Factorization (NMF); finally, the enhanced speech was synthesized with vocoders based on signal processing, such as STRAIGHT and WORLD. To address the missing fundamental frequency component and the missing voiced/unvoiced distinction in electronic larynx speech, researchers have also proposed mapping the fundamental frequency of the enhanced speech by introducing Phoneme Posterior Probability (PPP) features derived from a speech recognition model. However, with the development of deep learning in intelligent speech processing, both modeling methods based on traditional statistical machine learning and vocoders designed from signal processing theory have shown their limitations in modeling capability, speech quality, and other respects.
Disclosure of Invention
In view of these technical problems, the invention aims to overcome at least one of the shortcomings of existing electronic larynx products in sound production quality and of electronic larynx speech enhancement based on traditional statistical learning methods in processing quality. The invention provides a Chinese electronic larynx speech conversion method based on deep learning, which uses speech conversion technology to improve or enhance the intelligibility and naturalness of electronic larynx speech from the perspective of software post-processing.
The technical scheme of the invention is as follows:
According to an embodiment of the present invention, a Chinese electronic larynx speech conversion device based on a deep neural network is provided, comprising the following sub-modules:
the speech recognition module is used for extracting a derived linguistic feature sequence from the electronic larynx speech as a bottleneck layer feature;
the bottleneck characteristic mapping module is used for estimating the corresponding normal voice bottleneck layer characteristics according to the bottleneck layer characteristics extracted from the electronic larynx voice and predicting the tone and/or other linguistic information lacking in the electronic larynx voice;
the acoustic feature reconstruction module is used for synthesizing the estimated normal bottleneck layer features into an acoustic feature sequence corresponding to the specific voice;
a vocoder module for extracting acoustic features, using the Mel spectrum of speech as the acoustic feature, and for reconstructing the speech signal from the acoustic features.
In the above technical solution, the speech recognition module, the bottleneck feature mapping module, the acoustic feature reconstruction module, and the vocoder module are each optimized independently with the data they require, and at the test stage they are connected in series as "electronic larynx speech - electronic larynx bottleneck-layer features - normal bottleneck-layer features - acoustic features - enhanced speech", completing the conversion from electronic larynx speech to enhanced speech with high intelligibility and naturalness.
In the above technical solution, the speech recognition module is structurally divided into an acoustic model and a decoder. The acoustic model is a multilayer neural network that receives the speech signal as input and generates its linguistic features; the decoder receives the linguistic features as input and outputs the speech recognition result. The linguistic features are supervised phoneme N-gram posterior probabilities or implicit neural network representations without an explicit structure, and the bottleneck-layer features are the forward-propagation activations of the acoustic model's output layer or a similar layer.
In this technical solution, the vocoder module comprises an acoustic feature extraction sub-module and a speech signal reconstruction sub-module. The acoustic feature extraction sub-module extracts acoustic features through signal pre-emphasis, short-time Fourier transform, Mel filtering, and numerical statistical processing; the speech signal reconstruction sub-module reconstructs the speech signal with a neural network vocoder designed with deep learning techniques.
In the above technical solution, the acoustic feature reconstruction module models multiple reconstructed timbres with a multi-speaker voice conversion model, so that multiple timbres can be generated.
In this technical solution, the bottleneck feature mapping module adopts a two-stage pre-training and fine-tuning scheme: in the pre-training stage, a software simulation method converts a large-scale normal speech corpus into simulated electronic larynx speech with a constant excitation signal, and the simulated electronic larynx - normal speech parallel corpus is used for pre-training; in the fine-tuning stage, real parallel corpora obtained by recording are used for fine-tuning.
According to another embodiment of the present invention, a Chinese electronic larynx speech conversion method based on a deep neural network is provided, implemented with the above Chinese electronic larynx speech conversion device, and comprising the following steps:
step S1: data acquisition is carried out by purposefully acquiring electronic larynx parallel corpora or using an internet existing corpus;
step S2: training a voice recognition module by using large-scale voice recognition corpora;
step S3: performing joint training on the bottleneck layer mapping module and the acoustic feature reconstruction module by using the electronic larynx-normal voice parallel corpus;
step S4: training the vocoder module with a generative adversarial network technique, using the Mel spectrum as the acoustic feature;
step S5: testing the voice recognition module, the bottleneck layer feature mapping module, the acoustic feature reconstruction module and the vocoder module by connecting input and output data of the voice recognition module, the bottleneck layer feature mapping module, the acoustic feature reconstruction module and the vocoder module in series to finish the conversion from electronic larynx voice to enhanced voice;
step S6: and performing an evaluation test on the test result.
In the above technical solution, step S2 includes:
step S201: preparing data in an existing corpus to finish data preprocessing;
step S202: training monophone and triphone alignment models and performing forced alignment on the speech-text pairs of the training data;
step S203: an acoustic model is trained using triphone alignment information.
In the above technical solution, step S3 includes:
step S301: taking electronic larynx parallel linguistic data as training data pairing, aligning by using a dynamic time warping technology, and respectively extracting bottleneck layer characteristics and acoustic characteristics;
step S302: matching the aligned and parallel electronic larynx speech bottleneck layer characteristics with the normal speech bottleneck layer characteristics, and training a neural network model of a bottleneck layer mapping module to be convergent by using a regression supervised learning method;
step S303: and (3) using aligned and parallel normal voice bottleneck layer feature-normal voice acoustic feature pairing, and using a regression and supervised learning method to train a neural network model of the acoustic feature reconstruction module to be converged.
In the above technical solution, step S4 includes the following steps:
step S401: preparing corpus data and finishing data preprocessing;
step S402: simultaneously and alternately optimizing the generator and the discriminator with generative adversarial learning, where the generator's optimization goal is to make the discriminator's classification wrong, and the discriminator's optimization goal is to correctly distinguish normal speech from speech reconstructed by the generator.
In the above technical solution, step S5 includes the following steps:
step S501: collecting electronic larynx voices to obtain test input voice samples;
step S502: computing the MFCC features of the input speech sample, feeding them into the acoustic model neural network of the speech recognition module, and taking the activations of its information bottleneck layer as the bottleneck-layer features of the electronic larynx speech;
step S503: inputting the bottleneck characteristic of the electronic larynx voice into a bottleneck characteristic mapping module, and obtaining the estimated normal voice bottleneck characteristic through forward propagation;
step S504: inputting the normal voice bottleneck characteristics obtained by estimation into an acoustic characteristic reconstruction model, and obtaining Mel spectral characteristics corresponding to the enhanced voice through forward propagation;
step S505: inputting the reconstructed Mel spectrum characteristics into a generator neural network of the MelGAN vocoder module, and obtaining the electronic larynx enhanced voice signal through forward propagation.
Compared with the prior art, the invention has the following beneficial effects:
the method is realized based on the deep neural network technology, and has stronger data modeling capability and better enhancement effect on the electronic larynx voice. The improvement can be proved by objective and subjective evaluation experiment comparison. In addition, through proper module segmentation, only a bottleneck characteristic mapping module needs to rely on an electronic larynx-normal voice parallel corpus which is not easy to obtain to realize training, and meanwhile, attributes such as expression learning of linguistic characteristics, tone color and tone quality of reconstructed voice and the like are separated from enhancement of the linguistic characteristics; allowing the respective module to be trained or pre-trained on a large scale of suitable data sets.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. It is obvious that the drawings in the following description show only some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative effort.
Fig. 1 is a schematic diagram of a Chinese electronic larynx speech conversion device according to an embodiment of the present invention.
FIG. 2 is a schematic diagram of a speech recognition module with training and testing stages according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a bottleneck feature mapping module in training and testing phases according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of an acoustic feature reconstruction module with training and testing phases according to an embodiment of the present invention;
fig. 5 is a schematic diagram of the vocoder module training and testing stages according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention.
As shown in fig. 1, the present invention provides a Chinese electronic larynx speech conversion device based on a deep neural network, the device comprising:
a speech recognition module, which is used for extracting a derived linguistic Feature sequence from the electronic larynx speech as a Bottleneck layer Feature (BNF);
the bottleneck characteristic mapping module is used for estimating the corresponding normal voice bottleneck layer characteristics according to the bottleneck layer characteristics extracted from the electronic larynx voice and predicting the tone and/or other linguistic information lacking in the electronic larynx voice;
the acoustic feature reconstruction module is used for synthesizing the estimated normal bottleneck layer features into an acoustic feature sequence corresponding to the specific voice;
a vocoder module for extracting acoustic features, using the Mel spectrum of speech as the acoustic feature, and for reconstructing the speech signal from the acoustic features.
The four modules can be independently optimized by using the data required by each module, and a series connection form of 'electronic larynx voice-electronic larynx bottleneck layer characteristics-normal bottleneck layer characteristics-acoustic characteristics-enhanced voice' is formed in the testing stage, so that the conversion from the electronic larynx voice to the enhanced voice with high intelligibility and naturalness is completed.
In the present invention, the speech recognition module can be structurally divided into two parts: an acoustic model and a decoder. The acoustic model is a multilayer neural network that receives the speech signal as input and generates its linguistic features; the decoder receives the linguistic features as input and outputs the speech recognition result. The linguistic features are supervised phoneme N-gram posterior probabilities (Phonetic Posterior Probability / Posteriorgram, PPP/PPG) or, in jointly optimized acoustic models and decoders, hidden neural network representations (Hidden Representations) without an explicit structure. Depending on the particular speech recognition system, the two sub-modules may be jointly optimized or trained separately. The bottleneck-layer features are the activations of the acoustic model's output layer, or a similar layer, during forward propagation, including but not limited to a specially designed bottleneck layer placed before the output layer.
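The bottleneck-layer features described above are simply the activations of a designated hidden layer, captured while the trained acoustic model propagates forward. A minimal PyTorch sketch of such an extractor follows; acoustic_model and bottleneck_layer are hypothetical handles to an already trained network and one of its layers, since this disclosure does not fix a particular topology.

```python
import torch

def make_bottleneck_extractor(acoustic_model, bottleneck_layer):
    """Return a callable that runs the trained ASR acoustic model and captures
    the forward-propagation activations of its designated bottleneck layer."""
    cache = {}

    def hook(_module, _inputs, output):
        cache["bnf"] = output.detach()

    bottleneck_layer.register_forward_hook(hook)

    @torch.no_grad()
    def extract(features):            # features: (batch, frames, feature_dim)
        acoustic_model(features)      # forward pass only; the decoder is not needed
        return cache["bnf"]           # (batch, frames, bottleneck_dim)

    return extract
```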
The bottleneck feature mapping module and the acoustic feature reconstruction module adopt a stack of feed-forward Transformer (FFT) blocks based on the self-attention mechanism and residual connections as the neural network design. During supervised training, the whole speech segment is taken as the basic unit over which the regression loss function is computed. The acoustic feature reconstruction module models multiple reconstructed timbres with a multi-speaker model so that multiple timbres can be produced.
The vocoder module uses the Mel spectrum of the voice as the acoustic features, so the vocoder module comprises an acoustic feature extraction sub-module and a voice signal reconstruction sub-module, the acoustic feature extraction sub-module adopts extraction algorithms of signal pre-emphasis (Preemphasis), Short Time Fourier Transform (STFT), Mel-frequency Filtering (Mel-frequency Filtering) and numerical statistical processing to extract the acoustic features, and the voice signal reconstruction sub-module adopts a neural network vocoder designed based on the deep learning technology to reconstruct the voice signal. Typical designs include MelGAN, HiFi-GAN, WaveRNN, etc., and other structures that perform equivalent functions may be used instead.
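To make the acoustic feature extraction sub-module concrete, the sketch below runs the named stages (pre-emphasis, short-time Fourier transform, Mel filtering, and log compression as the numerical processing step) with librosa. The frame length, hop size, number of Mel bands, and pre-emphasis coefficient are illustrative assumptions; the disclosure names the stages but not their settings.

```python
import numpy as np
import librosa

def extract_mel_spectrogram(wav_path, sr=16000, n_fft=1024, hop=256,
                            n_mels=80, preemph=0.97):
    """Pre-emphasis -> STFT -> Mel filtering -> log compression."""
    y, _ = librosa.load(wav_path, sr=sr)
    # Signal pre-emphasis: boost high frequencies before spectral analysis.
    y = np.append(y[0], y[1:] - preemph * y[:-1])
    # Short-time Fourier transform, magnitude spectrum.
    spec = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop))
    # Mel filtering: project the linear spectrum onto a Mel filter bank.
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    mel = mel_fb @ spec
    # Numerical processing: log compression of the dynamic range.
    return np.log(np.maximum(mel, 1e-5)).T   # (frames, n_mels)
```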
In the specific implementation of the present invention, the speech recognition module, the bottleneck feature mapping module, the acoustic feature reconstruction module, and the vocoder module are each trained separately on a dataset suited to its requirements. First, the speech recognition module is trained on large-scale, multi-speaker normal speech paired with transcription text. Second, the acoustic feature reconstruction and vocoder modules are trained on a high-quality speech corpus, which helps improve the sound quality and naturalness of the reconstructed speech. Third, the bottleneck feature mapping module adopts a two-stage pre-training and fine-tuning scheme: in the pre-training stage, it is trained on a simulated electronic larynx - normal speech parallel corpus obtained with a WORLD-vocoder software simulation method; only in the fine-tuning stage is the "electronic larynx - normal speech" parallel corpus required for training. This design reduces the demands of the overall scheme on the quantity and quality of the hard-to-obtain electronic larynx parallel corpus. On this basis, jointly training the bottleneck-layer mapping module and the acoustic feature reconstruction module on the electronic larynx - normal speech parallel corpus can further improve the electronic larynx speech enhancement of the whole system.
In the specific embodiment of the invention, the bottleneck feature mapping module adopts this two-stage pre-training and fine-tuning scheme as follows: in the pre-training stage, a software simulation method based on the WORLD vocoder converts a large-scale normal speech corpus into simulated electronic larynx speech with a constant fundamental frequency, and the simulated electronic larynx - normal speech parallel corpus is used for pre-training; in the fine-tuning stage, real parallel corpora obtained by recording are used for fine-tuning.
The software simulation of electronic larynx speech based on the WORLD vocoder mainly comprises the following steps (a code sketch follows the list):
step 1: decomposing normal voice into three items of fundamental frequency, spectral feature (SP) and aperiodic feature by using a feature extraction module in a WORLD vocoder;
step 2: given constant FelSetting the fundamental frequency of normal speech to be constant F for the typical fundamental frequency of electronic larynx speechel
And step 3: using synthesis module of WORLD vocoder to convert original audio spectrum characteristic, non-periodic characteristic and constant FelThe fundamental frequencies of the analog electronic larynx are synthesized into analog electronic larynx voices.
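A minimal sketch of these three steps with the pyworld bindings of the WORLD vocoder is shown below; the constant value F_el = 100 Hz is an illustrative assumption, as the embodiment only states that a typical electronic larynx fundamental frequency is used.

```python
import numpy as np
import pyworld as pw
import soundfile as sf

def simulate_el_speech(wav_path, out_path, f_el=100.0):
    """Turn normal speech into simulated electronic-larynx speech by replacing
    its natural F0 contour with a constant excitation frequency F_el."""
    x, fs = sf.read(wav_path)
    x = x.astype(np.float64)
    # Step 1: WORLD analysis -> fundamental frequency, spectral envelope (SP),
    # and aperiodicity (AP).
    f0, t = pw.harvest(x, fs)
    sp = pw.cheaptrick(x, f0, t, fs)
    ap = pw.d4c(x, f0, t, fs)
    # Step 2: replace the F0 contour with the constant F_el.
    f0_const = np.full_like(f0, f_el)
    # Step 3: WORLD synthesis of the simulated electronic-larynx speech.
    y = pw.synthesize(f0_const, sp, ap, fs)
    sf.write(out_path, y, fs)
```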
More specifically, as shown in fig. 2, the whole solution of the present embodiment can be divided into two stages: a training phase and a testing phase. Wherein the completion of the training is a prerequisite for the test to be performed. The training phase comprises training of neural network models in four main modules (including a pre-training phase and a fine-tuning phase) and joint training of series modules. The testing stage is a practical working stage of the invention, and in the testing stage, the electronic larynx speech is converted into the enhanced speech with obviously improved intelligibility and naturalness by connecting the input interface and the output interface of the four main modules in series.
Specifically, the invention provides a Chinese electronic larynx speech conversion method based on a deep neural network, which comprises the following steps:
step S1: data acquisition is carried out by purposefully acquiring electronic larynx parallel corpora or using an internet existing corpus;
aiming at step S1, different data sets are collected for training according to the structural and functional requirements of each module neural network. The electronic larynx parallel linguistic data are acquired through targeted collection, and the rest linguistic data are all made of internet public data.
Aiming at the electronic larynx parallel corpus data, a source speaker and a target speaker are the same female volunteer in young. The volunteer is trained in use of the electronic larynx in advance before recording the voice of the electronic larynx, and the scene that a patient without the larynx relies on the electronic larynx equipment to make a sound is simulated by using the auxiliary sound production of the electronic larynx under the condition of not vibrating the vocal cords. The electronic throat equipment used was a product of Tianrem medical instruments, Inc. in Huzhou, and the recording sampling rate was 16 kHz. Parallel corpora with the duration of about 5 hours are acquired, and test data are reserved.
Aiming at the requirements of the voice recognition module on data quantity and coverage, an AISHELL-2 corpus containing 1000-hour voice and transcribed text is used for training. And the data of 185 persons in the AISHELL-3 corpus is used for training aiming at the timbre and timbre requirements of the vocoder module. Both of the above corpora are public data.
As shown in fig. 2, step S2: the speech recognition module is trained using large-scale speech recognition corpora.
The specific implementation of the speech recognition module is based on the Librispeech speech recognition recipe in the Kaldi Toolkit. The acoustic model uses Mel-Frequency Cepstral Coefficients (MFCC) as input features. The neural network consists of a 17-layer time-delay neural network, a 256-dimensional information bottleneck layer, and a triphone posterior probability (i.e., 3-gram PPG) output layer. In this embodiment, the activations of the information bottleneck layer are used as the bottleneck-layer features.
Step S2 includes:
step S201: preparing data in an existing corpus to finish data preprocessing; the corpus is AISHELL-2.
Step S202: training monophone and triphone alignment models and performing forced alignment on the speech-text pairs of the training data;
step S203: an acoustic model is trained using triphone alignment information.
As shown in fig. 3 and 4, step S3: and performing joint training on the bottleneck layer mapping module and the acoustic feature reconstruction module by using the electronic larynx-normal voice parallel corpora.
In this embodiment, the bottleneck feature mapping module and the acoustic feature reconstruction module are implemented as neural networks consisting of a fully connected pre-processing network, a stack of 8 feed-forward Transformer blocks, a fully connected output layer, and a post-processing layer composed of 5 convolutional layers. The model width is 256 dimensions.
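A condensed PyTorch sketch of this backbone is given below: a fully connected pre-net, 8 feed-forward Transformer (FFT) blocks with self-attention and residual connections, a fully connected output projection, and a 5-layer convolutional post-net, all at a model width of 256. The number of attention heads, the feed-forward width, the convolution kernel sizes, and the 80-dimensional output (a Mel spectrogram for the reconstruction module) are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class FFTBlock(nn.Module):
    """One feed-forward Transformer block: self-attention plus a convolutional
    feed-forward network, each wrapped in a residual connection with layer norm."""
    def __init__(self, d_model=256, n_heads=2, d_ff=1024, kernel=3):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Conv1d(d_model, d_ff, kernel, padding=kernel // 2), nn.ReLU(),
            nn.Conv1d(d_ff, d_model, kernel, padding=kernel // 2))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                        # x: (batch, frames, d_model)
        a, _ = self.attn(x, x, x)
        x = self.norm1(x + a)                    # residual connection 1
        f = self.ff(x.transpose(1, 2)).transpose(1, 2)
        return self.norm2(x + f)                 # residual connection 2

class FeatureMapper(nn.Module):
    """Backbone shared by the bottleneck-feature mapping and acoustic-feature
    reconstruction modules: pre-net, 8 FFT blocks, output layer, conv post-net."""
    def __init__(self, d_in=256, d_model=256, d_out=80, n_blocks=8):
        super().__init__()
        self.prenet = nn.Linear(d_in, d_model)
        self.blocks = nn.ModuleList([FFTBlock(d_model) for _ in range(n_blocks)])
        self.proj = nn.Linear(d_model, d_out)
        layers = []
        for _ in range(5):
            layers += [nn.Conv1d(d_out, d_out, 5, padding=2), nn.Tanh()]
        self.postnet = nn.Sequential(*layers)

    def forward(self, x):                        # x: (batch, frames, d_in)
        h = self.prenet(x)
        for blk in self.blocks:
            h = blk(h)
        y = self.proj(h)
        return y + self.postnet(y.transpose(1, 2)).transpose(1, 2)
```

During supervised training, an utterance-level regression loss (for example, the mean squared error accumulated over all frames of a whole speech segment) would be computed between this network's output and the aligned target features.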
Step S3 includes:
step S301: taking the electronic larynx parallel corpus as paired training data, aligning the pairs with the Dynamic Time Warping (DTW) technique, and extracting the bottleneck-layer features and acoustic features respectively (see the alignment sketch after this list);
step S302: matching the aligned and parallel electronic larynx speech bottleneck layer characteristics with the normal speech bottleneck layer characteristics, and training a neural network model of a bottleneck layer mapping module to be convergent by using a regression supervised learning method;
step S303: using the aligned, parallel normal-speech bottleneck-layer feature - normal-speech acoustic feature pairs, training the neural network model of the acoustic feature reconstruction module to convergence with a supervised regression learning method.
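A minimal sketch of the alignment in step S301 is shown below, using librosa's dynamic time warping implementation; the choice of alignment features and the Euclidean frame distance are illustrative assumptions.

```python
import librosa

def align_parallel_pair(el_feats, normal_feats):
    """Time-align an electronic-larynx feature sequence with its parallel
    normal-speech sequence via DTW and return frame-paired matrices.

    el_feats, normal_feats: (frames, dims) arrays extracted from the two
    recordings of the same sentence."""
    # librosa expects (dims, frames); the default cost is the Euclidean distance.
    _, wp = librosa.sequence.dtw(X=el_feats.T, Y=normal_feats.T, metric="euclidean")
    wp = wp[::-1]                              # warping path in start-to-end order
    return el_feats[wp[:, 0]], normal_feats[wp[:, 1]]
```

The frame pairs returned this way serve as the input-target pairs for the regression training in steps S302 and S303.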
Referring to fig. 5, which is a schematic diagram of the training and testing stages of the vocoder module according to the embodiment of the present invention, step S4: training the vocoder module with a generative adversarial network technique, using the Mel spectrum as the acoustic feature;
The vocoder module uses the Mel spectrum as its specific acoustic feature. The corresponding speech signal reconstruction neural network uses MelGAN as the implementation. The model comprises a generator and a discriminator and is trained with the Generative Adversarial Network (GAN) technique. Only the generator participates in the testing stage. The generator consists of a 4-layer transposed-convolution upsampling network and residual stacks; the multi-scale discriminator group comprises three convolutional neural networks operating at the full sampling rate, 1/4 of the sampling rate, and 1/8 of the sampling rate, respectively.
Step S4 is as follows:
step S401: preparing corpus data and finishing data preprocessing; specifically, the corpus is AISHELL-3.
Step S402: and (3) simultaneously and alternately optimizing the generator and the discriminator by using a generative counterlearning technology, wherein the optimization goal of the generator is to make the classifier classification result wrong, and the optimization goal of the discriminator is to correctly distinguish normal voice from generator reconstructed voice.
In the testing stage, only the generator of the generative adversarial network is used, converting the acoustic features into a higher-quality speech signal.
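The alternating optimization of step S402 can be sketched as follows, assuming a MelGAN-style generator G (Mel spectrogram to waveform) and a discriminator D; the hinge-style objectives used here are one common choice and an assumption of this sketch, as is the omission of auxiliary terms such as the feature-matching loss used by MelGAN.

```python
import torch
import torch.nn.functional as F

def gan_step(G, D, opt_g, opt_d, mel, real_wav):
    """One alternating update: the discriminator learns to separate normal speech
    from generator output; the generator learns to make that classification fail."""
    fake_wav = G(mel)

    # Discriminator update: real speech scored high, reconstructed speech low.
    opt_d.zero_grad()
    loss_d = F.relu(1.0 - D(real_wav)).mean() + F.relu(1.0 + D(fake_wav.detach())).mean()
    loss_d.backward()
    opt_d.step()

    # Generator update: push the discriminator's score on generated speech up.
    opt_g.zero_grad()
    loss_g = -D(fake_wav).mean()
    loss_g.backward()
    opt_g.step()
    return loss_d.item(), loss_g.item()
```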
Step S5 of the present invention: testing the speech recognition module, the bottleneck-layer feature mapping module, the acoustic feature reconstruction module, and the vocoder module by connecting their input and output data in series, to complete the conversion from electronic larynx speech to enhanced speech.
Step S5 includes the following steps (a schematic composition of the four modules is sketched after the list):
step S501: collecting electronic larynx voices to obtain test input voice samples;
step S502: computing the MFCC features of the input speech sample, feeding them into the acoustic model neural network of the speech recognition module, and taking the activations of its information bottleneck layer as the bottleneck-layer features of the electronic larynx speech;
step S503: inputting the bottleneck characteristic of the electronic larynx voice into a bottleneck characteristic mapping module, and obtaining the estimated normal voice bottleneck characteristic through forward propagation;
step S504: inputting the normal voice bottleneck characteristics obtained by estimation into an acoustic characteristic reconstruction model, and obtaining Mel spectral characteristics corresponding to the enhanced voice through forward propagation;
step S505: inputting the reconstructed Mel spectrum characteristics into a generator neural network of the MelGAN vocoder module, and obtaining the electronic larynx enhanced voice signal through forward propagation.
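Putting steps S501 to S505 together, the test-stage cascade can be expressed as the following composition of the four trained modules; the callable names here (extract_mfcc, asr_bottleneck, and so on) are hypothetical wrappers, not interfaces defined by this embodiment.

```python
def enhance_el_speech(el_wav, extract_mfcc, asr_bottleneck, bnf_mapper,
                      acoustic_reconstructor, melgan_generator):
    """Cascade: EL speech -> EL bottleneck features -> estimated normal
    bottleneck features -> Mel spectrogram -> enhanced waveform."""
    mfcc = extract_mfcc(el_wav)                 # S502: frame-level MFCC features
    el_bnf = asr_bottleneck(mfcc)               # S502: information-bottleneck activations
    normal_bnf = bnf_mapper(el_bnf)             # S503: mapped normal-speech features
    mel = acoustic_reconstructor(normal_bnf)    # S504: reconstructed Mel spectrogram
    return melgan_generator(mel)                # S505: enhanced speech waveform
```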
Step S6: and performing an evaluation test on the test result.
In the evaluation of the test results, the evaluation covers two aspects, the naturalness and the intelligibility of the enhanced speech, and consists of an objective part and a subjective part. For comparison with the conversion system described in this embodiment, the unconverted electronic larynx speech and converted speech obtained with a statistical machine learning method, namely the Gaussian Mixture Model (GMM), are also included in the evaluation.
Objective evaluation: in this experiment, the similarity between the audio enhanced by the conversion system and the ideal target audio is evaluated objectively with the Mel-Cepstral Distortion (MCD), which serves as a measure of the naturalness of the generated speech. This parameter is calculated by the following formula:
\mathrm{MCD} = \frac{10}{\ln 10} \sqrt{ 2 \sum_{d=1}^{D} \left( mc_d^{(t)} - mc_d^{(e)} \right)^2 }

where mc_d^{(t)} and mc_d^{(e)} are the d-th order cepstral parameters of the target speech and of the enhanced speech, respectively. The larger the MCD value, the more severe the distortion between the two audio signals and the lower their similarity; conversely, a smaller MCD indicates higher similarity. The unit is decibels (dB).
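A direct implementation of this formula is sketched below; it assumes the two mel-cepstrum sequences have already been time-aligned (for example with DTW) and that the 0th (energy) coefficient has been removed, which is common practice but not stated explicitly in this embodiment.

```python
import numpy as np

def mel_cepstral_distortion(target_mc, enhanced_mc):
    """Frame-averaged MCD in dB between aligned mel-cepstrum sequences.

    target_mc, enhanced_mc: (frames, D) arrays of cepstral coefficients."""
    diff = target_mc - enhanced_mc
    # MCD = (10 / ln 10) * sqrt(2 * sum_d (mc_d_t - mc_d_e)^2), averaged over frames.
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))
```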
The objective evaluation also uses a speech recognition system to recognize the text content of the converted speech, and evaluates the intelligibility of the converted speech by computing the Character Error Rate (CER) between the recognition result and the actual text content.
Subjective evaluation:
in subjective evaluation, 15 volunteers are invited to audition a series of recorded or converted voices, and the naturalness, intelligibility and similarity with reference normal voices of corresponding audios are scored according to subjective judgment of the volunteers. The scale of the scale is: 5-very good, 4-good, 3-normal, 2-bad, 1-very bad. Four speech samples were subjected to this subjective scoring experiment: 1) source electronic larynx speech;
2) enhanced speech based on statistical machine learning methods; 3) enhancing voice based on the method of the invention; 4) the normal speech (reference speech) of the targeted speaker.
The experimental results are as follows:
objective evaluation results:
the MCD calculated from the original electronic larynx speech, the speech enhanced by the GMM method, and the enhanced speech and the target normal speech of the system of this embodiment is shown in table 1. Observing table 1, it can be seen that the system of the present embodiment can reduce the distortion rate of the electronic larynx speech and the target speech by 5.915 dB. Generally, MCD is considered to have correlation with human auditory perception, which proves that the system of this embodiment has significantly improved naturalness of electronic larynx speech compared with GMM system.
TABLE 1
The character error rates computed for the original electronic larynx speech, the speech enhanced by the GMM system, the speech enhanced by the system of this embodiment, the speech enhanced by this system without the bottleneck feature mapping module, and normal speech are shown in Table 2. The trend of the character error rates roughly matches that in Table 1; the system of this embodiment significantly improves (by 11.95%) the chance that electronic larynx speech is correctly recognized by a normal speech recognition system, i.e., it significantly improves intelligibility. In addition, comparing the results of the system with and without the bottleneck feature mapping module shows that the proposed separation of bottleneck-layer feature mapping from acoustic feature reconstruction effectively improves the intelligibility of the enhanced speech.
TABLE 2
Subjective evaluation results:
table 3 shows the volunteer subjective scoring results (mean) for electronic larynx speech, GMM system, the system and normal speech samples. It is easy to observe that in subjective evaluation, the system of the embodiment achieves significant improvement in naturalness, intelligibility and similarity compared with the results of unprocessed electronic larynx speech and enhancement using the conventional GMM system.
TABLE 3
It is noted that the same or similar reference numerals correspond to the same or similar components.
The positional relationships depicted in the drawings are for illustrative purposes only and should not be construed as limiting the present patent; the foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes or modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention. The embodiments and features of the embodiments of the present application may be combined with each other arbitrarily without conflict.

Claims (11)

1. A Chinese electronic larynx voice conversion device based on a deep neural network is characterized by comprising the following sub-modules:
the speech recognition module is used for extracting a derived linguistic feature sequence from the electronic larynx speech as a bottleneck layer feature;
the bottleneck characteristic mapping module is used for estimating the corresponding normal voice bottleneck layer characteristics according to the bottleneck layer characteristics extracted from the electronic larynx voice and predicting the tone and/or other linguistic information lacking in the electronic larynx voice;
the acoustic feature reconstruction module is used for synthesizing the estimated normal bottleneck layer features into an acoustic feature sequence corresponding to the specific voice;
a vocoder module for extracting acoustic features, using the Mel spectrum of speech as the acoustic feature, and for reconstructing the speech signal from the acoustic features.
2. The Chinese electronic larynx speech conversion device according to claim 1, wherein the speech recognition module, the bottleneck feature mapping module, the acoustic feature reconstruction module, and the vocoder module are each optimized independently using the data they require, and in the testing stage form a series connection of "electronic larynx speech - electronic larynx bottleneck-layer features - normal bottleneck-layer features - acoustic features - enhanced speech", so as to complete the conversion from electronic larynx speech to enhanced speech with high intelligibility and naturalness.
3. The apparatus as claimed in claim 1, wherein the speech recognition module is structurally divided into an acoustic model and a decoder, wherein the acoustic model adopts a multi-layer neural network to accept speech signal input and generate linguistic features of the speech signal; and the decoder receives the input of the linguistic features and outputs a speech recognition result, wherein the linguistic features are supervised phoneme N-gram posterior probability or implicit expression of a neural network without an explicit structure, and the bottleneck layer features refer to the forward propagation activation information of the acoustic model output layer or the similar layer.
4. The apparatus as claimed in claim 1, wherein the vocoder module comprises an acoustic feature extraction sub-module and a speech signal reconstruction sub-module, the acoustic feature extraction sub-module performs acoustic feature extraction by using extraction algorithms of signal pre-emphasis, short-time fourier transform, mel filtering and numerical statistics, and the speech signal reconstruction sub-module reconstructs speech signals by using a neural network vocoder designed based on deep learning technology.
5. The apparatus as recited in claim 1, wherein the acoustic feature reconstruction module models a plurality of reconstructed timbres using a multi-speaker speech conversion model to produce a plurality of timbres.
6. The Chinese electronic larynx speech conversion device according to claim 1, wherein the bottleneck feature mapping module employs a two-stage pre-training and fine-tuning mode: at the pre-training stage, large-scale normal speech corpora are converted into simulated electronic larynx speech with a constant excitation signal using a software simulation method, and pre-training is performed using the simulated electronic larynx - normal speech parallel corpus; at the fine-tuning stage, real parallel corpora obtained through recording are used for fine-tuning.
7. A Chinese electronic larynx speech conversion method based on a deep neural network is realized based on any one of the Chinese electronic larynx speech conversion devices of claims 1-6, and is characterized by comprising the following steps:
step S1: data acquisition is carried out by purposefully acquiring electronic larynx parallel corpora or using an internet existing corpus;
step S2: training a voice recognition module by using large-scale voice recognition corpora;
step S3: performing joint training on the bottleneck layer mapping module and the acoustic feature reconstruction module by using the electronic larynx-normal voice parallel corpus;
step S4: training the vocoder module with a generative adversarial network technique, using the Mel spectrum as the acoustic feature;
step S5: testing the voice recognition module, the bottleneck layer feature mapping module, the acoustic feature reconstruction module and the vocoder module by connecting input and output data of the voice recognition module, the bottleneck layer feature mapping module, the acoustic feature reconstruction module and the vocoder module in series to finish the conversion from electronic larynx voice to enhanced voice;
step S6: and performing an evaluation test on the test result.
8. The Chinese electronic larynx speech conversion method according to claim 7, wherein said step S2 includes:
step S201: preparing data in an existing corpus to finish data preprocessing;
step S202: training monophone and triphone alignment models and performing forced alignment on the speech-text pairs of the training data;
step S203: an acoustic model is trained using triphone alignment information.
9. The Chinese electronic larynx speech conversion method according to claim 7, wherein said step S3 includes:
step S301: taking electronic larynx parallel linguistic data as training data pairing, aligning by using a dynamic time warping technology, and respectively extracting bottleneck layer characteristics and acoustic characteristics;
step S302: matching the aligned and parallel electronic larynx speech bottleneck layer characteristics with the normal speech bottleneck layer characteristics, and training a neural network model of a bottleneck layer mapping module to be convergent by using a regression supervised learning method;
step S303: and (3) using aligned and parallel normal voice bottleneck layer feature-normal voice acoustic feature pairing, and using a regression and supervised learning method to train a neural network model of the acoustic feature reconstruction module to be converged.
10. The Chinese electronic larynx speech conversion method according to claim 7, wherein said step S4 comprises the following steps:
step S401: preparing corpus data and finishing data preprocessing;
step S402: simultaneously and alternately optimizing the generator and the discriminator with generative adversarial learning, where the generator's optimization goal is to make the discriminator's classification wrong, and the discriminator's optimization goal is to correctly distinguish normal speech from speech reconstructed by the generator.
11. The Chinese electronic larynx speech conversion method according to claim 7, wherein said step S5 comprises the following steps:
step S501: collecting electronic larynx voices to obtain test input voice samples;
step S502: computing the MFCC features of the input speech sample, feeding them into the acoustic model neural network of the speech recognition module, and taking the activations of its information bottleneck layer as the bottleneck-layer features of the electronic larynx speech;
step S503: inputting the bottleneck characteristic of the electronic larynx voice into a bottleneck characteristic mapping module, and obtaining the estimated normal voice bottleneck characteristic through forward propagation;
step S504: inputting the normal voice bottleneck characteristics obtained by estimation into an acoustic characteristic reconstruction model, and obtaining Mel spectral characteristics corresponding to the enhanced voice through forward propagation;
step S505: inputting the reconstructed Mel spectrum characteristics into a generator neural network of a vocoder module, and obtaining an electronic larynx enhanced voice signal through forward propagation.
CN202210180441.9A 2022-02-25 2022-02-25 Deep neural network-based Chinese electronic larynx voice conversion device and method Pending CN114550701A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210180441.9A CN114550701A (en) 2022-02-25 2022-02-25 Deep neural network-based Chinese electronic larynx voice conversion device and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210180441.9A CN114550701A (en) 2022-02-25 2022-02-25 Deep neural network-based Chinese electronic larynx voice conversion device and method

Publications (1)

Publication Number Publication Date
CN114550701A true CN114550701A (en) 2022-05-27

Family

ID=81679930

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210180441.9A Pending CN114550701A (en) 2022-02-25 2022-02-25 Deep neural network-based Chinese electronic larynx voice conversion device and method

Country Status (1)

Country Link
CN (1) CN114550701A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115294970A (en) * 2022-10-09 2022-11-04 苏州大学 Voice conversion method, device and storage medium for pathological voice

Similar Documents

Publication Publication Date Title
CN103928023B (en) A kind of speech assessment method and system
CN109767778B (en) Bi-L STM and WaveNet fused voice conversion method
CN113436606B (en) Original sound speech translation method
CN106448673B (en) chinese electronic larynx speech conversion method
CN104992707A (en) Cleft palate voice glottal stop automatic identification algorithm and device
Janke et al. A spectral mapping method for EMG-based recognition of silent speech
Murugappan et al. DWT and MFCC based human emotional speech classification using LDA
CN111326170B (en) Method and device for converting ear voice into normal voice by combining time-frequency domain expansion convolution
CN112382308A (en) Zero-order voice conversion system and method based on deep learning and simple acoustic features
CN105788608A (en) Chinese initial consonant and compound vowel visualization method based on neural network
Zahner et al. Conversion from facial myoelectric signals to speech: a unit selection approach
CN114550701A (en) Deep neural network-based Chinese electronic larynx voice conversion device and method
CN116364096B (en) Electroencephalogram signal voice decoding method based on generation countermeasure network
Yang et al. Electrolaryngeal speech enhancement based on a two stage framework with bottleneck feature refinement and voice conversion
Padmini et al. Age-Based Automatic Voice Conversion Using Blood Relation for Voice Impaired.
JP4381404B2 (en) Speech synthesis system, speech synthesis method, speech synthesis program
CN114283822A (en) Many-to-one voice conversion method based on gamma pass frequency cepstrum coefficient
TWI780738B (en) Abnormal articulation corpus amplification method and system, speech recognition platform, and abnormal articulation auxiliary device
Shah et al. Non-audible murmur to audible speech conversion
Mantilla-Caeiros et al. A pattern recognition based esophageal speech enhancement system
Lv et al. Objective evaluation method of broadcasting vocal timbre based on feature selection
Zheng et al. Bandwidth extension WaveNet for bone-conducted speech enhancement
Sahoo et al. Detection of speech-based physical load using transfer learning approach
Chadha et al. Analysis of a modern voice morphing approach using gaussian mixture models for laryngectomees
Zhu et al. A study of the robustness of raw waveform based speaker embeddings under mismatched conditions

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination