CN110808026B - Electroglottography voice conversion method based on LSTM - Google Patents

Electroglottography voice conversion method based on LSTM

Info

Publication number
CN110808026B
CN110808026B (application CN201911065541.1A)
Authority
CN
China
Prior art keywords
voice
phoneme
model
standard
lstm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911065541.1A
Other languages
Chinese (zh)
Other versions
CN110808026A (en)
Inventor
陈立江
王龙
张井合
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jinhua Hangda Beidou Application Technology Co ltd
Original Assignee
Jinhua Hangda Beidou Application Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jinhua Hangda Beidou Application Technology Co ltd filed Critical Jinhua Hangda Beidou Application Technology Co ltd
Priority to CN201911065541.1A priority Critical patent/CN110808026B/en
Publication of CN110808026A publication Critical patent/CN110808026A/en
Application granted granted Critical
Publication of CN110808026B publication Critical patent/CN110808026B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G10L25/30 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00; characterised by the analysis technique; using neural networks
    • G10L13/02 — Speech synthesis; Text to speech systems; Methods for producing synthetic speech; Speech synthesisers
    • G10L25/27 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00; characterised by the analysis technique
    (Parent classes: G — PHYSICS; G10 — MUSICAL INSTRUMENTS; ACOUSTICS; G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING)

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention provides an LSTM-based electroglottography voice conversion method: features are first extracted from the electroglottogram and spliced, a similarity measure between the converted speech and the standard speech is designed, a phoneme prediction model is then trained, and finally the trained model predicts the current phoneme from the feature sequence converted from the electroglottogram and synthesizes speech. The invention extracts and splices the electroglottogram features and combines an LSTM network with the standard phoneme sequence obtained by decomposing standard speech data to obtain a prediction model that takes the electroglottogram feature sequence as input and outputs the predicted current phoneme. By designing a method for measuring the similarity between standard speech and converted speech as the loss function used to train the model, the invention solves the difficulty of evaluating the prediction performance of the trained model; a Klatt formant speech synthesizer whose formant filters are configured accordingly is used to obtain real speech.

Description

Electroglottography voice conversion method based on LSTM
Technical Field
The invention describes an LSTM-based electroglottography voice conversion method that predicts the speech to be synthesized at the current moment from electroglottogram data acquired at the current and past moments, and belongs to the field of computer technology.
Background
An electroglottogram (EGG) records the vocal-fold movement of the larynx during speech, collected by two electrodes placed on the throat. It is highly correlated with the speech a person produces, and features extracted from the vocal-fold movement information can be used to recover the corresponding speech.
Formant speech synthesis is currently one of the more mature speech synthesis techniques. It exploits the resonance of the vocal tract in response to the speech excitation: each formant frequency of the vocal tract and its bandwidth are extracted as parameters to form a formant filter. By configuring the parameters of the formant filters, different speech sounds can be controlled and synthesized.
In practice, many patients have difficulty producing sound for various reasons while their vocal folds can still vibrate. If speech can be synthesized from such a patient's electroglottogram, it would greatly help the patient regain the ability to communicate.
Disclosure of Invention
In order to recover speech from electroglottogram data, the invention proposes an LSTM-based electroglottography voice conversion method.
The invention provides an LSTM-based electroglottography voice conversion method comprising the following steps.
Step A: extract features from the electroglottogram and splice them.
The electroglottogram detects the closing and opening of the vocal folds by measuring the impedance across them as they vibrate; it reflects the regularity of vocal-fold vibration and contains rich speech-related features. To predict speech, the fundamental frequency, energy per unit time, frequency perturbation (jitter) and amplitude perturbation (shimmer) of the electroglottogram signal are selected and extracted as training features. The electroglottogram signal is a one-dimensional signal over time; it is divided into frames of 20 ms, the fundamental frequency, energy per unit time, frequency perturbation and amplitude perturbation are calculated within each frame, and these features are spliced with the features calculated for the previous 9 frames, so that the electroglottogram signal is converted into a 40-dimensional feature sequence.
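By way of illustration only, the framing and splicing described above can be sketched as follows; the concrete pitch, jitter and shimmer estimators, the function names and the 8 kHz sampling rate are assumptions for this sketch, not part of the described embodiment.

    import numpy as np

    FRAME_MS = 20   # frame length used above
    HISTORY = 10    # current frame plus the previous 9 frames -> 4 x 10 = 40 dimensions

    def frame_features(frame, sr):
        """Fundamental frequency, energy per unit time, jitter and shimmer of one frame.
        The estimators below are illustrative assumptions, not the ones used in the embodiment."""
        energy = float(np.sum(frame ** 2) / (len(frame) / sr))
        # crude autocorrelation pitch estimate over a 70-400 Hz search range (assumed)
        ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
        lo, hi = int(sr / 400), int(sr / 70)
        f0 = sr / (lo + int(np.argmax(ac[lo:hi]))) if hi < len(ac) else 0.0
        # cycle-to-cycle perturbations approximated from zero crossings (assumed)
        zc = np.where(np.diff(np.signbit(frame)))[0]
        periods = np.diff(zc) / sr
        peaks = np.array([np.max(np.abs(frame[a:b])) for a, b in zip(zc[:-1], zc[1:])])
        jitter = float(np.mean(np.abs(np.diff(periods))) / (np.mean(periods) + 1e-9)) if len(periods) > 1 else 0.0
        shimmer = float(np.mean(np.abs(np.diff(peaks))) / (np.mean(peaks) + 1e-9)) if len(peaks) > 1 else 0.0
        return np.array([f0, energy, jitter, shimmer], dtype=np.float32)

    def egg_to_feature_sequence(egg, sr=8000):
        """Split the EGG signal into 20 ms frames and splice each frame's four features
        with those of the previous 9 frames, giving one 40-dimensional vector per frame."""
        hop = int(sr * FRAME_MS / 1000)
        per_frame = [frame_features(egg[i:i + hop], sr)
                     for i in range(0, len(egg) - hop + 1, hop)]
        spliced = [np.concatenate(per_frame[t - HISTORY + 1:t + 1])
                   for t in range(HISTORY - 1, len(per_frame))]
        return np.stack(spliced) if spliced else np.empty((0, 4 * HISTORY), dtype=np.float32)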
Step B: design the similarity measure between the converted speech and the standard speech.
A method for calculating the similarity between the synthesized speech and the standard speech is designed. The standard speech used in the similarity calculation is not sampled real speech but the phoneme sequence obtained by decomposing the standard speech; likewise, the synthesized speech is not actual synthesized audio but the predicted phoneme sequence output by the model. By serializing the standard speech and the synthesized speech as phonemes, the speech synthesis problem is turned into the problem of predicting the phoneme at the current moment, and the similarity between the synthesized and the standard speech becomes the similarity between the standard phoneme sequence and the predicted phoneme sequence. Cross entropy is adopted as the measure of similarity between the two sequences: the larger the cross entropy, the lower the similarity.
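A minimal sketch of this cross-entropy similarity between the standard phoneme sequence and the predicted sequence, assuming 32 one-hot-encoded Mandarin phoneme classes (the class inventory itself is not enumerated here); the function names are illustrative.

    import numpy as np

    NUM_PHONEMES = 32  # basic Mandarin phonemes used throughout the description

    def one_hot(phoneme_ids, num_classes=NUM_PHONEMES):
        """Encode a standard phoneme sequence (integer ids) as one-hot vectors."""
        return np.eye(num_classes, dtype=np.float32)[phoneme_ids]

    def sequence_cross_entropy(standard_ids, predicted_probs):
        """Mean cross entropy between the standard phoneme sequence and the model's
        predicted probability vectors; a smaller value means higher similarity."""
        targets = one_hot(np.asarray(standard_ids))        # (T, 32)
        eps = 1e-12
        return float(-np.mean(np.sum(targets * np.log(predicted_probs + eps), axis=1)))

    # usage: three frames whose standard phonemes are ids 4, 4 and 17
    # probs = model_output            # shape (3, 32), each row sums to 1
    # loss = sequence_cross_entropy([4, 4, 17], probs)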
Step C: train the phoneme prediction model.
The application also provides a phoneme prediction model based on the LSTM (long short-term memory) network, a special kind of RNN. A traditional RNN updates its parameters with back-propagation through time (BPTT); when the time interval becomes long, the back-propagated gradient decays exponentially, causing the vanishing-gradient problem, so the network parameters update slowly and training is difficult to converge. The LSTM network was proposed to overcome this difficulty of traditional RNNs in retaining long-term memory.
First, a large corpus is prepared and phoneme sequences are extracted from it as standard data; electroglottogram data corresponding to the corpus are collected from several patients and converted into feature sequences as the model's training data. Combining the feature sequences generated from the electroglottogram signal with the LSTM network makes it possible both to train the prediction model and to use it to predict phonemes.
During training, the electroglottogram feature sequences corresponding to a batch of corpus sentences are fed into the LSTM network to obtain a predicted phoneme sequence; cross entropy is used as the loss function, and the model is optimized with back-propagation combined with a learning-rate-adaptive algorithm.
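A minimal PyTorch sketch of such an LSTM phoneme prediction model, assuming a 40-dimensional feature input and 32 output phoneme classes; the hidden size and number of layers are illustrative, as they are not specified above.

    import torch
    import torch.nn as nn

    class PhonemePredictor(nn.Module):
        """LSTM mapping a 40-dim electroglottogram feature sequence to a per-frame
        distribution over 32 Mandarin phonemes (hidden size and depth are assumptions)."""
        def __init__(self, feat_dim=40, hidden=128, num_phonemes=32):
            super().__init__()
            self.lstm = nn.LSTM(feat_dim, hidden, num_layers=2, batch_first=True)
            self.head = nn.Linear(hidden, num_phonemes)

        def forward(self, features):
            # features: (batch, time, 40) -> logits: (batch, time, 32)
            out, _ = self.lstm(features)
            return self.head(out)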
Step D: predict the current phoneme from the feature sequence converted from the electroglottogram using the trained model, and synthesize the speech.
The synthesizer used in practice is the Klatt formant speech synthesizer, which generates different speech sounds by controlling six formants. It uses a cascade branch to generate vowels and a parallel branch to generate consonants; by configuring the parameters of the parallel and cascade filters and the state of the voiced/unvoiced switch, the corresponding speech can be synthesized. The 32 basic phonemes of Mandarin Chinese and their Klatt synthesizer parameters are stored in advance as key-value pairs in a dictionary; given the prediction model's prediction of the current phoneme, the phoneme's configuration parameters are read directly from the dictionary to configure the Klatt synthesizer, yielding the real speech corresponding to that phoneme and realizing the conversion from the electroglottogram signal to a real speech signal.
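The following sketch illustrates the dictionary lookup and a cascade formant resonator of the kind used in Klatt-style synthesis. The KLATT_PARAMS table and its values are hypothetical placeholders, not the stored parameter dictionary of the embodiment, and only one entry is shown.

    import numpy as np

    # Hypothetical parameter table: phoneme -> formant settings (placeholder values).
    KLATT_PARAMS = {
        "a": {"formants_hz": [800, 1200, 2500, 3500, 4500],
              "bandwidths_hz": [80, 90, 120, 150, 200],
              "voiced": True},
    }

    def _resonator(x, f, bw, sr):
        """Second-order digital resonator, the building block of a cascade formant filter."""
        r = np.exp(-np.pi * bw / sr)
        theta = 2 * np.pi * f / sr
        gain = 1 - 2 * r * np.cos(theta) + r * r
        y = np.zeros_like(x, dtype=float)
        for n in range(len(x)):
            y[n] = gain * x[n] + 2 * r * np.cos(theta) * y[n - 1] - r * r * y[n - 2]
        return y

    def synthesize_phoneme(phoneme, f0_hz=120.0, duration_s=0.02, sr=8000):
        """Look up the predicted phoneme's parameters and drive a cascade of formant
        resonators with an impulse train (voiced) or noise (unvoiced)."""
        p = KLATT_PARAMS[phoneme]
        n = int(duration_s * sr)
        if p["voiced"]:
            src = np.zeros(n)
            src[::max(1, int(sr / f0_hz))] = 1.0   # glottal impulse train at f0
        else:
            src = np.random.randn(n) * 0.1         # noise source for unvoiced sounds
        y = src
        for f, bw in zip(p["formants_hz"], p["bandwidths_hz"]):
            y = _resonator(y, f, bw, sr)
        return y / (np.max(np.abs(y)) + 1e-9)

The parallel (consonant) branch and the voiced/unvoiced mixing of the full Klatt model are omitted here for brevity.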
The invention provides an LSTM-based electroglottography voice conversion method: features are first extracted from the electroglottogram and spliced, a similarity measure between the converted speech and the standard speech is designed, a phoneme prediction model is then trained, and finally the trained model predicts the current phoneme from the feature sequence converted from the electroglottogram and synthesizes speech. The invention extracts and splices the electroglottogram features and combines an LSTM network with the standard phoneme sequence obtained by decomposing standard speech data to obtain a prediction model that takes the electroglottogram feature sequence as input and outputs the predicted current phoneme, and, by designing a method for measuring the similarity between standard speech and converted speech as the loss function used to train the model, solves the difficulty of evaluating the prediction performance of the trained model.
Drawings
FIG. 1 is an overall flow chart of the LSTM-based electroglottography voice conversion method proposed by the invention;
FIG. 2 is a flow chart of converting the electroglottogram signal into a feature sequence according to the invention;
FIG. 3 is a flow chart of calculating the difference between the converted speech and the standard speech according to the invention;
FIG. 4 is a flow chart of the phoneme prediction model training process proposed by the invention;
FIG. 5 is a flow chart of synthesizing real speech from the predicted phonemes with the Klatt synthesizer according to the invention.
Detailed Description
As used in the specification and in the claims, certain terms are used to refer to particular components. As one skilled in the art will appreciate, manufacturers may refer to a component by different names. This specification and claims do not intend to distinguish between components that differ in name but not function. The following description is of the preferred embodiment for carrying out the invention, and is made for the purpose of illustrating the general principles of the invention and not for the purpose of limiting the scope of the invention. The scope of the present invention is defined by the appended claims.
The present invention will be described in further detail below with reference to the accompanying drawings, but the present invention is not limited thereto.
Examples
The invention provides an LSTM-based electroglottography voice conversion method comprising the steps of:
Step A: extracting features from the electroglottogram and splicing them;
Step B: designing the similarity between the converted speech and the standard speech;
Step C: training a phoneme prediction model;
Step D: predicting the current phoneme from the feature sequence converted from the electroglottogram using the trained model, and synthesizing the speech.
As shown in fig. 1, the LSTM-based electroglottography voice conversion method first extracts features from the electroglottogram and converts them into 40-dimensional feature sequences, which are input to the model; the phoneme sequences of the same time period serve as labels, and the cross entropy between the standard phoneme sequence and the predicted sequence serves as the loss function; training continues until the loss function converges, at which point the prediction model is trained. During electroglottogram-to-speech conversion, the electroglottogram is converted into a feature sequence and input to the prediction model, the model outputs the predicted phoneme, and the Klatt synthesizer configuration parameters for that phoneme are looked up in the dictionary to configure the Klatt synthesizer, thereby generating the real speech corresponding to the electroglottogram.
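Tying the sketches together, an illustrative end-to-end inference path might look as follows; it reuses the assumed helpers egg_to_feature_sequence, PhonemePredictor and synthesize_phoneme sketched earlier, and phoneme_names is an assumed list mapping model output indices to dictionary keys covering all 32 phonemes.

    import numpy as np
    import torch

    def egg_to_speech(egg_signal, model, phoneme_names, sr=8000):
        """Electroglottogram -> 40-dim feature sequence -> predicted phoneme per frame
        -> Klatt-style synthesis, reusing the illustrative helpers sketched above."""
        feats = egg_to_feature_sequence(egg_signal, sr)             # (T, 40)
        if len(feats) == 0:
            return np.zeros(0)
        with torch.no_grad():
            logits = model(torch.from_numpy(feats).unsqueeze(0))    # (1, T, 32)
        phoneme_ids = logits.argmax(dim=-1).squeeze(0).tolist()
        frames = []
        for t, idx in enumerate(phoneme_ids):
            f0 = float(feats[t, -4])                                # current frame's f0 feature
            frames.append(synthesize_phoneme(phoneme_names[idx],
                                             f0_hz=f0 if f0 > 0 else 120.0, sr=sr))
        return np.concatenate(frames)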
As shown in fig. 2, in the feature extraction and splicing of step A, the electroglottogram signal is first sampled at a rate of 8 kHz. The sampled electroglottogram data are divided into frames of 20 ms and filtered; the fundamental frequency, energy per unit time, frequency perturbation and amplitude perturbation of each frame are calculated, and the features of the frame are spliced with the results of the previous 9 frames to form the 40-dimensional feature vector of that frame.
As shown in fig. 3, the similarity between the converted speech and the standard speech in step B is calculated as follows. The standard speech is first converted into a standard Mandarin phoneme sequence, and each phoneme is one-hot encoded, i.e., converted into a 32-dimensional vector. The prediction model predicts the current phoneme as a 32-dimensional probability vector and outputs the phoneme with the highest probability. The cross entropy of these two vectors measures the similarity between the converted speech and the standard speech: the smaller the cross entropy, the higher the similarity of the two sequences and the better the prediction performance of the model.
As shown in fig. 4, the phoneme prediction model is trained in step C. The electroglottogram data and the corpus in the database are first converted into electroglottogram feature sequences and standard phoneme sequences; the feature sequences serve as the training input of the prediction model and the standard phoneme sequences as the training labels. The loss function is designed using the method described in step B. For parameter optimization, the model is trained in batches: 128 randomly selected sentences form one batch of training data, and the adaptive moment estimation (Adam) method is used to adapt the learning rate.
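A minimal PyTorch training loop matching the batch size and optimizer described above, assuming the PhonemePredictor sketched earlier and pre-built tensors of feature sequences and phoneme-id labels; sentence-level batching and other corpus details are simplified.

    import torch
    import torch.nn as nn
    from torch.utils.data import DataLoader, TensorDataset

    def train(features, labels, epochs=10):
        """features: (N, T, 40) float tensor of electroglottogram feature sequences;
        labels: (N, T) long tensor of standard phoneme ids. Batches of 128, Adam, cross entropy."""
        model = PhonemePredictor()
        opt = torch.optim.Adam(model.parameters(), lr=1e-3)
        loss_fn = nn.CrossEntropyLoss()
        loader = DataLoader(TensorDataset(features, labels), batch_size=128, shuffle=True)
        for epoch in range(epochs):
            total = 0.0
            for x, y in loader:
                logits = model(x)                                   # (B, T, 32)
                loss = loss_fn(logits.reshape(-1, 32), y.reshape(-1))
                opt.zero_grad()
                loss.backward()
                opt.step()
                total += loss.item()
            print(f"epoch {epoch}: mean loss {total / len(loader):.4f}")
        return model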
In step D, the trained model predicts the current phoneme from the feature sequence converted from the electroglottogram, and speech is synthesized from the prediction. Fig. 5 shows the flow of converting the phonemes predicted by the prediction model into real speech. The speech synthesis apparatus used in this application is the Klatt formant synthesizer, a hybrid speech synthesizer that can synthesize different sounds by configuring the parameters of its cascade and parallel filters. The 32 standard Mandarin basic phonemes and their Klatt formant synthesizer configuration parameters are stored in advance as key-value pairs in a dictionary. During electroglottogram-to-speech conversion, the configuration parameters are retrieved from the dictionary with the predicted phoneme as the key and used to configure the Klatt formant synthesizer, yielding the speech corresponding to that phoneme.
The invention provides an LSTM-based electroglottography voice conversion method: features are first extracted from the electroglottogram and spliced, a similarity measure between the converted speech and the standard speech is designed, a phoneme prediction model is then trained, and finally the trained model predicts the current phoneme from the feature sequence converted from the electroglottogram and synthesizes speech. The invention extracts and splices the electroglottogram features and combines an LSTM network with the standard phoneme sequence obtained by decomposing standard speech data to obtain a prediction model that takes the electroglottogram feature sequence as input and outputs the predicted current phoneme, and, by designing a method for measuring the similarity between standard speech and converted speech as the loss function used to train the model, solves the difficulty of evaluating the prediction performance of the trained model.
While the foregoing shows and describes the preferred embodiments of the present invention, it should be understood that the invention is not limited to the forms disclosed herein; the above description is not intended to exclude other embodiments, and the invention may be used in various other combinations, modifications and environments, and may be changed within the scope of the inventive concept described herein in light of the above teachings or the skill or knowledge of the relevant art. Modifications and variations made by those skilled in the art without departing from the spirit and scope of the invention shall fall within the scope of the appended claims.

Claims (5)

1. An LSTM-based electroglottography voice conversion method, comprising the steps of:
a: extracting characteristics of the electroglottography and splicing;
b: designing the similarity of the converted voice and the standard voice;
c: training a phoneme prediction model;
d: pre-encoding current phonemes using trained models and feature sequences transformed from electroglottography
And then the voice is synthesized.
2. An LSTM-based electroglottography voice conversion method as claimed in claim 1, wherein in step A, extracting and splicing the electroglottogram features comprises framing the electroglottogram with a length of 20 ms, extracting features of each frame including but not limited to fundamental frequency, energy per unit time, frequency perturbation and amplitude perturbation, and splicing the features extracted from every ten adjacent frames, thereby converting the electroglottogram into a feature sequence.
3. An LSTM-based electroglottography voice conversion method as claimed in claim 2, wherein in step B, designing the similarity between the converted speech and the standard speech comprises converting the standard speech into a standard Mandarin Chinese phoneme sequence and calculating its similarity with the output sequence of the prediction model using cross entropy.
4. An LSTM-based electroglottography voice conversion method according to any one of claims 1-3, wherein in step C, training the phoneme prediction model comprises using an LSTM network as the prediction model, with the electroglottogram feature sequence as the training input, the phoneme prediction sequence as the model output and the standard phoneme sequence as the training label, and training the prediction model with cross entropy as the loss function.
5. An LSTM-based electroglottography voice conversion method according to claim 4, wherein in step D, the trained model predicts the current phoneme from the feature sequence converted from the electroglottogram and speech is synthesized from the prediction: the electroglottogram is first converted into a feature sequence and input to the prediction model, the model outputs the predicted phoneme, the parameters corresponding to that phoneme are retrieved from a dictionary storing the 32 basic Mandarin phonemes and their synthesis parameters, and a Klatt formant speech synthesis model is configured accordingly, thereby realizing the conversion from the electroglottogram to speech.
CN201911065541.1A 2019-11-04 2019-11-04 Electroglottography voice conversion method based on LSTM Active CN110808026B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911065541.1A CN110808026B (en) 2019-11-04 2019-11-04 Electroglottography voice conversion method based on LSTM

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911065541.1A CN110808026B (en) 2019-11-04 2019-11-04 Electroglottography voice conversion method based on LSTM

Publications (2)

Publication Number Publication Date
CN110808026A CN110808026A (en) 2020-02-18
CN110808026B true CN110808026B (en) 2022-08-23

Family

ID=69501069

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911065541.1A Active CN110808026B (en) 2019-11-04 2019-11-04 Electroglottography voice conversion method based on LSTM

Country Status (1)

Country Link
CN (1) CN110808026B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112069816A (en) * 2020-09-14 2020-12-11 深圳市北科瑞声科技股份有限公司 Chinese punctuation adding method, system and equipment
CN113409809B (en) * 2021-07-07 2023-04-07 上海新氦类脑智能科技有限公司 Voice noise reduction method, device and equipment


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8744854B1 (en) * 2012-09-24 2014-06-03 Chengjun Julian Chen System and method for voice transformation

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104538024A (en) * 2014-12-01 2015-04-22 百度在线网络技术(北京)有限公司 Speech synthesis method, apparatus and equipment
CN106057192A (en) * 2016-07-07 2016-10-26 Tcl集团股份有限公司 Real-time voice conversion method and apparatus
WO2018209556A1 (en) * 2017-05-16 2018-11-22 Beijing Didi Infinity Technology And Development Co., Ltd. System and method for speech synthesis
CN108766413A (en) * 2018-05-25 2018-11-06 北京云知声信息技术有限公司 Phoneme synthesizing method and system
CN108836574A (en) * 2018-06-20 2018-11-20 广州智能装备研究院有限公司 It is a kind of to utilize neck vibrator work intelligent sounding system and its vocal technique
CN108831463A (en) * 2018-06-28 2018-11-16 广州华多网络科技有限公司 Lip reading synthetic method, device, electronic equipment and storage medium
CN109599092A (en) * 2018-12-21 2019-04-09 秒针信息技术有限公司 A kind of audio synthetic method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Hu Qiong et al., "Research on obtaining glottal waves of high naturalness using inverse filtering and the phase plane," Audio Engineering (《电声技术》), 2011, No. 05, pp. 59-63, 73. *
Chen Lijiang et al., "Research on speech synthesis combined with the electroglottogram," Proceedings of the 12th National Conference on Man-Machine Speech Communication (《第十二届全国人机语音通讯学术会议》), 2013. *

Also Published As

Publication number Publication date
CN110808026A (en) 2020-02-18

Similar Documents

Publication Publication Date Title
CN111489734B (en) Model training method and device based on multiple speakers
Yu et al. DurIAN: Duration Informed Attention Network for Speech Synthesis.
JP7464621B2 (en) Speech synthesis method, device, and computer-readable storage medium
CN112037754B (en) Method for generating speech synthesis training data and related equipment
Tokuda et al. Speech synthesis based on hidden Markov models
CN109147758A (en) A kind of speaker's sound converting method and device
CN111179905A (en) Rapid dubbing generation method and device
JP2000504849A (en) Speech coding, reconstruction and recognition using acoustics and electromagnetic waves
CN106971709A (en) Statistic parameter model method for building up and device, phoneme synthesizing method and device
Yin et al. Modeling F0 trajectories in hierarchically structured deep neural networks
CN110808026B (en) Electroglottography voice conversion method based on LSTM
CN113450761A (en) Parallel speech synthesis method and device based on variational self-encoder
Reddy et al. Inverse filter based excitation model for HMM‐based speech synthesis system
JPH1165590A (en) Voice recognition dialing device
Prasad et al. Backend tools for speech synthesis in speech processing
JPWO2010104040A1 (en) Speech synthesis apparatus, speech synthesis method and speech synthesis program based on one model speech recognition synthesis
CN114708876A (en) Audio processing method and device, electronic equipment and storage medium
US11670292B2 (en) Electronic device, method and computer program
CN114203151A (en) Method, device and equipment for training speech synthesis model
Takaki et al. Overview of NIT HMM-based speech synthesis system for Blizzard Challenge 2012
Eshghi et al. An Investigation of Features for Fundamental Frequency Pattern Prediction in Electrolaryngeal Speech Enhancement
JAIN Advanced Feature Extraction and Its Implementation in Speech Recognition System
Vargas et al. Cascade prediction filters with adaptive zeros to track the time-varying resonances of the vocal tract
Dalva Automatic speech recognition system for Turkish spoken language
CN116168687B (en) Voice data processing method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant