CN110808026B - Electroglottography voice conversion method based on LSTM - Google Patents

Electroglottography voice conversion method based on LSTM

Info

Publication number
CN110808026B
CN110808026B (application CN201911065541.1A)
Authority
CN
China
Prior art keywords
voice
phoneme
model
standard
lstm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911065541.1A
Other languages
Chinese (zh)
Other versions
CN110808026A (en)
Inventor
陈立江
王龙
张井合
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jinhua Hangda Beidou Application Technology Co ltd
Original Assignee
Jinhua Hangda Beidou Application Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jinhua Hangda Beidou Application Technology Co ltd filed Critical Jinhua Hangda Beidou Application Technology Co ltd
Priority to CN201911065541.1A priority Critical patent/CN110808026B/en
Publication of CN110808026A publication Critical patent/CN110808026A/en
Application granted granted Critical
Publication of CN110808026B publication Critical patent/CN110808026B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G10L25/30 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00; characterised by the analysis technique; using neural networks
    • G10L13/02 — Speech synthesis; Text to speech systems; Methods for producing synthetic speech; Speech synthesisers
    • G10L25/27 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00; characterised by the analysis technique
    (Parent classes: G — PHYSICS; G10 — MUSICAL INSTRUMENTS; ACOUSTICS; G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING)

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention provides an LSTM-based electroglottography voice conversion method: features are first extracted from the electroglottogram and spliced, a similarity measure between the converted speech and the standard speech is designed, a phoneme prediction model is then trained, and finally the trained model predicts the current phoneme from the feature sequence converted from the electroglottogram and synthesizes speech. The invention extracts and splices the electroglottogram features and combines an LSTM network with the standard phoneme sequence obtained by decomposing standard speech data to obtain a prediction model that takes the electroglottogram feature sequence as input and outputs the predicted current phoneme. By designing a method for measuring the similarity between standard speech and converted speech as the loss function used to train the model, the invention solves the difficulty of evaluating the prediction performance of the trained model; a Klatt formant speech synthesizer whose formant filters are configured accordingly is used to obtain real speech.

Description

Electroglottography voice conversion method based on LSTM
Technical Field
The invention describes an LSTM-based electroglottography voice conversion method that predicts the speech to be synthesized at the current moment from electroglottogram data acquired at the current and past moments, and belongs to the field of computer technology.
Background
An electroglottogram (EGG) records the vocal-fold movement of the larynx during speech, collected by two electrodes placed on the throat. It is highly correlated with the speech a person produces, and features extracted from the vocal-fold movement information can be used to recover the corresponding speech.
Formant speech synthesis is currently one of the more mature speech synthesis techniques. It exploits the resonance of the vocal tract in response to the speech excitation: each formant frequency of the vocal tract and its bandwidth are extracted as parameters to form a formant filter. By configuring the parameters of the formant filters, different speech sounds can be controlled and synthesized.
In practice, many patients have difficulty producing sound for various reasons while their vocal folds can still vibrate. If speech can be synthesized from such a patient's electroglottogram, it would greatly help the patient regain the ability to communicate.
Disclosure of Invention
In order to recover speech from electroglottogram data, the invention proposes an LSTM-based electroglottography voice conversion method.
The invention provides an LSTM-based electroglottography voice conversion method comprising the following steps.
Step A: extract features from the electroglottogram and splice them.
The electroglottogram detects the closing and opening of the vocal folds by measuring the impedance across them as they vibrate; it reflects the regularity of vocal-fold vibration and contains rich speech-related features. To predict speech, the fundamental frequency, energy per unit time, frequency perturbation (jitter) and amplitude perturbation (shimmer) of the electroglottogram signal are selected and extracted as training features. The electroglottogram signal is a one-dimensional signal over time; it is divided into frames of 20 ms, the fundamental frequency, energy per unit time, frequency perturbation and amplitude perturbation are calculated within each frame, and these features are spliced with the features calculated for the previous 9 frames, so that the electroglottogram signal is converted into a 40-dimensional feature sequence.
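By way of illustration only, the framing and splicing described above can be sketched as follows; the concrete pitch, jitter and shimmer estimators, the function names and the 8 kHz sampling rate are assumptions for this sketch, not part of the described embodiment.

    import numpy as np

    FRAME_MS = 20   # frame length used above
    HISTORY = 10    # current frame plus the previous 9 frames -> 4 x 10 = 40 dimensions

    def frame_features(frame, sr):
        """Fundamental frequency, energy per unit time, jitter and shimmer of one frame.
        The estimators below are illustrative assumptions, not the ones used in the embodiment."""
        energy = float(np.sum(frame ** 2) / (len(frame) / sr))
        # crude autocorrelation pitch estimate over a 70-400 Hz search range (assumed)
        ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
        lo, hi = int(sr / 400), int(sr / 70)
        f0 = sr / (lo + int(np.argmax(ac[lo:hi]))) if hi < len(ac) else 0.0
        # cycle-to-cycle perturbations approximated from zero crossings (assumed)
        zc = np.where(np.diff(np.signbit(frame)))[0]
        periods = np.diff(zc) / sr
        peaks = np.array([np.max(np.abs(frame[a:b])) for a, b in zip(zc[:-1], zc[1:])])
        jitter = float(np.mean(np.abs(np.diff(periods))) / (np.mean(periods) + 1e-9)) if len(periods) > 1 else 0.0
        shimmer = float(np.mean(np.abs(np.diff(peaks))) / (np.mean(peaks) + 1e-9)) if len(peaks) > 1 else 0.0
        return np.array([f0, energy, jitter, shimmer], dtype=np.float32)

    def egg_to_feature_sequence(egg, sr=8000):
        """Split the EGG signal into 20 ms frames and splice each frame's four features
        with those of the previous 9 frames, giving one 40-dimensional vector per frame."""
        hop = int(sr * FRAME_MS / 1000)
        per_frame = [frame_features(egg[i:i + hop], sr)
                     for i in range(0, len(egg) - hop + 1, hop)]
        spliced = [np.concatenate(per_frame[t - HISTORY + 1:t + 1])
                   for t in range(HISTORY - 1, len(per_frame))]
        return np.stack(spliced) if spliced else np.empty((0, 4 * HISTORY), dtype=np.float32)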
Step B: design the similarity measure between the converted speech and the standard speech.
A method for calculating the similarity between the synthesized speech and the standard speech is designed. The standard speech used in the similarity calculation is not sampled real speech but the phoneme sequence obtained by decomposing the standard speech; likewise, the synthesized speech is not actual synthesized audio but the predicted phoneme sequence output by the model. By serializing the standard speech and the synthesized speech as phonemes, the speech synthesis problem is turned into the problem of predicting the phoneme at the current moment, and the similarity between the synthesized and the standard speech becomes the similarity between the standard phoneme sequence and the predicted phoneme sequence. Cross entropy is adopted as the measure of similarity between the two sequences: the larger the cross entropy, the lower the similarity.
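A minimal sketch of this cross-entropy similarity between the standard phoneme sequence and the predicted sequence, assuming 32 one-hot-encoded Mandarin phoneme classes (the class inventory itself is not enumerated here); the function names are illustrative.

    import numpy as np

    NUM_PHONEMES = 32  # basic Mandarin phonemes used throughout the description

    def one_hot(phoneme_ids, num_classes=NUM_PHONEMES):
        """Encode a standard phoneme sequence (integer ids) as one-hot vectors."""
        return np.eye(num_classes, dtype=np.float32)[phoneme_ids]

    def sequence_cross_entropy(standard_ids, predicted_probs):
        """Mean cross entropy between the standard phoneme sequence and the model's
        predicted probability vectors; a smaller value means higher similarity."""
        targets = one_hot(np.asarray(standard_ids))        # (T, 32)
        eps = 1e-12
        return float(-np.mean(np.sum(targets * np.log(predicted_probs + eps), axis=1)))

    # usage: three frames whose standard phonemes are ids 4, 4 and 17
    # probs = model_output            # shape (3, 32), each row sums to 1
    # loss = sequence_cross_entropy([4, 4, 17], probs)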
Step C: train the phoneme prediction model.
The application also provides a phoneme prediction model based on the LSTM (long short-term memory) network, a special kind of RNN. A traditional RNN updates its parameters with back-propagation through time (BPTT); when the time interval becomes long, the back-propagated gradient decays exponentially, causing the vanishing-gradient problem, so the network parameters update slowly and training is difficult to converge. The LSTM network was proposed to overcome this difficulty of traditional RNNs in retaining long-term memory.
First, a large corpus is prepared and phoneme sequences are extracted from it as standard data; electroglottogram data corresponding to the corpus are collected from several patients and converted into feature sequences as the model's training data. Combining the feature sequences generated from the electroglottogram signal with the LSTM network makes it possible both to train the prediction model and to use it to predict phonemes.
During training, the electroglottogram feature sequences corresponding to a batch of corpus sentences are fed into the LSTM network to obtain a predicted phoneme sequence; cross entropy is used as the loss function, and the model is optimized with back-propagation combined with a learning-rate-adaptive algorithm.
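A minimal PyTorch sketch of such an LSTM phoneme prediction model, assuming a 40-dimensional feature input and 32 output phoneme classes; the hidden size and number of layers are illustrative, as they are not specified above.

    import torch
    import torch.nn as nn

    class PhonemePredictor(nn.Module):
        """LSTM mapping a 40-dim electroglottogram feature sequence to a per-frame
        distribution over 32 Mandarin phonemes (hidden size and depth are assumptions)."""
        def __init__(self, feat_dim=40, hidden=128, num_phonemes=32):
            super().__init__()
            self.lstm = nn.LSTM(feat_dim, hidden, num_layers=2, batch_first=True)
            self.head = nn.Linear(hidden, num_phonemes)

        def forward(self, features):
            # features: (batch, time, 40) -> logits: (batch, time, 32)
            out, _ = self.lstm(features)
            return self.head(out)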
Step D: predict the current phoneme from the feature sequence converted from the electroglottogram using the trained model, and synthesize the speech.
The synthesizer used in practice is the Klatt formant speech synthesizer, which generates different speech sounds by controlling six formants. It uses a cascade branch to generate vowels and a parallel branch to generate consonants; by configuring the parameters of the parallel and cascade filters and the state of the voiced/unvoiced switch, the corresponding speech can be synthesized. The 32 basic phonemes of Mandarin Chinese and their Klatt synthesizer parameters are stored in advance as key-value pairs in a dictionary; given the prediction model's prediction of the current phoneme, the phoneme's configuration parameters are read directly from the dictionary to configure the Klatt synthesizer, yielding the real speech corresponding to that phoneme and realizing the conversion from the electroglottogram signal to a real speech signal.
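The following sketch illustrates the dictionary lookup and a cascade formant resonator of the kind used in Klatt-style synthesis. The KLATT_PARAMS table and its values are hypothetical placeholders, not the stored parameter dictionary of the embodiment, and only one entry is shown.

    import numpy as np

    # Hypothetical parameter table: phoneme -> formant settings (placeholder values).
    KLATT_PARAMS = {
        "a": {"formants_hz": [800, 1200, 2500, 3500, 4500],
              "bandwidths_hz": [80, 90, 120, 150, 200],
              "voiced": True},
    }

    def _resonator(x, f, bw, sr):
        """Second-order digital resonator, the building block of a cascade formant filter."""
        r = np.exp(-np.pi * bw / sr)
        theta = 2 * np.pi * f / sr
        gain = 1 - 2 * r * np.cos(theta) + r * r
        y = np.zeros_like(x, dtype=float)
        for n in range(len(x)):
            y[n] = gain * x[n] + 2 * r * np.cos(theta) * y[n - 1] - r * r * y[n - 2]
        return y

    def synthesize_phoneme(phoneme, f0_hz=120.0, duration_s=0.02, sr=8000):
        """Look up the predicted phoneme's parameters and drive a cascade of formant
        resonators with an impulse train (voiced) or noise (unvoiced)."""
        p = KLATT_PARAMS[phoneme]
        n = int(duration_s * sr)
        if p["voiced"]:
            src = np.zeros(n)
            src[::max(1, int(sr / f0_hz))] = 1.0   # glottal impulse train at f0
        else:
            src = np.random.randn(n) * 0.1         # noise source for unvoiced sounds
        y = src
        for f, bw in zip(p["formants_hz"], p["bandwidths_hz"]):
            y = _resonator(y, f, bw, sr)
        return y / (np.max(np.abs(y)) + 1e-9)

The parallel (consonant) branch and the voiced/unvoiced mixing of the full Klatt model are omitted here for brevity.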
The invention provides an LSTM-based electroglottography voice conversion method: features are first extracted from the electroglottogram and spliced, a similarity measure between the converted speech and the standard speech is designed, a phoneme prediction model is then trained, and finally the trained model predicts the current phoneme from the feature sequence converted from the electroglottogram and synthesizes speech. The invention extracts and splices the electroglottogram features and combines an LSTM network with the standard phoneme sequence obtained by decomposing standard speech data to obtain a prediction model that takes the electroglottogram feature sequence as input and outputs the predicted current phoneme, and, by designing a method for measuring the similarity between standard speech and converted speech as the loss function used to train the model, solves the difficulty of evaluating the prediction performance of the trained model.
Drawings
FIG. 1 is an overall flow chart of the LSTM-based electroglottography voice conversion method proposed by the invention;
FIG. 2 is a flow chart of converting the electroglottogram signal into a feature sequence according to the invention;
FIG. 3 is a flow chart of calculating the difference between the converted speech and the standard speech according to the invention;
FIG. 4 is a flow chart of the phoneme prediction model training process proposed by the invention;
FIG. 5 is a flow chart of synthesizing real speech from the predicted phonemes with the Klatt synthesizer according to the invention.
Detailed Description
As used in the specification and in the claims, certain terms are used to refer to particular components. As one skilled in the art will appreciate, manufacturers may refer to a component by different names. This specification and claims do not intend to distinguish between components that differ in name but not function. The following description is of the preferred embodiment for carrying out the invention, and is made for the purpose of illustrating the general principles of the invention and not for the purpose of limiting the scope of the invention. The scope of the present invention is defined by the appended claims.
The present invention will be described in further detail below with reference to the accompanying drawings, but the present invention is not limited thereto.
Examples
The invention provides an LSTM-based electroglottography voice conversion method comprising the steps of:
Step A: extracting features from the electroglottogram and splicing them;
Step B: designing the similarity between the converted speech and the standard speech;
Step C: training a phoneme prediction model;
Step D: predicting the current phoneme from the feature sequence converted from the electroglottogram using the trained model, and synthesizing the speech.
As shown in fig. 1, the LSTM-based electroglottography voice conversion method first extracts features from the electroglottogram and converts them into 40-dimensional feature sequences, which are input to the model; the phoneme sequences of the same time period serve as labels, and the cross entropy between the standard phoneme sequence and the predicted sequence serves as the loss function; training continues until the loss function converges, at which point the prediction model is trained. During electroglottogram-to-speech conversion, the electroglottogram is converted into a feature sequence and input to the prediction model, the model outputs the predicted phoneme, and the Klatt synthesizer configuration parameters for that phoneme are looked up in the dictionary to configure the Klatt synthesizer, thereby generating the real speech corresponding to the electroglottogram.
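Tying the sketches together, an illustrative end-to-end inference path might look as follows; it reuses the assumed helpers egg_to_feature_sequence, PhonemePredictor and synthesize_phoneme sketched earlier, and phoneme_names is an assumed list mapping model output indices to dictionary keys covering all 32 phonemes.

    import numpy as np
    import torch

    def egg_to_speech(egg_signal, model, phoneme_names, sr=8000):
        """Electroglottogram -> 40-dim feature sequence -> predicted phoneme per frame
        -> Klatt-style synthesis, reusing the illustrative helpers sketched above."""
        feats = egg_to_feature_sequence(egg_signal, sr)             # (T, 40)
        if len(feats) == 0:
            return np.zeros(0)
        with torch.no_grad():
            logits = model(torch.from_numpy(feats).unsqueeze(0))    # (1, T, 32)
        phoneme_ids = logits.argmax(dim=-1).squeeze(0).tolist()
        frames = []
        for t, idx in enumerate(phoneme_ids):
            f0 = float(feats[t, -4])                                # current frame's f0 feature
            frames.append(synthesize_phoneme(phoneme_names[idx],
                                             f0_hz=f0 if f0 > 0 else 120.0, sr=sr))
        return np.concatenate(frames)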
As shown in fig. 2, in the feature extraction and splicing of step A, the electroglottogram signal is first sampled at a rate of 8 kHz. The sampled electroglottogram data are divided into frames of 20 ms and filtered; the fundamental frequency, energy per unit time, frequency perturbation and amplitude perturbation of each frame are calculated, and the features of the frame are spliced with the results of the previous 9 frames to form the 40-dimensional feature vector of that frame.
As shown in fig. 3, the similarity between the converted speech and the standard speech in step B is calculated as follows. The standard speech is first converted into a standard Mandarin phoneme sequence, and each phoneme is one-hot encoded, i.e., converted into a 32-dimensional vector. The prediction model predicts the current phoneme as a 32-dimensional probability vector and outputs the phoneme with the highest probability. The cross entropy of these two vectors measures the similarity between the converted speech and the standard speech: the smaller the cross entropy, the higher the similarity of the two sequences and the better the prediction performance of the model.
As shown in fig. 4, the phoneme prediction model is trained in step C. The electroglottogram data and the corpus in the database are first converted into electroglottogram feature sequences and standard phoneme sequences; the feature sequences serve as the training input of the prediction model and the standard phoneme sequences as the training labels. The loss function is designed using the method described in step B. For parameter optimization, the model is trained in batches: 128 randomly selected sentences form one batch of training data, and the adaptive moment estimation (Adam) method is used to adapt the learning rate.
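A minimal PyTorch training loop matching the batch size and optimizer described above, assuming the PhonemePredictor sketched earlier and pre-built tensors of feature sequences and phoneme-id labels; sentence-level batching and other corpus details are simplified.

    import torch
    import torch.nn as nn
    from torch.utils.data import DataLoader, TensorDataset

    def train(features, labels, epochs=10):
        """features: (N, T, 40) float tensor of electroglottogram feature sequences;
        labels: (N, T) long tensor of standard phoneme ids. Batches of 128, Adam, cross entropy."""
        model = PhonemePredictor()
        opt = torch.optim.Adam(model.parameters(), lr=1e-3)
        loss_fn = nn.CrossEntropyLoss()
        loader = DataLoader(TensorDataset(features, labels), batch_size=128, shuffle=True)
        for epoch in range(epochs):
            total = 0.0
            for x, y in loader:
                logits = model(x)                                   # (B, T, 32)
                loss = loss_fn(logits.reshape(-1, 32), y.reshape(-1))
                opt.zero_grad()
                loss.backward()
                opt.step()
                total += loss.item()
            print(f"epoch {epoch}: mean loss {total / len(loader):.4f}")
        return model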
In step D, the trained model predicts the current phoneme from the feature sequence converted from the electroglottogram, and speech is synthesized from the prediction. Fig. 5 shows the flow of converting the phonemes predicted by the prediction model into real speech. The speech synthesis apparatus used in this application is the Klatt formant synthesizer, a hybrid speech synthesizer that can synthesize different sounds by configuring the parameters of its cascade and parallel filters. The 32 standard Mandarin basic phonemes and their Klatt formant synthesizer configuration parameters are stored in advance as key-value pairs in a dictionary. During electroglottogram-to-speech conversion, the configuration parameters are retrieved from the dictionary with the predicted phoneme as the key and used to configure the Klatt formant synthesizer, yielding the speech corresponding to that phoneme.
The invention provides an LSTM-based electroglottography voice conversion method: features are first extracted from the electroglottogram and spliced, a similarity measure between the converted speech and the standard speech is designed, a phoneme prediction model is then trained, and finally the trained model predicts the current phoneme from the feature sequence converted from the electroglottogram and synthesizes speech. The invention extracts and splices the electroglottogram features and combines an LSTM network with the standard phoneme sequence obtained by decomposing standard speech data to obtain a prediction model that takes the electroglottogram feature sequence as input and outputs the predicted current phoneme, and, by designing a method for measuring the similarity between standard speech and converted speech as the loss function used to train the model, solves the difficulty of evaluating the prediction performance of the trained model.
While the foregoing shows and describes the preferred embodiments of the present invention, it should be understood that the invention is not limited to the forms disclosed herein; the above description is not intended to exclude other embodiments, and the invention may be used in various other combinations, modifications and environments, and may be changed within the scope of the inventive concept described herein in light of the above teachings or the skill or knowledge of the relevant art. Modifications and variations made by those skilled in the art without departing from the spirit and scope of the invention shall fall within the scope of the appended claims.

Claims (5)

1. An LSTM-based electroglottography voice conversion method, comprising the steps of:
a: extracting characteristics of the electroglottography and splicing;
b: designing the similarity of the converted voice and the standard voice;
c: training a phoneme prediction model;
d: pre-encoding current phonemes using trained models and feature sequences transformed from electroglottography
And then the voice is synthesized.
2. An LSTM-based electroglottography voice conversion method as claimed in claim 1, wherein in step A, extracting and splicing the electroglottogram features comprises framing the electroglottogram with a length of 20 ms, extracting features of each frame including but not limited to fundamental frequency, energy per unit time, frequency perturbation and amplitude perturbation, and splicing the features extracted from every ten adjacent frames, thereby converting the electroglottogram into a feature sequence.
3. An LSTM-based electroglottography voice conversion method as claimed in claim 2, wherein in step B, designing the similarity between the converted speech and the standard speech comprises converting the standard speech into a standard Mandarin Chinese phoneme sequence and calculating its similarity with the output sequence of the prediction model using cross entropy.
4. An LSTM-based electroglottography voice conversion method according to any one of claims 1-3, wherein in step C, training the phoneme prediction model comprises using an LSTM network as the prediction model, with the electroglottogram feature sequence as the training input, the phoneme prediction sequence as the model output and the standard phoneme sequence as the training label, and training the prediction model with cross entropy as the loss function.
5. An LSTM-based electroglottography voice conversion method according to claim 4, wherein in step D, the trained model predicts the current phoneme from the feature sequence converted from the electroglottogram and speech is synthesized from the prediction: the electroglottogram is first converted into a feature sequence and input to the prediction model, the model outputs the predicted phoneme, the parameters corresponding to that phoneme are retrieved from a dictionary storing the 32 basic Mandarin phonemes and their synthesis parameters, and a Klatt formant speech synthesis model is configured accordingly, thereby realizing the conversion from the electroglottogram to speech.
CN201911065541.1A 2019-11-04 2019-11-04 Electroglottography voice conversion method based on LSTM Active CN110808026B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911065541.1A CN110808026B (en) 2019-11-04 2019-11-04 Electroglottography voice conversion method based on LSTM

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911065541.1A CN110808026B (en) 2019-11-04 2019-11-04 Electroglottography voice conversion method based on LSTM

Publications (2)

Publication Number Publication Date
CN110808026A CN110808026A (en) 2020-02-18
CN110808026B true CN110808026B (en) 2022-08-23

Family

ID=69501069

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911065541.1A Active CN110808026B (en) 2019-11-04 2019-11-04 Electroglottography voice conversion method based on LSTM

Country Status (1)

Country Link
CN (1) CN110808026B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112069816A (en) * 2020-09-14 2020-12-11 深圳市北科瑞声科技股份有限公司 Chinese punctuation adding method, system and equipment
CN113409809B (en) * 2021-07-07 2023-04-07 上海新氦类脑智能科技有限公司 Voice noise reduction method, device and equipment


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8744854B1 (en) * 2012-09-24 2014-06-03 Chengjun Julian Chen System and method for voice transformation

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104538024A (en) * 2014-12-01 2015-04-22 百度在线网络技术(北京)有限公司 Speech synthesis method, apparatus and equipment
CN106057192A (en) * 2016-07-07 2016-10-26 Tcl集团股份有限公司 Real-time voice conversion method and apparatus
WO2018209556A1 (en) * 2017-05-16 2018-11-22 Beijing Didi Infinity Technology And Development Co., Ltd. System and method for speech synthesis
CN108766413A (en) * 2018-05-25 2018-11-06 北京云知声信息技术有限公司 Phoneme synthesizing method and system
CN108836574A (en) * 2018-06-20 2018-11-20 广州智能装备研究院有限公司 It is a kind of to utilize neck vibrator work intelligent sounding system and its vocal technique
CN108831463A (en) * 2018-06-28 2018-11-16 广州华多网络科技有限公司 Lip reading synthetic method, device, electronic equipment and storage medium
CN109599092A (en) * 2018-12-21 2019-04-09 秒针信息技术有限公司 A kind of audio synthetic method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Hu Qiong et al., "Research on obtaining glottal waves of high naturalness using inverse filtering and the phase plane," Audio Engineering (《电声技术》), 2011, No. 05, pp. 59-63, 73. *
Chen Lijiang et al., "Research on speech synthesis combined with the electroglottogram," Proceedings of the 12th National Conference on Man-Machine Speech Communication (《第十二届全国人机语音通讯学术会议》), 2013. *

Also Published As

Publication number Publication date
CN110808026A (en) 2020-02-18

Similar Documents

Publication Publication Date Title
CN111489734B (en) Model training method and device based on multiple speakers
Yu et al. DurIAN: Duration Informed Attention Network for Speech Synthesis.
JP7464621B2 (en) Speech synthesis method, device, and computer-readable storage medium
CN112037754B (en) Method for generating speech synthesis training data and related equipment
Tokuda et al. Speech synthesis based on hidden Markov models
CN109147758A (en) A kind of speaker's sound converting method and device
CN111179905A (en) Rapid dubbing generation method and device
JP2000504849A (en) Speech coding, reconstruction and recognition using acoustics and electromagnetic waves
CN106971709A (en) Statistic parameter model method for building up and device, phoneme synthesizing method and device
Yin et al. Modeling F0 trajectories in hierarchically structured deep neural networks
CN110808026B (en) Electroglottography voice conversion method based on LSTM
CN113450761A (en) Parallel speech synthesis method and device based on variational self-encoder
Reddy et al. Inverse filter based excitation model for HMM‐based speech synthesis system
JPH1165590A (en) Voice recognition dialing device
Prasad et al. Backend tools for speech synthesis in speech processing
JPWO2010104040A1 (en) Speech synthesis apparatus, speech synthesis method and speech synthesis program based on one model speech recognition synthesis
CN114708876A (en) Audio processing method and device, electronic equipment and storage medium
US11670292B2 (en) Electronic device, method and computer program
CN114203151A (en) Method, device and equipment for training speech synthesis model
Takaki et al. Overview of NIT HMM-based speech synthesis system for Blizzard Challenge 2012
Eshghi et al. An Investigation of Features for Fundamental Frequency Pattern Prediction in Electrolaryngeal Speech Enhancement
JAIN Advanced Feature Extraction and Its Implementation in Speech Recognition System
Vargas et al. Cascade prediction filters with adaptive zeros to track the time-varying resonances of the vocal tract
Dalva Automatic speech recognition system for Turkish spoken language
CN116168687B (en) Voice data processing method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant