CN110808026B - Electroglottography voice conversion method based on LSTM - Google Patents
- Publication number
- CN110808026B (application CN201911065541.1A)
- Authority
- CN
- China
- Prior art keywords
- voice
- phoneme
- model
- standard
- lstm
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
Abstract
The invention provides an LSTM-based electroglottograph (EGG) voice conversion method. The method first extracts features from the electroglottograph and splices them, then designs a similarity measure between the converted speech and standard speech, then trains a phoneme prediction model, and finally uses the trained model to predict the current phoneme from the feature sequence converted from the electroglottograph and to synthesize speech. The invention extracts and splices electroglottograph features and combines an LSTM network with the standard phoneme sequence obtained by decomposing standard speech data, yielding a prediction model that takes the electroglottograph feature sequence as input and outputs a prediction of the current phoneme. A calculation method that measures the similarity of standard speech and converted speech is designed as the loss function used to train the model, solving the problem that the model's prediction performance is otherwise difficult to evaluate. A Klatt formant speech synthesizer with configured formant filters is used to obtain real speech.
Description
Technical Field
The invention provides an LSTM-based electroglottograph voice conversion method, which predicts the speech to be synthesized at the current moment from electroglottograph data at the current and past moments, and belongs to the field of computing.
Background
An electroglottograph (EGG) records the vocal cord movement of the larynx during speech, collected by two electrodes placed on the neck over the larynx. It is highly correlated with the speech a person produces, and features extracted from the vocal cord movement can be used to recover the corresponding speech information.
Formant speech synthesis is currently one of the more mature speech synthesis techniques. It exploits the resonance of the vocal tract in response to the speech excitation: extracting each formant frequency of the vocal tract and its bandwidth as parameters yields a formant filter. By configuring the parameters of the formant filters, different voices can be controlled and synthesized.
In practice, many patients have difficulty producing sound for various reasons even though their vocal cords can still vibrate. If speech can be synthesized from the patient's electroglottograph, it can greatly help the patient regain the ability to communicate.
Disclosure of Invention
To recover speech data from electroglottograph data, the invention proposes an LSTM-based electroglottograph voice conversion method.
The invention provides an LSTM-based electroglottograph voice conversion method comprising the following steps.
Step A: extract features from the electroglottograph and splice them.
The electroglottograph detects the closing and opening of the vocal cords by measuring the impedance across them as they vibrate; it reflects the regularity of vocal cord vibration and contains rich speech-related features. To enable speech prediction, the fundamental frequency, energy per unit time, frequency perturbation (jitter), and amplitude perturbation (shimmer) of the electroglottograph signal are extracted as training features. The electroglottograph signal is a one-dimensional signal over time. It is divided into frames 20 ms long; the fundamental frequency, energy per unit time, jitter, and shimmer of each frame are calculated and then spliced with the features calculated for the preceding 9 frames, converting the electroglottograph signal into a sequence of 40-dimensional feature vectors.
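The framing and splicing scheme above can be sketched as follows. The 20 ms frame length, the four per-frame features, and the 10-frame (current plus previous 9) context come from the text; the autocorrelation-based F0 estimate is an assumption, and the jitter/shimmer features are left as stubs, since the patent does not specify those estimators.

```python
import numpy as np

def frame_signal(x, fs=8000, frame_ms=20):
    """Split a 1-D EGG signal into non-overlapping 20 ms frames."""
    n = int(fs * frame_ms / 1000)            # 160 samples per frame at 8 kHz
    n_frames = len(x) // n
    return x[:n_frames * n].reshape(n_frames, n)

def frame_features(frame, fs=8000):
    """Toy 4-dim feature vector: F0 (autocorrelation peak between 50 and
    400 Hz, an assumption), energy per unit time, and jitter/shimmer
    placeholders (stubbed out)."""
    energy = float(np.mean(frame ** 2))
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = fs // 400, fs // 50             # candidate pitch lags
    f0 = fs / (lo + int(np.argmax(ac[lo:hi]))) if hi <= len(ac) else 0.0
    return np.array([f0, energy, 0.0, 0.0])  # jitter/shimmer left as stubs

def egg_to_sequence(x, fs=8000, context=10):
    """Splice each frame's features with those of the previous 9 frames,
    giving the 40-dimensional vectors described in step A."""
    feats = np.array([frame_features(f, fs) for f in frame_signal(x, fs)])
    out = []
    for i in range(context - 1, len(feats)):
        out.append(feats[i - context + 1:i + 1].reshape(-1))  # 10 × 4 = 40
    return np.array(out)
```

A one-second 8 kHz signal yields 50 frames and therefore 41 spliced 40-dimensional vectors, since the first 9 frames lack a full context window.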
Step B: design the similarity measure between the converted speech and the standard speech.
A method is designed for calculating the similarity between the synthesized speech and the standard speech. The standard speech used in this calculation is not sampled data of real speech but the phoneme sequence obtained by decomposing the standard speech; likewise, the synthesized speech is not actual synthesized audio but the predicted phoneme sequence output by the model. By serializing both the standard and synthesized speech as phonemes, the speech synthesis problem becomes the problem of predicting the phoneme at the current time, and the similarity calculation between synthesized and standard speech becomes a similarity calculation between the standard phoneme sequence and the predicted phoneme sequence. Cross entropy is adopted as the similarity measure for the two sequences: the larger the cross entropy, the lower the similarity.
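Under the phoneme-sequence formulation above, the cross-entropy similarity can be sketched as follows. This is a minimal stdlib sketch; the 32-phoneme inventory size is taken from the embodiment, and representing each standard phoneme by its integer id (the index of the one-hot "1") is an implementation assumption.

```python
import math

N_PHONEMES = 32   # basic Mandarin phoneme inventory, per the embodiment

def sequence_cross_entropy(standard_ids, predicted_probs, eps=1e-12):
    """Mean cross entropy between a standard phoneme sequence (integer ids)
    and the model's predicted probability vectors.
    Larger cross entropy means lower similarity."""
    total = 0.0
    for idx, p in zip(standard_ids, predicted_probs):
        total -= math.log(p[idx] + eps)   # eps guards against log(0)
    return total / len(standard_ids)
```

A uniform prediction over 32 phonemes gives the maximum-entropy value log 32 ≈ 3.47, while a perfect prediction gives a cross entropy of essentially zero.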
Step C: train the phoneme prediction model.
The application also provides a phoneme prediction model based on LSTM (Long Short-Term Memory), a special kind of RNN. In a traditional RNN, model parameters are updated with back-propagation through time (BPTT); when the time interval grows long, the residual propagated backwards decays exponentially. This vanishing-gradient problem makes network parameters update slowly and convergence difficult. The LSTM network was proposed to overcome the difficulty conventional RNNs have in realizing long-term memory.
First, a large corpus is prepared and phoneme sequences are extracted from it as standard data; electroglottograph data corresponding to the corpus are collected from a number of patients and converted into feature sequences as the model's training data. Combining the feature sequences generated from the electroglottograph signal with the LSTM network enables both the training of the prediction model and the prediction of phonemes by it.
During training, the electroglottograph feature sequences corresponding to a batch of corpus sentences are fed into the LSTM network to obtain a phoneme prediction sequence; cross entropy is used as the loss function, and the model is optimized with back-propagation combined with an adaptive learning-rate algorithm.
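The optimisation loop described here can be sketched as follows. To keep the sketch self-contained, the LSTM is replaced by a single softmax layer over the 40-dimensional feature vector — a stand-in, not the patent's model — while the cross-entropy loss and the Adam-style adaptive learning-rate update mirror the description; all hyperparameter values are assumptions.

```python
import numpy as np

D_IN, N_PHONEMES = 40, 32          # 40-dim EGG features, 32 Mandarin phonemes

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Model parameters and Adam moment buffers (the LSTM stand-in).
W = np.zeros((D_IN, N_PHONEMES))
m = np.zeros_like(W)
v = np.zeros_like(W)

def train_batch(X, y, t, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    """One step: forward pass, cross-entropy loss, Adam update.
    X: (batch, 40) features; y: (batch,) phoneme ids; t: step counter.
    Returns the mean batch loss."""
    global W, m, v
    P = softmax(X @ W)                        # predicted phoneme probabilities
    Y = np.eye(N_PHONEMES)[y]                 # one-hot labels
    G = X.T @ (P - Y) / len(y)                # gradient of the cross entropy
    m = b1 * m + (1 - b1) * G                 # first-moment estimate
    v = b2 * v + (1 - b2) * G * G             # second-moment estimate
    m_hat = m / (1 - b1 ** t)                 # bias correction
    v_hat = v / (1 - b2 ** t)
    W -= lr * m_hat / (np.sqrt(v_hat) + eps)  # Adam update
    return float(-np.log(P[np.arange(len(y)), y] + 1e-12).mean())
```

On synthetic linearly separable data, the loss starts at the uniform value log 32 and decreases steadily over a few hundred steps.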
Step D: use the trained model to predict the current phoneme from the feature sequence converted from the electroglottograph, and synthesize speech.
The speech synthesizer used in practice is the Klatt formant synthesizer, which generates various speech sounds by controlling six formants. It uses a cascade branch to generate vowels and a parallel branch to generate consonants; by configuring the parameters of the parallel and cascade filters in the Klatt synthesizer and the state of the voiced/unvoiced switch, the corresponding speech can be synthesized. The 32 basic phonemes of Mandarin Chinese and their corresponding Klatt synthesizer parameters are stored in advance as key-value pairs in a dictionary; according to the prediction model's prediction of the current phoneme, the configuration parameters for that phoneme are read directly from the dictionary to configure the Klatt synthesizer, yielding the real speech corresponding to the phoneme and realizing the conversion from an electroglottograph signal to a real speech signal.
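The dictionary lookup in step D can be sketched as follows. The phoneme keys and all parameter values below are purely illustrative placeholders — the patent specifies only that 32 Mandarin phonemes map to Klatt parameters — and the Klatt synthesizer itself is not shown.

```python
# Hypothetical per-phoneme Klatt parameters: six formant frequencies (Hz),
# their bandwidths (Hz), and a voiced/unvoiced switch. Real values would be
# measured per phoneme; these two entries are illustrative only.
KLATT_DICT = {
    "a": {"formants": [800, 1200, 2500, 3500, 4500, 5500],
          "bandwidths": [80, 90, 120, 130, 140, 150], "voiced": True},
    "s": {"formants": [0, 1700, 2700, 3500, 4500, 5500],
          "bandwidths": [200, 200, 200, 200, 200, 200], "voiced": False},
}

def synthesize(phoneme_sequence):
    """Look up each predicted phoneme's synthesizer configuration,
    to be handed to the (not shown) Klatt formant synthesizer."""
    configs = []
    for ph in phoneme_sequence:
        cfg = KLATT_DICT.get(ph)
        if cfg is None:
            raise KeyError(f"no Klatt parameters stored for phoneme {ph!r}")
        configs.append(cfg)
    return configs
```

In the full system the dictionary would hold all 32 phonemes, and each retrieved configuration would drive the cascade/parallel filter banks and the voiced/unvoiced switch described above.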
The invention provides an LSTM-based electroglottograph voice conversion method. The method first extracts features from the electroglottograph and splices them, then designs a similarity measure between the converted speech and standard speech, then trains a phoneme prediction model, and finally uses the trained model to predict the current phoneme from the feature sequence converted from the electroglottograph and to synthesize speech. The invention extracts and splices electroglottograph features and combines an LSTM network with the standard phoneme sequence obtained by decomposing standard speech data, yielding a prediction model that takes the electroglottograph feature sequence as input and outputs a prediction of the current phoneme. A calculation method that measures the similarity of standard speech and converted speech is designed as the loss function used to train the model, solving the problem that the model's prediction performance is otherwise difficult to evaluate.
Drawings
FIG. 1 is the overall flow chart of the LSTM-based electroglottograph voice conversion method proposed by the present invention;
FIG. 2 is a flow chart of the conversion of the electroglottograph signal into a feature sequence according to the present invention;
FIG. 3 is a flowchart of the difference calculation between the converted speech and the standard speech according to the present invention;
FIG. 4 is a flow chart of a phoneme prediction model training process proposed by the present invention;
FIG. 5 is a flow chart of real speech synthesis based on the predicted phonemes and the Klatt synthesizer according to the present invention.
Detailed Description
As used in the specification and in the claims, certain terms are used to refer to particular components. As one skilled in the art will appreciate, manufacturers may refer to a component by different names. This specification and claims do not intend to distinguish between components that differ in name but not function. The following description is of the preferred embodiment for carrying out the invention, and is made for the purpose of illustrating the general principles of the invention and not for the purpose of limiting the scope of the invention. The scope of the present invention is defined by the appended claims.
The present invention will be described in further detail below with reference to the accompanying drawings, but the present invention is not limited thereto.
Examples
The invention provides an LSTM-based electroglottograph voice conversion method comprising the following steps.
Step A: extract features from the electroglottograph and splice them;
Step B: design the similarity measure between the converted speech and the standard speech;
Step C: train the phoneme prediction model;
Step D: use the trained model to predict the current phoneme from the feature sequence converted from the electroglottograph, and synthesize speech.
As shown in fig. 1, the LSTM-based electroglottograph voice conversion method first extracts features from the electroglottograph and converts them into 40-dimensional feature sequences, which are input into the model; the phoneme sequences for the same time period serve as labels, and the cross entropy between the standard phoneme sequence and the predicted sequence serves as the loss function. Training continues until the model's loss function converges, which completes the training of the prediction model. During electroglottograph-to-speech conversion, the electroglottograph is converted into a feature sequence and fed to the prediction model, which outputs predicted phonemes; the Klatt synthesizer configuration parameters corresponding to each phoneme are looked up in the dictionary to configure the Klatt synthesizer, generating the real speech corresponding to the electroglottograph.
As shown in fig. 2, in the feature extraction and splicing of step A, the electroglottograph signal is first sampled at a rate of 8 kHz. The sampled data are divided into 20 ms frames and filtered; the fundamental frequency, energy per unit time, frequency perturbation, and amplitude perturbation of each frame are calculated, and the frame's features are spliced with the results of the previous 9 frames to form the 40-dimensional feature vector for that frame.
As shown in fig. 3, the similarity between the converted speech and the standard speech in step B is calculated as follows. The standard speech is first converted into a standard Mandarin phoneme sequence, and each phoneme is one-hot encoded, i.e., converted into a 32-dimensional vector. The prediction model predicts the current phoneme as a 32-dimensional probability vector and outputs the phoneme with the highest probability. The cross entropy of the two vectors measures the similarity between the converted and standard speech: the smaller the cross entropy, the higher the similarity of the two sequences and the better the model's prediction.
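The one-hot encoding and highest-probability decoding described here can be sketched as follows; the placeholder phoneme names are assumptions, since the embodiment does not list the 32-phoneme inventory.

```python
import numpy as np

PHONEMES = [f"ph{i:02d}" for i in range(32)]   # placeholder 32-phoneme inventory

def one_hot(phoneme):
    """Encode a phoneme as a 32-dimensional one-hot vector."""
    v = np.zeros(len(PHONEMES))
    v[PHONEMES.index(phoneme)] = 1.0
    return v

def decode(prob_vec):
    """Return the phoneme with the highest predicted probability,
    as output by the prediction model."""
    return PHONEMES[int(np.argmax(prob_vec))]
```

Decoding the one-hot vector of any phoneme recovers that phoneme, and decoding a model's 32-dimensional probability vector picks its argmax.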
As shown in fig. 4, the phoneme prediction model is trained in step C. First, the electroglottograph data and the corpus in the database are converted into electroglottograph feature sequences and standard phoneme sequences; the feature sequences serve as the prediction model's training input and the standard phoneme sequences as the training labels. The loss function is designed using the method described in step B. For parameter optimization, the model is trained in batches, with 128 randomly selected sentences forming one batch, and the learning rate is updated with adaptive moment estimation (Adam).
In step D, the trained model predicts the current phoneme from the feature sequence converted from the electroglottograph, and speech is synthesized from the prediction. Fig. 5 shows the flow of converting the phonemes predicted by the prediction model into real speech. The speech synthesis apparatus used in this application is the Klatt formant synthesizer, a hybrid synthesizer that can synthesize different voices by configuring the parameters of its cascade and parallel filters. The method stores the 32 standard Mandarin Chinese basic phonemes and the corresponding Klatt formant synthesizer configuration parameters in advance as key-value pairs in a dictionary. During electroglottograph voice conversion, the configuration parameters are retrieved from the dictionary with the predicted phoneme as the key and used to configure the Klatt formant synthesizer, yielding the speech corresponding to the phoneme.
While the foregoing shows and describes preferred embodiments of the present invention, it is to be understood that the invention is not limited to the forms disclosed herein; they are not to be taken as excluding other embodiments, and the invention may be used in various other combinations, modifications, and environments and is capable of changes within the scope of the inventive concept described herein, in line with the above teachings or the skill or knowledge of the relevant art. Modifications and variations effected by those skilled in the art without departing from the spirit and scope of the invention fall within the protection of the appended claims.
Claims (5)
1. An LSTM-based electroglottograph voice conversion method, comprising the steps of:
a: extracting features from the electroglottograph and splicing them;
b: designing the similarity between the converted speech and the standard speech;
c: training a phoneme prediction model;
d: predicting the current phoneme using the trained model and the feature sequence converted from the electroglottograph, and then synthesizing speech.
2. The LSTM-based electroglottograph voice conversion method of claim 1, wherein the feature extraction and splicing of step A comprises dividing the electroglottograph into frames 20 ms in length; extracting features of each frame including but not limited to its fundamental frequency, energy per unit time, frequency perturbation, and amplitude perturbation; splicing the features extracted from every ten adjacent frames; and thereby converting the electroglottograph into a feature sequence.
3. The LSTM-based electroglottograph voice conversion method of claim 2, wherein the similarity design of step B comprises converting the standard speech into a standard Mandarin Chinese phoneme sequence and calculating its similarity with the prediction model's output sequence using cross entropy.
4. The LSTM-based electroglottograph voice conversion method of any one of claims 1-3, wherein the training of step C comprises using an LSTM network as the prediction model, with the electroglottograph feature sequence as training input, a phoneme prediction sequence as model output, the standard phoneme sequence as training labels, and cross entropy as the loss function for training the prediction model.
5. The LSTM-based electroglottograph voice conversion method of claim 4, wherein step D comprises first converting the electroglottograph into a feature sequence and inputting it to the prediction model, which outputs the predicted phoneme; retrieving the phoneme's corresponding parameters from a dictionary storing the 32 basic Chinese phonemes and their synthesis parameters; and configuring a Klatt formant speech synthesis model, realizing the conversion from electroglottograph to speech.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911065541.1A CN110808026B (en) | 2019-11-04 | 2019-11-04 | Electroglottography voice conversion method based on LSTM |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110808026A CN110808026A (en) | 2020-02-18 |
CN110808026B true CN110808026B (en) | 2022-08-23 |
Family
ID=69501069
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911065541.1A Active CN110808026B (en) | 2019-11-04 | 2019-11-04 | Electroglottography voice conversion method based on LSTM |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110808026B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112069816A (en) * | 2020-09-14 | 2020-12-11 | 深圳市北科瑞声科技股份有限公司 | Chinese punctuation adding method, system and equipment |
CN113409809B (en) * | 2021-07-07 | 2023-04-07 | 上海新氦类脑智能科技有限公司 | Voice noise reduction method, device and equipment |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104538024A (en) * | 2014-12-01 | 2015-04-22 | 百度在线网络技术(北京)有限公司 | Speech synthesis method, apparatus and equipment |
CN106057192A (en) * | 2016-07-07 | 2016-10-26 | Tcl集团股份有限公司 | Real-time voice conversion method and apparatus |
CN108766413A (en) * | 2018-05-25 | 2018-11-06 | 北京云知声信息技术有限公司 | Phoneme synthesizing method and system |
CN108831463A (en) * | 2018-06-28 | 2018-11-16 | 广州华多网络科技有限公司 | Lip reading synthetic method, device, electronic equipment and storage medium |
CN108836574A (en) * | 2018-06-20 | 2018-11-20 | 广州智能装备研究院有限公司 | It is a kind of to utilize neck vibrator work intelligent sounding system and its vocal technique |
WO2018209556A1 (en) * | 2017-05-16 | 2018-11-22 | Beijing Didi Infinity Technology And Development Co., Ltd. | System and method for speech synthesis |
CN109599092A (en) * | 2018-12-21 | 2019-04-09 | 秒针信息技术有限公司 | A kind of audio synthetic method and device |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8744854B1 (en) * | 2012-09-24 | 2014-06-03 | Chengjun Julian Chen | System and method for voice transformation |
- 2019-11-04: application CN201911065541.1A filed in China; patent CN110808026B active
Non-Patent Citations (2)
Title |
---|
Hu Qiong et al., "Research on obtaining highly natural glottal waves using inverse filtering and the phase plane," Audio Engineering (《电声技术》), 2011, no. 5, pp. 59-63, 73. *
Chen Lijiang et al., "Research on speech synthesis combined with the electroglottograph," Proceedings of the 12th National Conference on Man-Machine Speech Communication, 2013. *
Also Published As
Publication number | Publication date |
---|---|
CN110808026A (en) | 2020-02-18 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||