CN110767210A - Method and device for generating personalized voice - Google Patents
- Publication number
- CN110767210A (application CN201911046823.7A)
- Authority
- CN
- China
- Prior art keywords
- model
- voice
- personalized
- training
- vocoder
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/04—Details of speech synthesis systems, e.g. synthesiser structure or memory management
- G10L13/047—Architecture of speech synthesisers
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Electrically Operated Instructional Devices (AREA)
Abstract
The invention discloses a method and a device for generating personalized speech. The feature vector of the target voice is combined with the text feature vector, and the end-to-end text-feature-to-audio-feature unit performs adaptive learning on the trained hybrid end-to-end model, which is equivalent to adaptive learning on the input closest to the target voice characteristics. Through the personalized vocoder unit, the loss of vocoder synthesis is reduced and the naturalness of the synthesized speech is improved.
Description
Technical Field
The invention relates to the technical field of voice personalization, in particular to a method and a device for generating personalized voice.
Background
With the development of the smart home, voice personalization technology is applied in more and more fields. The development of voice broadcasting technology greatly facilitates people's lives and improves quality of life. Most existing voice personalization technologies extract voiceprint features from parallel corpora of the personalization target and the source speaker and then apply a matrix transformation; such methods, for example those based on DTW, demand large amounts of voice corpus data and are time-consuming.
Disclosure of Invention
In view of the above, the invention provides a small-corpus personalization method and device based on speaker characteristics, solving the problems that existing voice personalization algorithms require clean voice data and that training takes a long time.
The invention solves the problems through the following technical scheme: a method of generating personalized speech, the method comprising the steps of:
step a, collecting target sample speech and large-scale sample speech, and extracting the sample acoustic features corresponding to both;
step b, training a speech feature extraction model with the sample acoustic features of both corpora to generate the corresponding sample voice feature vectors;
step c, training a hybrid end-to-end model from text features to acoustic features using the voice feature vectors of the large-scale sample speech together with the texts corresponding to it;
step d, inputting the acoustic features generated by the hybrid end-to-end model into a neural network vocoder model, which outputs audio codes, and training a vocoder average model;
step e, performing adaptive model training on top of the hybrid end-to-end model using the voice feature vector of the target sample speech and its corresponding text, yielding a personalized end-to-end model;
step f, generating the target's acoustic features with the personalized end-to-end model and adaptively training the vocoder average model, yielding a personalized vocoder model;
step g, in the synthesis stage, combining the feature vector of the desired text with the target's voice feature vector as input, obtaining the target's acoustic features through the personalized end-to-end model, and outputting the desired target speech via the personalized vocoder model.
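As an informal illustration of the data flow in steps a-g, the pipeline can be sketched with linear maps standing in for the trained models; all dimensions, weights, and function names below are hypothetical, not taken from the patent:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions -- not specified in the patent.
TEXT_DIM, SPK_DIM, ACOUSTIC_DIM = 64, 32, 80

# Steps c/e: the (personalized) end-to-end model is approximated here by one
# linear map from [text feature ; speaker feature] to acoustic features.
W_e2e = rng.standard_normal((TEXT_DIM + SPK_DIM, ACOUSTIC_DIM)) * 0.01

# Steps d/f: the (personalized) vocoder is approximated by a linear map from
# acoustic features to audio sample codes.
W_voc = rng.standard_normal((ACOUSTIC_DIM, 1)) * 0.01

def synthesize(text_vecs, spk_vec):
    """Step g: combine the text features with the target speaker vector,
    run the end-to-end model, then the vocoder."""
    joint = np.concatenate(
        [text_vecs, np.tile(spk_vec, (len(text_vecs), 1))], axis=1)
    acoustic = joint @ W_e2e             # personalized end-to-end model
    audio = (acoustic @ W_voc).ravel()   # personalized vocoder model
    return audio

text = rng.standard_normal((10, TEXT_DIM))  # 10 text feature frames
spk = rng.standard_normal(SPK_DIM)          # target voice feature vector
print(synthesize(text, spk).shape)          # (10,)
```

The real models are deep networks trained as in steps b-f; the sketch only shows how the intermediate representations are combined and passed along.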
Preferably, in step b, the speech feature extraction model inputs the sample acoustic features obtained in step a into a deep speech recognition model and then trains them with a deep learning network to obtain the sample voice feature vectors corresponding to the different sample acoustic features.
Preferably, the deep learning network comprises: a convolutional neural network for extracting features from the audio of the original speech; a weight calculation network for processing the convolutional information to obtain a weight for each convolutional feature and discarding the speech signal with the minimum weight; the original signal is turned into K feature vectors, which are matrix-multiplied with the obtained weights to yield the dimensionality-reduced features; and, in combination with the loss function, the corresponding N-dimensional target voice features are obtained.
Preferably, in step c, an end-to-end neural network is adopted: the feature vectors of the text and of the acoustic features are combined, a limited-range attention mechanism is used inside the end-to-end network, and the combined features are decoded with the weights obtained from the attention mechanism to produce the acoustic features at the output end.
Preferably, in step f, a recurrent neural network is used to predict the coded values of the audio from the target's acoustic features, and a personalized vocoder model is trained in combination with the output target audio; during training, the generated acoustic features are blurred and a small amount of interference spectrum is inserted.
Further, the present invention also provides an apparatus for generating personalized speech that uses the foregoing method. The apparatus comprises: a voice acquisition and extraction unit for collecting target sample speech and large-scale sample speech and extracting their acoustic features; a speaker audio feature extraction unit for training the speech feature extraction model and extracting voice feature vectors; an end-to-end text-feature-to-audio-feature unit for training the hybrid end-to-end model from text features to acoustic features; a vocoder unit for training the vocoder average model; a personalized end-to-end text-feature-to-audio-feature unit for training the personalized end-to-end model from target text features to acoustic features; a personalized vocoder unit for training the personalized vocoder model; and a speech synthesis unit, connected to the personalized vocoder unit, for producing the personalized speech.
The beneficial effect of the invention is that it can be applied to, but is not limited to, the field of speech personalization.
Drawings
FIG. 1 is a block diagram of a process for training models in generating personalized speech according to an embodiment of the present invention;
FIG. 2 is a main framework for generating a personalized voice network according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a voice feature extraction network according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of an end-to-end network provided by an embodiment of the present invention;
fig. 5 is a block diagram of an apparatus for generating personalized speech according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to examples, but the embodiments of the present invention are not limited thereto.
In a first embodiment, referring to fig. 1 and 2, the present invention provides a method for generating personalized speech, the method comprising the steps of:
Step a: collect target sample speech and large-scale sample speech, and extract the corresponding sample acoustic features. Speech from multiple speakers is recorded in a studio as the large-scale sample speech, preferably at a sampling rate of 16000 Hz or higher, and the collected target sample speech should, as far as possible, cover all combinations of Chinese initials, finals, and tones that form Chinese phonemes. The extracted sample acoustic features include Mel features, linear prediction coefficient features, and the like. The Mel features are extracted with a windowed, framed Fourier transform, which converts the time domain into the frequency domain; when the audio features are extracted, the Mel features have 40 to 80 dimensions, and the linear-prediction-coefficient input is limited to N-scale cepstrum coefficients and M pitch parameters (such as period and correlation).
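The windowed, framed Fourier transform and mel warping described above can be sketched as follows; the frame length, hop size, and filterbank construction are common defaults, not values fixed by the patent:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    # Triangular filters evenly spaced on the mel scale, 0 Hz .. Nyquist.
    pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        lo, c, hi = bins[i], bins[i + 1], bins[i + 2]
        for k in range(lo, c):
            fb[i, k] = (k - lo) / max(c - lo, 1)   # rising slope
        for k in range(c, hi):
            fb[i, k] = (hi - k) / max(hi - c, 1)   # falling slope
    return fb

def log_mel(wave, sr=16000, n_fft=512, hop=256, n_mels=80):
    # Windowed, framed Fourier transform (Hann window), then mel warping.
    n_frames = 1 + (len(wave) - n_fft) // hop
    win = np.hanning(n_fft)
    frames = np.stack(
        [wave[i * hop:i * hop + n_fft] * win for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames, axis=1)) ** 2   # power spectrogram
    return np.log(spec @ mel_filterbank(n_mels, n_fft, sr).T + 1e-10)

sr = 16000
wave = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)  # 1 s of a 440 Hz tone
feats = log_mel(wave)
print(feats.shape)  # (61, 80): frames x mel dimensions
```

With `n_mels` between 40 and 80 this matches the 40-80-dimensional Mel features mentioned in the text; production systems typically use a library implementation (e.g. librosa) rather than this hand-rolled version.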
Step b: train a speech feature extraction model with the sample acoustic features of both corpora to generate the corresponding sample voice feature vectors. Referring to fig. 3, the speech feature extraction model inputs the sample acoustic features obtained in step a into a deep speech recognition model and then trains them with a deep learning network to obtain the sample voice feature vectors corresponding to the different sample acoustic features. The deep learning network comprises: a convolutional neural network for extracting features from the audio of the original speech; a weight calculation network for processing the convolutional information to obtain a weight for each convolutional feature and discarding the speech signal with the minimum weight; the original signal is turned into K feature vectors, which are matrix-multiplied with the obtained weights to yield the dimensionality-reduced features; and, in combination with the loss function, the corresponding N-dimensional target voice features are obtained.
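A minimal sketch of the weighting scheme described for the deep learning network, with random stand-ins for the convolutional features, the weight-computation network, and the trained projection (all names and dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

def speaker_embedding(conv_feats, w_score, proj):
    """Attention-style pooling sketch of the extraction network.

    conv_feats: (K, D) -- K convolutional feature vectors from the audio.
    w_score:    (D,)   -- stand-in for the weight calculation network.
    proj:       (D, N) -- projection to the N-dim voice feature
                          (learned via the loss function in practice).
    """
    scores = conv_feats @ w_score
    weights = np.exp(scores) / np.exp(scores).sum()  # one weight per feature
    weights[np.argmin(weights)] = 0.0                # drop the minimum-weight signal
    weights /= weights.sum()                         # renormalize
    pooled = weights @ conv_feats                    # (K,)x(K,D)->(D,): reduced
    return pooled @ proj                             # N-dim target voice feature

K, D, N = 16, 64, 8
emb = speaker_embedding(rng.standard_normal((K, D)),
                        rng.standard_normal(D),
                        rng.standard_normal((D, N)) * 0.1)
print(emb.shape)  # (8,)
```

The weighted matrix multiplication is what performs the dimensionality reduction from K feature vectors to a single voice feature vector.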
Step c: train a hybrid end-to-end model from text features to acoustic features using the voice feature vectors of the large-scale sample speech together with the corresponding texts. Referring to figs. 2 and 4, an end-to-end neural network is adopted: the feature vectors of the text and of the acoustic features are combined, a limited-range attention mechanism is used inside the end-to-end network, and the combined features are decoded with the weights obtained from the attention mechanism to produce the acoustic features at the output end. The voice feature vectors of the large-scale sample speech are obtained from its audio acoustic features, preferably at an audio sampling rate of 22 kHz or higher.
Step d: feed the acoustic features generated by the hybrid end-to-end model into a neural network vocoder model, which outputs audio codes, and train a vocoder average model. A neural network predicts the coded values of the audio from the target acoustic features, and the target audio is assembled from these outputs.
Step e: perform adaptive model training on top of the hybrid end-to-end model using the voice feature vector of the target sample speech and its corresponding text, yielding a personalized end-to-end model. A feature vector of the target sample speech is generated through the feature extraction network; the target's voice feature vector is combined with the text vector of the target sample speech as the input of the hybrid model, adaptive learning is performed with the target's linear prediction coefficient features as the output, and the personalized end-to-end model is obtained.
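The adaptive (fine-tuning) step can be illustrated with a linear model standing in for the end-to-end network: start from weights trained on the large-scale corpus and take a few gradient steps on the small target corpus only. All dimensions and hyperparameters below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

def adapt(W_base, X_target, Y_target, lr=0.05, steps=200):
    """Adaptive learning sketch: start from the hybrid (average) model
    weights and fine-tune on the small target-speaker corpus with plain
    gradient descent on the mean squared error."""
    W = W_base.copy()
    for _ in range(steps):
        err = X_target @ W - Y_target                 # error on target data
        W -= lr * X_target.T @ err / len(X_target)    # MSE gradient step
    return W

# Hybrid model trained elsewhere on large-scale data (random stand-in here).
W_base = rng.standard_normal((12, 4)) * 0.1
# Small target corpus whose true mapping differs from the base model.
W_true = rng.standard_normal((12, 4))
X = rng.standard_normal((50, 12))
Y = X @ W_true
W_pers = adapt(W_base, X, Y)
before = np.mean((X @ W_base - Y) ** 2)
after = np.mean((X @ W_pers - Y) ** 2)
print(after < before)  # adaptation reduces the target-speaker loss
```

Because only a few steps on a small corpus are needed, this is much cheaper than retraining the full model, which is the efficiency argument the patent makes.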
Step f: generate the target's acoustic features with the personalized end-to-end model and adaptively train the vocoder average model, yielding a personalized vocoder model. A recurrent neural network predicts the coded values of the audio from the target's acoustic features, and the personalized vocoder model is trained in combination with the output target audio; during training, the generated acoustic features are blurred and a small amount of interference spectrum is inserted.
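The blurring and interference-spectrum insertion used during vocoder adaptation might look like the following augmentation; the blur kernel, noise scale, and insertion probability are assumptions, as the patent gives no concrete values:

```python
import numpy as np

rng = np.random.default_rng(3)

def blur_and_perturb(mel, noise_scale=0.01, p=0.05):
    """Training-time augmentation sketch for the vocoder adaptation step:
    blur the generated acoustic features along time and insert a small
    amount of interference into a random fraction of bins."""
    # Simple 3-tap moving-average blur along the time axis.
    blurred = mel.copy()
    blurred[1:-1] = (mel[:-2] + mel[1:-1] + mel[2:]) / 3.0
    # Insert low-amplitude interference into roughly a fraction p of bins.
    mask = rng.random(mel.shape) < p
    blurred[mask] += rng.standard_normal(mask.sum()) * noise_scale
    return blurred

mel = rng.standard_normal((100, 80))  # generated acoustic features (frames x bins)
aug = blur_and_perturb(mel)
print(aug.shape)  # (100, 80)
```

The intent of such augmentation is to make the vocoder robust to the imperfect acoustic features the end-to-end model produces at synthesis time, rather than only to clean ground-truth features.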
In the personalization process, the speech feature extraction model needs no adaptive training; only the text-to-speech-feature personalized end-to-end model and the personalized vocoder model require adaptive learning.
Step g: in the synthesis stage, combine the feature vector of the desired text with the target's voice feature vector as input, obtain the target's acoustic features through the personalized end-to-end model, and output the desired target speech via the personalized vocoder model. The target's voice feature vector may be obtained from the speech feature extraction model at synthesis time, or it may be a feature vector previously generated for the target by that model.
The end-to-end network used by both the hybrid end-to-end model and the personalized end-to-end model is shown in fig. 4. Specifically, the text features of the phonemes obtained by converting the text are combined with the voice feature vectors generated in the previous step as the input of the end-to-end network.
The end-to-end network is divided into three parts: an encoder, a decoder, and a back-end processor. An attention mechanism is used within the decoder: a window is placed around the previous maximum-weight point and the next maximum-weight point is searched for inside it, which improves alignment efficiency. The network adopts a recurrent structure, increases the dimensionality of the voice features, adds feature-vector terms to the loss function to improve the training fit, and uses Mel features to improve the naturalness of the synthesized voice.
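The limited-range attention described here, a window around the previous maximum-weight point, can be sketched as a masking step before normalization; the window width is an illustrative hyperparameter:

```python
import numpy as np

def windowed_attention(scores, prev_peak, width=5):
    """Limited-range attention sketch: keep only the scores inside a
    window around the previous maximum-weight point, then renormalize
    with a softmax. Returns the weights and the new peak position."""
    T = len(scores)
    lo, hi = max(0, prev_peak - width), min(T, prev_peak + width + 1)
    masked = np.full(T, -np.inf)
    masked[lo:hi] = scores[lo:hi]          # everything outside the window is masked
    w = np.exp(masked - masked[lo:hi].max())
    w /= w.sum()
    return w, int(np.argmax(w))

scores = np.linspace(0.0, 1.0, 50)  # raw alignment scores for 50 encoder steps
w, peak = windowed_attention(scores, prev_peak=10)
print(peak)  # 15 -- the window [5, 15] caps how far the peak can jump
```

Restricting the search to a window both speeds up alignment and enforces the near-monotonic text-to-speech alignment that synthesis requires.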
In the second embodiment, the invention further provides a device for generating personalized voice, which is shown in fig. 5. The device can adopt the method for generating the personalized voice, and the device further comprises: the voice acquisition unit and the extraction unit are used for acquiring target sample voice and large-scale sample voice and extracting acoustic characteristics of the voice; the speaker audio feature extraction unit is used for training a voice feature extraction model and extracting a voice feature vector; an end-to-end text feature to audio feature unit for training a mixed end-to-end model from text features to acoustic features; a vocoder unit for training generation of a vocoder average model; the personalized end-to-end text feature to audio feature unit is used for training a personalized end-to-end model from the target text feature to the acoustic feature; a personalized vocoder unit for training a personalized vocoder model; and the voice synthesis unit is connected with the personalized vocoder unit and is used for realizing personalized voice.
With the device of the second embodiment, the voice audio is adaptively learned on top of the base model from a small amount of corpus data; the user can personalize the speech signal in a short time without further corpora, and the MOS (mean opinion score) of the synthesized voice reaches about 4.0.
In this method, the feature vector of the target voice is combined with the text feature vector, and the end-to-end text-feature-to-audio-feature unit performs adaptive learning on the trained hybrid end-to-end model, which is equivalent to adaptive learning on the input closest to the target voice characteristics. This reduces the time required for adaptive learning, reduces the feedback loss of the neural network fit, reduces the adjustment amplitude of the neural network, and improves the accuracy of adaptive learning. Through the personalized vocoder unit, the loss of vocoder synthesis is reduced and the naturalness of the synthesized speech is improved.
Although the present invention has been described with reference to the illustrated embodiments, which are preferred embodiments, the invention is not limited thereto; numerous other modifications and embodiments can be devised by those skilled in the art that fall within the spirit and scope of the principles of this disclosure.
Claims (6)
1. A method of generating personalized speech, the method comprising the steps of:
step a, collecting target sample speech and large-scale sample speech, and extracting the sample acoustic features corresponding to both;
step b, training a speech feature extraction model with the sample acoustic features of both corpora to generate the corresponding sample voice feature vectors;
step c, training a hybrid end-to-end model from text features to acoustic features using the voice feature vectors of the large-scale sample speech together with the texts corresponding to it;
step d, inputting the acoustic features generated by the hybrid end-to-end model into a neural network vocoder model, which outputs audio codes, and training a vocoder average model;
step e, performing adaptive model training on top of the hybrid end-to-end model using the voice feature vector of the target sample speech and its corresponding text, yielding a personalized end-to-end model;
step f, generating the target's acoustic features with the personalized end-to-end model and adaptively training the vocoder average model, yielding a personalized vocoder model;
step g, in the synthesis stage, combining the feature vector of the desired text with the target's voice feature vector as input, obtaining the target's acoustic features through the personalized end-to-end model, and outputting the desired target speech via the personalized vocoder model.
2. The method of claim 1, wherein in step b the speech feature extraction model inputs the sample acoustic features obtained in step a into a deep speech recognition model and trains them with a deep learning network to obtain the sample voice feature vectors corresponding to the different sample acoustic features.
3. The method of generating personalized speech according to claim 2, wherein the deep learning network comprises: a convolutional neural network for extracting features from the audio of the original speech; a weight calculation network for processing the convolutional information to obtain a weight for each convolutional feature and discarding the speech signal with the minimum weight; the original signal is turned into K feature vectors, which are matrix-multiplied with the obtained weights to yield the dimensionality-reduced features; and, in combination with the loss function, the corresponding N-dimensional target voice features are obtained.
4. The method of claim 1, wherein in step c an end-to-end neural network is adopted: the feature vectors of the text and of the acoustic features are combined, a limited-range attention mechanism is used inside the end-to-end neural network, and the combined features are decoded with the weights obtained from the attention mechanism to produce the acoustic features at the output end.
5. The method of claim 1, wherein in step f a recurrent neural network is used to predict the coded values of the audio from the target's acoustic features, a personalized vocoder model is trained in combination with the output target audio, and during training the generated acoustic features are blurred and a small amount of interference spectrum is inserted.
6. An apparatus for generating personalized speech, wherein the method of any one of claims 1-5 is employed, the apparatus comprising: a voice acquisition and extraction unit for collecting target sample speech and large-scale sample speech and extracting their acoustic features; a speaker audio feature extraction unit for training the speech feature extraction model and extracting voice feature vectors; an end-to-end text-feature-to-audio-feature unit for training the hybrid end-to-end model from text features to acoustic features; a vocoder unit for training the vocoder average model; a personalized end-to-end text-feature-to-audio-feature unit for training the personalized end-to-end model from target text features to acoustic features; a personalized vocoder unit for training the personalized vocoder model; and a speech synthesis unit, connected to the personalized vocoder unit, for producing the personalized speech.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911046823.7A CN110767210A (en) | 2019-10-30 | 2019-10-30 | Method and device for generating personalized voice |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911046823.7A CN110767210A (en) | 2019-10-30 | 2019-10-30 | Method and device for generating personalized voice |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110767210A true CN110767210A (en) | 2020-02-07 |
Family
ID=69334723
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911046823.7A Pending CN110767210A (en) | 2019-10-30 | 2019-10-30 | Method and device for generating personalized voice |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110767210A (en) |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111326138A (en) * | 2020-02-24 | 2020-06-23 | 北京达佳互联信息技术有限公司 | Voice generation method and device |
CN111462727A (en) * | 2020-03-31 | 2020-07-28 | 北京字节跳动网络技术有限公司 | Method, apparatus, electronic device and computer readable medium for generating speech |
CN111462728A (en) * | 2020-03-31 | 2020-07-28 | 北京字节跳动网络技术有限公司 | Method, apparatus, electronic device and computer readable medium for generating speech |
CN111739536A (en) * | 2020-05-09 | 2020-10-02 | 北京捷通华声科技股份有限公司 | Audio processing method and device |
CN111785258A (en) * | 2020-07-13 | 2020-10-16 | 四川长虹电器股份有限公司 | Personalized voice translation method and device based on speaker characteristics |
CN112687296A (en) * | 2021-03-10 | 2021-04-20 | 北京世纪好未来教育科技有限公司 | Audio disfluency identification method, device, equipment and readable storage medium |
WO2021169825A1 (en) * | 2020-02-25 | 2021-09-02 | 阿里巴巴集团控股有限公司 | Speech synthesis method and apparatus, device and storage medium |
CN113409767A (en) * | 2021-05-14 | 2021-09-17 | 北京达佳互联信息技术有限公司 | Voice processing method and device, electronic equipment and storage medium |
CN113488057A (en) * | 2021-08-18 | 2021-10-08 | 山东新一代信息产业技术研究院有限公司 | Health-oriented conversation implementation method and system |
WO2022094740A1 (en) * | 2020-11-03 | 2022-05-12 | Microsoft Technology Licensing, Llc | Controlled training and use of text-to-speech models and personalized model generated voices |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20140114663A1 (en) * | 2012-10-19 | 2014-04-24 | Industrial Technology Research Institute | Guided speaker adaptive speech synthesis system and method and computer program product |
JP2015018080A (en) * | 2013-07-10 | 2015-01-29 | 日本電信電話株式会社 | Speech synthesis model learning device and speech synthesis device, and method and program thereof |
CN107481713A (en) * | 2017-07-17 | 2017-12-15 | 清华大学 | A kind of hybrid language phoneme synthesizing method and device |
CN107564511A (en) * | 2017-09-25 | 2018-01-09 | 平安科技(深圳)有限公司 | Electronic installation, phoneme synthesizing method and computer-readable recording medium |
CN109346056A (en) * | 2018-09-20 | 2019-02-15 | 中国科学院自动化研究所 | Phoneme synthesizing method and device based on depth measure network |
CN110148398A (en) * | 2019-05-16 | 2019-08-20 | 平安科技(深圳)有限公司 | Training method, device, equipment and the storage medium of speech synthesis model |
CN110379411A (en) * | 2018-04-11 | 2019-10-25 | 阿里巴巴集团控股有限公司 | For the phoneme synthesizing method and device of target speaker |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110767210A (en) | Method and device for generating personalized voice | |
Han et al. | Semantic-preserved communication system for highly efficient speech transmission | |
Kelly et al. | Deep neural network based forensic automatic speaker recognition in VOCALISE using x-vectors | |
CN113450761B (en) | Parallel voice synthesis method and device based on variation self-encoder | |
CN114023300A (en) | Chinese speech synthesis method based on diffusion probability model | |
KR102272554B1 (en) | Method and system of text to multiple speech | |
CN114495969A (en) | Voice recognition method integrating voice enhancement | |
CN116994553A (en) | Training method of speech synthesis model, speech synthesis method, device and equipment | |
CN111326170A (en) | Method and device for converting ear voice into normal voice by combining time-frequency domain expansion convolution | |
KR20190135853A (en) | Method and system of text to multiple speech | |
CN111724809A (en) | Vocoder implementation method and device based on variational self-encoder | |
CN117041430B (en) | Method and device for improving outbound quality and robustness of intelligent coordinated outbound system | |
CN116994600B (en) | Method and system for driving character mouth shape based on audio frequency | |
Oura et al. | Deep neural network based real-time speech vocoder with periodic and aperiodic inputs | |
WO2022228704A1 (en) | Decoder | |
CN117409761A (en) | Method, device, equipment and storage medium for synthesizing voice based on frequency modulation | |
Mei et al. | A particular character speech synthesis system based on deep learning | |
CN112767912A (en) | Cross-language voice conversion method and device, computer equipment and storage medium | |
Fujiwara et al. | Data augmentation based on frequency warping for recognition of cleft palate speech | |
CN115359775A (en) | End-to-end tone and emotion migration Chinese voice cloning method | |
CN115359778A (en) | Confrontation and meta-learning method based on speaker emotion voice synthesis model | |
Nikitaras et al. | Fine-grained noise control for multispeaker speech synthesis | |
Nijhawan et al. | Real time speaker recognition system for hindi words | |
CN117909486B (en) | Multi-mode question-answering method and system based on emotion recognition and large language model | |
Avikal et al. | Estimation of age from speech using excitation source features |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20200207 |