CN110767210A - Method and device for generating personalized voice

Info

Publication number
CN110767210A
CN110767210A
Authority
CN
China
Prior art keywords
model
voice
personalized
training
vocoder
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911046823.7A
Other languages
Chinese (zh)
Inventor
周琳岷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan Changhong Electric Co Ltd
Original Assignee
Sichuan Changhong Electric Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan Changhong Electric Co Ltd filed Critical Sichuan Changhong Electric Co Ltd
Priority to CN201911046823.7A
Publication of CN110767210A

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/02: Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04: Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L13/047: Architecture of speech synthesisers
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention discloses a method and a device for generating personalized voice. The sound feature of the target voice is combined with the text feature vector, and the end-to-end text-feature-to-audio-feature unit performs adaptive learning on the trained hybrid end-to-end model, which is equivalent to performing adaptive learning on the input closest to the target sound feature. Through the personalized vocoder unit, the loss introduced by vocoder synthesis is reduced and the naturalness of the synthesized speech is improved.

Description

Method and device for generating personalized voice
Technical Field
The invention relates to the technical field of voice personalization, and in particular to a method and a device for generating personalized voice.
Background
With the development of the smart home, voice personalization technology is being applied in more and more fields. The development of voice broadcasting technology has greatly facilitated people's lives and improved their quality of life. Most existing voice personalization techniques extract voiceprint features from parallel corpora of the personalization target and the source speaker and then apply a matrix transformation; such methods, for example those based on DTW (dynamic time warping), place high demands on the amount of speech data and are time-consuming.
Disclosure of Invention
In view of the above, the invention provides a small-corpus personalization method and device based on speaker characteristics, solving the problems that existing voice personalization algorithms require large amounts of clean speech data and take a long time to train.
The invention solves these problems through the following technical scheme. A method of generating personalized speech comprises the steps of:
step a, collecting target sample voice and large-scale sample voice, and extracting the sample acoustic features corresponding to the two voices;
step b, training a voice feature extraction model with the sample acoustic features corresponding to the two voices, to generate the corresponding sample sound feature vectors;
step c, training a hybrid end-to-end model from text features to acoustic features by using the sound feature vectors of the large-scale sample voice together with the texts corresponding to the large-scale sample voice;
step d, inputting the acoustic features generated by the hybrid end-to-end model into a neural network vocoder model, which outputs audio codes, and training it to obtain a vocoder average model;
step e, performing adaptive model training on top of the hybrid end-to-end model with the sound feature vector of the target sample voice and the text corresponding to the target sample voice, to obtain a personalized end-to-end model;
step f, generating the acoustic features of the target with the personalized end-to-end model and performing adaptive model training on the vocoder average model, to obtain a personalized vocoder model;
step g, in the synthesis stage, combining the feature vector of the required text and the sound feature vector of the target as input, obtaining the acoustic features of the target through the personalized end-to-end model, and outputting the required target voice with the personalized vocoder model.
Preferably, in step b, the voice feature extraction model inputs the sample acoustic features obtained in step a into a deep speech recognition model and then trains them with a deep learning network, obtaining the sample sound feature vectors corresponding to the different sample acoustic features.
Preferably, the deep learning network comprises: a convolutional neural network for extracting features from the audio of the original voice; and a weight-calculation network for processing the convolution information to obtain a weight for each convolutional feature and removing the voice signal with the smallest weight. K feature vectors generated from the original signal are matrix-multiplied with the obtained weights to yield the dimension-reduced features, and, combined with the loss function, the corresponding N-dimensional target voice features are obtained.
Preferably, in step c, an end-to-end neural network is adopted; the feature vectors of the text and the feature vectors of the acoustic features are combined, a limited-range attention mechanism is used in the end-to-end network, and the combined features are decoded with the weights obtained from the attention mechanism to produce the acoustic features at the output end.
Preferably, in step f, a recurrent neural network is used to predict the coded values of the audio from the acoustic features of the target, and a personalized vocoder model is trained together with the output target audio; during training, fuzzy processing is applied to the generated acoustic features and a small amount of interference spectrum is inserted.
Further, the present invention also provides an apparatus for generating personalized speech that employs the foregoing method. The apparatus comprises: a voice acquisition and extraction unit for collecting target sample voice and large-scale sample voice and extracting the acoustic features of the voice; a speaker audio feature extraction unit for training the voice feature extraction model and extracting sound feature vectors; an end-to-end text-feature-to-audio-feature unit for training the hybrid end-to-end model from text features to acoustic features; a vocoder unit for training the vocoder average model; a personalized end-to-end text-feature-to-audio-feature unit for training the personalized end-to-end model from target text features to acoustic features; a personalized vocoder unit for training the personalized vocoder model; and a speech synthesis unit, connected to the personalized vocoder unit, for producing the personalized voice.
The invention has the beneficial effect that it can be applied to, but is not limited to, the field of voice personalization.
Drawings
FIG. 1 is a block diagram of a process for training models in generating personalized speech according to an embodiment of the present invention;
FIG. 2 is a main framework for generating a personalized voice network according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a voice feature extraction network according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of an end-to-end network provided by an embodiment of the present invention;
FIG. 5 is a block diagram of an apparatus for generating personalized speech according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to examples, but the embodiments of the present invention are not limited thereto.
In a first embodiment, referring to FIG. 1 and FIG. 2, the present invention provides a method for generating personalized speech, the method comprising the steps of:
Step a, collecting target sample voice and large-scale sample voice, and extracting the sample acoustic features corresponding to the two voices. Voices of multiple speakers recorded in a studio are collected as the large-scale sample voice; the sampling frequency of the voice should be at least 16000 Hz where possible, and the collected target sample voice should, as far as possible, contain all combinations of Chinese initials, finals and tones, so that all Chinese phonemes are covered. The extracted sample acoustic features include Mel features, linear prediction coefficient features, and the like. The Mel features are extracted with a windowed, framed Fourier transform, which converts the time domain into the frequency domain; when the audio features are extracted, the Mel features have 40-80 dimensions, and the linear prediction coefficient feature input is limited to N cepstrum coefficients and M pitch parameters (such as period, correlation, and the like).
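By way of illustration only, the Mel feature extraction described above can be sketched with the librosa library as follows; the window length, frame shift and the choice of 80 Mel dimensions are illustrative assumptions within the 40-80 range stated above, not values fixed by the invention.

```python
import librosa
import numpy as np

def extract_mel_features(wav_path, sr=16000, n_mels=80):
    """Windowed, framed Fourier transform -> Mel features (n_mels per frame)."""
    audio, sr = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=audio, sr=sr,
        n_fft=1024,        # window length of the framed Fourier transform
        hop_length=256,    # frame shift
        n_mels=n_mels)     # 40-80 Mel dimensions, as described above
    # Log compression is common practice for acoustic features.
    return np.log(mel + 1e-6).T  # shape: (frames, n_mels)
```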
Step b, training a voice feature extraction model with the sample acoustic features corresponding to the two voices, to generate the corresponding sample sound feature vectors. Referring to FIG. 3, the voice feature extraction model inputs the sample acoustic features obtained in step a into a deep speech recognition model and then trains them with a deep learning network, obtaining the sample sound feature vectors corresponding to the different sample acoustic features. The deep learning network includes: a convolutional neural network for extracting features from the audio of the original voice; and a weight-calculation network for processing the convolution information to obtain a weight for each convolutional feature and removing the voice signal with the smallest weight. K feature vectors generated from the original signal are matrix-multiplied with the obtained weights to yield the dimension-reduced features, and, combined with the loss function, the corresponding N-dimensional target voice features are obtained.
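One possible reading of this network is sketched below in PyTorch, where the K feature vectors are taken to be the per-frame convolutional outputs and the weight-calculation network is a single scoring layer; the training loss (for example, a speaker-classification cross-entropy in keeping with the deep speech recognition setting) is assumed rather than prescribed here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeakerFeatureNet(nn.Module):
    """Sketch of the voice feature extraction network: a CNN produces K
    convolutional feature vectors, a weight-calculation network scores them,
    the lowest-weighted vector is removed, and a matrix multiplication of the
    K vectors with the weights yields the N-dimensional voice feature."""

    def __init__(self, n_mels=80, d_conv=128, n_out=256):
        super().__init__()
        self.conv = nn.Sequential(                 # convolutional feature extractor
            nn.Conv1d(n_mels, d_conv, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(d_conv, d_conv, kernel_size=5, padding=2), nn.ReLU())
        self.weight_net = nn.Linear(d_conv, 1)     # weight-calculation network
        self.proj = nn.Linear(d_conv, n_out)       # maps to the N-dim target feature

    def forward(self, mel):                        # mel: (batch, frames, n_mels)
        v = self.conv(mel.transpose(1, 2)).transpose(1, 2)      # (batch, K, d_conv)
        w = F.softmax(self.weight_net(v).squeeze(-1), dim=-1)   # one weight per vector
        # Remove the feature vector with the smallest weight, as described above.
        w = w.scatter(1, w.argmin(dim=1, keepdim=True), 0.0)
        reduced = torch.bmm(w.unsqueeze(1), v).squeeze(1)       # matrix multiply
        return self.proj(reduced)                  # N-dimensional voice feature vector
```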
Step c, training a hybrid end-to-end model from text features to acoustic features by using the sound feature vectors of the large-scale sample voice together with the texts corresponding to the large-scale sample voice. Referring to FIG. 2 and FIG. 4, an end-to-end neural network is adopted; the feature vectors of the text and the feature vectors of the acoustic features are combined, a limited-range attention mechanism is used in the end-to-end network, and the combined features are decoded with the weights obtained from the attention mechanism to produce the acoustic features at the output end. The sound feature vectors of the large-scale sample voice are obtained from its audio acoustic features, preferably at an audio sampling rate of 22 kHz or higher.
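One simple way the text features and the sound feature vector could be combined as network input is sketched below; the broadcast-and-concatenate scheme and all shapes are assumptions for illustration.

```python
import torch

def combine_text_and_speaker(text_feats, speaker_vec):
    """Sketch of the input combination for the hybrid end-to-end model: the
    speaker sound feature vector is broadcast over the text/phoneme feature
    sequence and concatenated with it. Shapes are illustrative assumptions."""
    # text_feats: (T, d_text); speaker_vec: (d_spk,)
    spk = speaker_vec.unsqueeze(0).expand(text_feats.size(0), -1)
    return torch.cat([text_feats, spk], dim=-1)   # (T, d_text + d_spk)
```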
Step d, inputting the acoustic features generated by the hybrid end-to-end model into a neural network vocoder model, which outputs audio codes, and training it to obtain a vocoder average model. A neural network predicts the coded values of the audio from the target acoustic features, and these are combined to output the target audio.
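For illustration, one way such a vocoder could be structured is sketched below: a recurrent network predicts a distribution over coded sample values from upsampled acoustic features and the previous coded sample. The assumption of 8-bit mu-law codes (256 classes) and all layer sizes are illustrative, not fixed by the invention.

```python
import torch
import torch.nn as nn

class VocoderAverageModel(nn.Module):
    """Sketch of the neural network vocoder: a GRU predicts the coded value
    of the next audio sample from the acoustic features and the previous
    coded sample."""

    def __init__(self, n_feats=80, emb=64, hidden=512, n_codes=256):
        super().__init__()
        self.embed = nn.Embedding(n_codes, emb)          # previous coded sample
        self.rnn = nn.GRU(n_feats + emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_codes)            # distribution over codes

    def forward(self, feats, prev_codes):
        # feats: (batch, T, n_feats), acoustic features upsampled to sample rate
        # prev_codes: (batch, T), integer codes of the preceding samples
        x = torch.cat([feats, self.embed(prev_codes)], dim=-1)
        h, _ = self.rnn(x)
        return self.out(h)                               # (batch, T, n_codes) logits
```

Training this network with a cross-entropy loss against the recorded samples of the large-scale corpus would yield the vocoder average model described above.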
Step e, performing adaptive model training on top of the hybrid end-to-end model with the sound feature vector of the target sample voice and the text corresponding to the target sample voice, to obtain a personalized end-to-end model. A feature vector of the target sample voice is generated by the feature extraction network; the sound feature vector generated for the target is combined with the text vector corresponding to the target sample voice as the input of the hybrid model, and adaptive learning is performed on the hybrid model with the linear prediction coefficient features of the target as the output, yielding the personalized end-to-end model.
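A minimal sketch of such an adaptation loop, assuming the hybrid model accepts the combined text and speaker vectors as input, is given below; the optimizer, learning rate, epoch count and L1 loss are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def adapt_end_to_end(model, target_loader, epochs=5, lr=1e-5):
    """Fine-tune the trained hybrid end-to-end model on the small target
    corpus to obtain the personalized end-to-end model."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for text_vec, speaker_vec, target_feats in target_loader:
            pred = model(text_vec, speaker_vec)   # predicted acoustic features
            loss = F.l1_loss(pred, target_feats)  # target LPC features as output
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```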
Step f, generating the acoustic features of the target with the personalized end-to-end model and performing adaptive model training on the vocoder average model, to obtain a personalized vocoder model. A recurrent neural network predicts the coded values of the audio from the acoustic features of the target, and the personalized vocoder model is trained together with the output target audio; during training, fuzzy processing is applied to the generated acoustic features and a small amount of interference spectrum is inserted.
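The sketch below illustrates one possible form of this fuzzy processing with NumPy and SciPy: the generated features are lightly blurred along the time axis and a small interference spectrum is added to a random subset of frames. All parameter values are illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def fuzz_features(feats, sigma=1.0, frac=0.05, scale=0.1):
    """Fuzzy processing for vocoder adaptation: time-axis blur plus a small
    interference spectrum inserted into a few random frames."""
    out = gaussian_filter1d(feats, sigma=sigma, axis=0)      # feats: (frames, dims)
    n = max(1, int(frac * feats.shape[0]))
    idx = np.random.choice(feats.shape[0], n, replace=False)
    out[idx] += scale * np.random.randn(n, feats.shape[1])   # interference spectrum
    return out
```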
During personalization, the voice feature extraction model requires no adaptive training; only the personalized text-feature-to-acoustic-feature end-to-end model and the personalized vocoder model undergo adaptive learning.
Step g, in the synthesis stage, combining the feature vector of the required text and the sound feature vector of the target as input, obtaining the acoustic features of the target through the personalized end-to-end model, and outputting the required target voice with the personalized vocoder model. The sound feature vector of the target may be one obtained by the voice feature extraction model at synthesis time, or a feature vector previously generated for the target with the voice feature extraction model.
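The synthesis stage can be summarized with the following sketch, in which e2e_model and vocoder stand for the personalized end-to-end model and the personalized vocoder model, and vocoder.generate and decode_codes are hypothetical helpers for autoregressive sampling and for mapping coded values back to a waveform.

```python
def synthesize(text_vec, speaker_vec, e2e_model, vocoder, decode_codes):
    """Sketch of the synthesis stage under the assumptions stated above."""
    acoustic_feats = e2e_model(text_vec, speaker_vec)  # target acoustic features
    codes = vocoder.generate(acoustic_feats)           # predicted audio codes
    return decode_codes(codes)                         # target speech waveform
```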
The end-to-end network adopted by the hybrid end-to-end model and the personalized end-to-end model is shown in FIG. 4. Specifically, the text features of the phonemes obtained from text conversion and the sound feature vectors generated in the preceding step are combined as the input of the end-to-end network.
The end-to-end network is divided into three parts: an encoder, a decoder and a post-processing network. An attention mechanism is used in the decoder; a window is placed around the previous maximum-weight point and the next maximum-weight point is searched for within it, which improves alignment efficiency. The network adopts a recurrent structure, raises the dimensionality of the sound features and enlarges the feature vectors entering the loss function to improve the fit of training, and uses Mel features to improve the naturalness of the synthesized sound.
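A minimal sketch of such a limited-range attention step is given below; the window size, shapes and dot-product scoring are illustrative assumptions.

```python
import torch

def limited_range_attention(query, keys, prev_pos, window=10):
    """Limited-range attention: scores are computed only inside a window
    centred on the previous maximum-weight point, and the next
    maximum-weight point is searched within it."""
    T = keys.size(0)                                   # keys: (T, dim), query: (dim,)
    lo, hi = max(0, prev_pos - window), min(T, prev_pos + window + 1)
    scores = keys[lo:hi] @ query                       # scores inside the window only
    weights = torch.softmax(scores, dim=0)
    context = weights @ keys[lo:hi]                    # weighted sum for decoding
    next_pos = lo + int(weights.argmax())              # next maximum-weight point
    return context, next_pos
```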
In a second embodiment, the invention further provides an apparatus for generating personalized voice, shown in FIG. 5. The apparatus adopts the foregoing method of generating personalized voice and comprises: a voice acquisition and extraction unit for collecting target sample voice and large-scale sample voice and extracting the acoustic features of the voice; a speaker audio feature extraction unit for training the voice feature extraction model and extracting sound feature vectors; an end-to-end text-feature-to-audio-feature unit for training the hybrid end-to-end model from text features to acoustic features; a vocoder unit for training the vocoder average model; a personalized end-to-end text-feature-to-audio-feature unit for training the personalized end-to-end model from target text features to acoustic features; a personalized vocoder unit for training the personalized vocoder model; and a speech synthesis unit, connected to the personalized vocoder unit, for producing the personalized voice.
With the apparatus of the second embodiment, the target audio is adaptively learned on top of the base models from a small amount of corpus data; the user can complete voice personalization in a short time without any additional corpora, and the mean opinion score (MOS) of the synthesized speech reaches about 4.0.
According to the method, the sound feature of the target voice is combined with the text feature vector, and adaptive learning is performed on the trained hybrid end-to-end model through the end-to-end text-feature-to-audio-feature unit, which is equivalent to performing adaptive learning on the input closest to the target sound feature. In this way the time required for adaptive learning is reduced, the feedback loss of the neural network fit is reduced, the adjustment amplitude of the neural network is reduced, and the accuracy of adaptive learning is improved. Through the personalized vocoder unit, the loss introduced by vocoder synthesis is reduced and the naturalness of speech synthesis is improved.
Although the present invention has been described herein with reference to the illustrated, preferred embodiments, it is to be understood that the invention is not limited thereto, and that numerous other modifications and embodiments can be devised by those skilled in the art within the spirit and scope of the principles of this disclosure.

Claims (6)

1. A method of generating personalized speech, the method comprising the steps of:
step a, collecting target sample voice and large-scale sample voice, and extracting the sample acoustic features corresponding to the two voices;
step b, training a voice feature extraction model with the sample acoustic features corresponding to the two voices, to generate the corresponding sample sound feature vectors;
step c, training a hybrid end-to-end model from text features to acoustic features by using the sound feature vectors of the large-scale sample voice together with the texts corresponding to the large-scale sample voice;
step d, inputting the acoustic features generated by the hybrid end-to-end model into a neural network vocoder model, which outputs audio codes, and training it to obtain a vocoder average model;
step e, performing adaptive model training on top of the hybrid end-to-end model with the sound feature vector of the target sample voice and the text corresponding to the target sample voice, to obtain a personalized end-to-end model;
step f, generating the acoustic features of the target with the personalized end-to-end model and performing adaptive model training on the vocoder average model, to obtain a personalized vocoder model;
step g, in the synthesis stage, combining the feature vector of the required text and the sound feature vector of the target as input, obtaining the acoustic features of the target through the personalized end-to-end model, and outputting the required target voice with the personalized vocoder model.
2. The method of generating personalized speech according to claim 1, wherein in step b the voice feature extraction model inputs the sample acoustic features obtained in step a into a deep speech recognition model and trains them with a deep learning network, obtaining the sample sound feature vectors corresponding to the different sample acoustic features.
3. The method of generating personalized speech according to claim 2, wherein the deep learning network comprises: a convolutional neural network for extracting features from the audio of the original voice; and a weight-calculation network for processing the convolution information to obtain a weight for each convolutional feature and removing the voice signal with the smallest weight; wherein K feature vectors generated from the original signal are matrix-multiplied with the obtained weights to yield the dimension-reduced features, and, combined with the loss function, the corresponding N-dimensional target voice features are obtained.
4. The method of generating personalized speech according to claim 1, wherein in step c an end-to-end neural network is adopted to combine the feature vectors of the text and of the acoustic features, a limited-range attention mechanism is used in the end-to-end neural network, and the combined features are decoded with the weights obtained from the attention mechanism to produce the acoustic features at the output end.
5. The method of generating personalized speech according to claim 1, wherein in step f a recurrent neural network is used to predict the coded values of the audio from the acoustic features of the target, a personalized vocoder model is trained together with the output target audio, and, during training, fuzzy processing is applied to the generated acoustic features and a small amount of interference spectrum is inserted.
6. An apparatus for generating personalized speech, wherein the apparatus employs the method of any one of claims 1-5 and comprises: a voice acquisition and extraction unit for collecting target sample voice and large-scale sample voice and extracting the acoustic features of the voice; a speaker audio feature extraction unit for training the voice feature extraction model and extracting sound feature vectors; an end-to-end text-feature-to-audio-feature unit for training the hybrid end-to-end model from text features to acoustic features; a vocoder unit for training the vocoder average model; a personalized end-to-end text-feature-to-audio-feature unit for training the personalized end-to-end model from target text features to acoustic features; a personalized vocoder unit for training the personalized vocoder model; and a speech synthesis unit, connected to the personalized vocoder unit, for producing the personalized voice.
CN201911046823.7A, filed 2019-10-30: Method and device for generating personalized voice (CN110767210A, pending)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911046823.7A, priority date 2019-10-30, filing date 2019-10-30: Method and device for generating personalized voice

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911046823.7A, priority date 2019-10-30, filing date 2019-10-30: Method and device for generating personalized voice

Publications (1)

Publication Number Publication Date
CN110767210A 2020-02-07

Family

ID=69334723

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911046823.7A (pending), priority date 2019-10-30, filing date 2019-10-30: Method and device for generating personalized voice

Country Status (1)

Country Link
CN (1) CN110767210A (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140114663A1 (en) * 2012-10-19 2014-04-24 Industrial Technology Research Institute Guided speaker adaptive speech synthesis system and method and computer program product
JP2015018080A (en) * 2013-07-10 2015-01-29 日本電信電話株式会社 Speech synthesis model learning device and speech synthesis device, and method and program thereof
CN107481713A (en) * 2017-07-17 2017-12-15 清华大学 A kind of hybrid language phoneme synthesizing method and device
CN107564511A (en) * 2017-09-25 2018-01-09 平安科技(深圳)有限公司 Electronic installation, phoneme synthesizing method and computer-readable recording medium
CN110379411A (en) * 2018-04-11 2019-10-25 阿里巴巴集团控股有限公司 For the phoneme synthesizing method and device of target speaker
CN109346056A (en) * 2018-09-20 2019-02-15 中国科学院自动化研究所 Phoneme synthesizing method and device based on depth measure network
CN110148398A (en) * 2019-05-16 2019-08-20 平安科技(深圳)有限公司 Training method, device, equipment and the storage medium of speech synthesis model

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111326138A (en) * 2020-02-24 2020-06-23 北京达佳互联信息技术有限公司 Voice generation method and device
WO2021169825A1 (en) * 2020-02-25 2021-09-02 阿里巴巴集团控股有限公司 Speech synthesis method and apparatus, device and storage medium
CN111462728A (en) * 2020-03-31 2020-07-28 北京字节跳动网络技术有限公司 Method, apparatus, electronic device and computer readable medium for generating speech
CN111462727A (en) * 2020-03-31 2020-07-28 北京字节跳动网络技术有限公司 Method, apparatus, electronic device and computer readable medium for generating speech
CN111739536A (en) * 2020-05-09 2020-10-02 北京捷通华声科技股份有限公司 Audio processing method and device
CN111785258A (en) * 2020-07-13 2020-10-16 四川长虹电器股份有限公司 Personalized voice translation method and device based on speaker characteristics
CN111785258B (en) * 2020-07-13 2022-02-01 四川长虹电器股份有限公司 Personalized voice translation method and device based on speaker characteristics
WO2022094740A1 (en) * 2020-11-03 2022-05-12 Microsoft Technology Licensing, Llc Controlled training and use of text-to-speech models and personalized model generated voices
CN112687296A (en) * 2021-03-10 2021-04-20 北京世纪好未来教育科技有限公司 Audio disfluency identification method, device, equipment and readable storage medium
CN113409767A (en) * 2021-05-14 2021-09-17 北京达佳互联信息技术有限公司 Voice processing method and device, electronic equipment and storage medium
CN113488057A (en) * 2021-08-18 2021-10-08 山东新一代信息产业技术研究院有限公司 Health-oriented conversation implementation method and system
CN113488057B (en) * 2021-08-18 2023-11-14 山东新一代信息产业技术研究院有限公司 Conversation realization method and system for health care


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20200207)