CN104123933A - Self-adaptive non-parallel training based voice conversion method - Google Patents


Info

Publication number
CN104123933A
CN104123933A (application CN201410377091.0A)
Authority
CN
China
Prior art keywords
speaker
voice
model
parameter
speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410377091.0A
Other languages
Chinese (zh)
Inventor
王飞跃 (Wang Feiyue)
孔庆杰 (Kong Qingjie)
熊刚 (Xiong Gang)
朱凤华 (Zhu Fenghua)
朱春雷 (Zhu Chunlei)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation, Chinese Academy of Sciences
Original Assignee
Institute of Automation, Chinese Academy of Sciences
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation, Chinese Academy of Sciences
Priority to CN201410377091.0A
Publication of CN104123933A
Legal status: Pending


Abstract

The invention discloses a voice conversion method based on adaptive non-parallel training. The method comprises the steps of: detecting valid speech signals in collected speech samples and preprocessing them; extracting speech feature parameters from the preprocessed valid speech signals; training a universal background model (UBM) on the speech feature parameters to obtain a speaker-independent UBM; deriving speaker-dependent independent speaker models from the UBM, and obtaining conversion functions for the spectral parameters and fundamental frequency parameters from those independent speaker models; feeding the speech feature parameters of the speech to be converted into the conversion functions to obtain the converted target-speaker feature parameters; and synthesizing the converted target-speaker feature parameters into the target speech. The method offers both good conversion performance and good system extensibility.

Description

Voice conversion method based on adaptive non-parallel training
Technical field
The present invention relates to the fields of speech signal analysis, speech signal processing, voice conversion and speech synthesis, and specifically to a voice conversion method based on adaptive non-parallel training, belonging to the voice conversion branch of the speech signal processing field.
Background art
Voice conversion refers to changing a speaker's personal characteristics while keeping the semantic content unchanged, so that the source speaker's speech sounds as if it were uttered by the target speaker after conversion. Voice conversion is a deep development of speech synthesis and speech recognition technology; as a new branch of speech signal processing, it has high theoretical research value and broad application prospects. It draws on knowledge from fields such as speech analysis and synthesis, speech recognition, speech coding and decoding, speech enhancement, and speaker verification and identification, which provide technical support for its development; in turn, research on voice conversion promotes the development of these fields and provides a valuable reference for their further study.
At present, voice conversion can be divided into two broad classes: conversion within the same language and conversion across languages. For conversion within the same language, the training stage is further divided, according to the choice of corpus, into parallel-corpus training and non-parallel-corpus training. For conversion across languages, a parallel corpus is impossible to obtain, so only non-parallel training is available. Through several generations of effort, voice conversion research has developed considerably, and many scholars have proposed different conversion methods; in summary, they fall roughly into the following classes: vector quantization, multivariate linear regression, artificial neural networks, multi-speaker interpolation, and Gaussian mixture models. However, the above methods are all based on parallel-corpus joint training, which causes several problems in practice: 1. in many situations a parallel corpus is very hard, or even impossible, to obtain; 2. training on joint feature vectors is computationally very expensive and demands highly accurate alignment of phonetic units; 3. the joint-training approach to the combined speech model makes system extension inconvenient and very inflexible.
In view of these problems, researchers have in recent years studied voice conversion under non-parallel corpora, but most of these methods merely lift the parallel-corpus restriction while still adopting joint training, and fail to solve the second and third problems. For example, Mouchtaris et al., in "Nonparallel training for voice conversion based on a parameter adaptation approach" (IEEE Transactions on Audio, Speech and Language Processing, vol. 14, no. 3, 2006), adopted a parameter adaptation method to convert spectral envelopes; Tao Jianhua et al., in "Supervisory Data Alignment for Text-Independent Voice Conversion" (IEEE Transactions on Audio, Speech and Language Processing, vol. 18, no. 5, 2010), proposed a supervised data-alignment method to achieve voice conversion on non-parallel corpora; Ling-Hui Chen et al., in "Non-Parallel Training For Voice Conversion Based On FT-GMM" (IEEE International Conference on Acoustics, Speech and Signal Processing, 2011), studied non-parallel training using a feature-transform Gaussian mixture model (FT-GMM); and Daojian Zeng et al., in "Voice Conversion Using Structured Gaussian Mixture Model" (2010 IEEE 10th International Conference on Signal Processing), used a structured Gaussian mixture model to achieve voice conversion based on independent speaker models.
Because parallel-corpus voice conversion methods are subject to all of the above constraints, voice conversion technology has struggled to reach full practical application. Obtaining independent speaker models by non-parallel training, converting the source speaker's personal characteristic parameters, and adding the target speaker's personal characteristics to realize source-to-target conversion would therefore be a major contribution to the development of the voice conversion field.
Summary of the invention
To overcome the above deficiencies of the prior art, the present invention provides a new voice conversion method with non-parallel corpus training, so as to solve the following problems of parallel-corpus joint-training methods: 1. traditional voice conversion systems require parallel-corpus training to obtain the conversion function, yet parallel corpora are difficult to obtain; 2. traditional systems must perform joint training on feature vectors; 3. traditional systems are inconvenient to extend.
The method of the invention first extracts the fundamental frequency and short-time spectrum of all speech signals and derives the corresponding LPCC parameters from the short-time spectra; it then trains a universal background model (UBM: Universal Background Model) on all feature parameters, derives the specific speaker models by maximum a posteriori (MAP: Maximum a Posteriori Probability) adaptation, and finally obtains the corresponding conversion functions to perform voice conversion.
Specifically, the adaptive non-parallel-training voice conversion method proposed by the present invention comprises the following steps:
Step 1: detect valid speech signals in the collected speech samples, and preprocess the valid speech signals;
Step 2: extract speech feature parameters from the preprocessed valid speech signals;
Step 3: perform UBM training on the speech feature parameters to obtain a speaker-independent UBM;
Step 4: based on the UBM, obtain the speaker-dependent independent speaker models, and from these models obtain the conversion functions for the spectral parameters and fundamental frequency parameters;
Step 5: feed the speech feature parameters of the speech to be converted into the conversion functions obtained in step 4 to obtain the converted target-speaker feature parameters;
Step 6: synthesize the converted target-speaker feature parameters to obtain the target speech.
Compared with the prior art, the advantages of the invention are as follows:
Traditional voice conversion methods mostly use a parallel corpus to train a joint source-target speaker model and derive the conversion function from it. In practice, however, a fully parallel corpus is difficult to obtain, training the joint model consumes a large amount of computation, and the system is inconvenient to extend. The present invention avoids the harsh corpus requirements of parallel training: it trains and converts with non-parallel corpora, requires no joint training, and extends flexibly.
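Purely as an illustrative outline (the patent itself contains no code), the six steps can be organized as the following Python sketch. Every helper name here is a hypothetical placeholder; concrete sketches of the individual stages are given in the detailed description below.

```python
# Outline of steps 1-6; all helper names are hypothetical placeholders.

def convert_voice(background_utts, source_utts, target_utts, utt_to_convert):
    # Steps 1-2: detect valid speech, preprocess, extract F0 + LPCC features
    bg_feats = [extract_features(preprocess(detect_speech(u))) for u in background_utts]

    # Step 3: train the speaker-independent UBM on the pooled features
    ubm = train_ubm(bg_feats)

    # Step 4: MAP-adapt speaker-dependent models from the UBM and derive
    # the spectral and F0 conversion functions from the adapted models
    src_model = map_adapt(ubm, features_of(source_utts))
    tgt_model = map_adapt(ubm, features_of(target_utts))
    spec_fn, f0_fn = build_conversion_functions(src_model, tgt_model)

    # Step 5: convert the feature parameters of the input utterance
    f0, lpcc = extract_features(preprocess(detect_speech(utt_to_convert)))

    # Step 6: resynthesize target speech (via STRAIGHT in the embodiment)
    return synthesize(spec_fn(lpcc), f0_fn(f0))
```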
Brief description of the drawings
Fig. 1 is a flowchart of the optimized adaptive non-parallel-training voice conversion method of the present invention;
Fig. 2 is a schematic diagram of the derivation of the spectral-parameter conversion function of the present invention.
Detailed description
To make the object, technical solutions and advantages of the present invention clearer, the present invention is described in more detail below in conjunction with specific embodiments and with reference to the accompanying drawings.
Fig. 1 is a flowchart of the optimized adaptive non-parallel-training voice conversion method adopted by the present invention. As shown in Fig. 1, the method comprises the following steps:
Step 1: detect valid speech signals in the collected speech samples, and preprocess the valid speech signals.
In an embodiment of the present invention, the preprocessing comprises pre-emphasis, Hamming windowing, framing, and the like.
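A minimal sketch of this preprocessing stage with NumPy follows; the pre-emphasis coefficient 0.97 and the 25 ms / 10 ms framing are common defaults assumed here, not values stated in the patent.

```python
import numpy as np

def preprocess(signal, fs, alpha=0.97, frame_ms=25, hop_ms=10):
    """Pre-emphasis, framing, and Hamming windowing (assumed typical settings).
    Assumes the signal is at least one frame long."""
    # Pre-emphasis: y[n] = x[n] - alpha * x[n-1]
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])

    frame_len = int(fs * frame_ms / 1000)
    hop_len = int(fs * hop_ms / 1000)
    n_frames = 1 + (len(emphasized) - frame_len) // hop_len

    window = np.hamming(frame_len)
    frames = np.stack([
        emphasized[i * hop_len : i * hop_len + frame_len] * window
        for i in range(n_frames)
    ])
    return frames  # shape: (n_frames, frame_len)
```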
Step 2: extract speech feature parameters from the preprocessed valid speech signals.
The speech feature parameters may be the fundamental frequency, linear prediction cepstral coefficients (LPCC), Mel-frequency cepstral coefficients (MFCC), line spectral pairs (LSP), and other speech feature parameters.
In an embodiment of the present invention, the per-frame fundamental frequency F0 and short-time spectral parameters of all valid speech signals are obtained with the STRAIGHT platform; the LPC coefficients of each frame are computed from the short-time spectral parameters by the Levinson-Durbin algorithm, and the LPC coefficients are then converted into LPCC coefficients, yielding the speech feature parameters of all speakers participating in training. The fundamental frequency model for F0 is described by a single Gaussian distribution.
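The LPC-to-LPCC step can be sketched with the standard recursion. The sign convention assumed below is A(z) = 1 − Σ aₖ z⁻ᵏ (LPC coefficients as obtained, e.g., from the Levinson-Durbin recursion); this is an illustration, not the patent's implementation.

```python
import numpy as np

def lpc_to_lpcc(a, n_ceps):
    """Convert LPC coefficients a[0..p-1] (predictor convention
    A(z) = 1 - sum_k a_k z^-k) to n_ceps LPCC coefficients via
    c_n = a_n + (1/n) * sum_{k=1}^{n-1} k * c_k * a_{n-k}."""
    p = len(a)
    c = np.zeros(n_ceps + 1)          # c[0] unused; 1-indexed cepstra
    for n in range(1, n_ceps + 1):
        acc = a[n - 1] if n <= p else 0.0
        for k in range(max(1, n - p), n):   # only terms with 1 <= n-k <= p
            acc += (k / n) * c[k] * a[n - k - 1]
        c[n] = acc
    return c[1:]
```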
Step 3: perform UBM training on the speech feature parameters to obtain a speaker-independent UBM.
In this step, when performing UBM training, the inter-speaker differences and the sizes of each speaker's training corpus are first balanced; the speech feature parameters of all training speakers are then merged, and the UBM is obtained by training with the EM algorithm. In the initial UBM, the weight of each component is initialized to 1/M, where M is the number of Gaussian mixture components in the UBM.
The UBM (universal background model) is a speaker-independent global background model. It is in essence a large-scale Gaussian mixture model (GMM), generally trained on the corpora of a large number of speakers. The idea is to gather all speakers' information into the supervector formed by the mixture of Gaussian density functions; the model reflects the statistically averaged distribution of all speakers' voice characteristics and thereby eliminates personal characteristics. As a reference model, the UBM covers multiple subspaces, each corresponding to a cluster center described by a Gaussian probability density function and representing a part of the feature space.
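A minimal sketch of UBM training, using scikit-learn's GaussianMixture as the EM engine. The component count M = 64 and the diagonal covariances are illustrative assumptions, not values from the patent; the 1/M weight initialization follows the step above.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm(feature_matrices, n_components=64):
    """Train a speaker-independent UBM (a large diagonal-covariance GMM)
    on the pooled features of all training speakers via EM."""
    pooled = np.vstack(feature_matrices)   # merge all speakers' frames
    ubm = GaussianMixture(
        n_components=n_components,
        covariance_type="diag",
        # Initialize every component weight to 1/M, as in the patent
        weights_init=np.full(n_components, 1.0 / n_components),
        max_iter=200,
    )
    ubm.fit(pooled)
    return ubm
```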
Step 4: based on the UBM, obtain the speaker-dependent independent speaker models, and from these models obtain the conversion functions for the spectral parameters and fundamental frequency parameters.
Step 4 further comprises the following steps:
Step 41: preprocess the source speaker's and target speaker's training speech respectively;
In an embodiment of the present invention, the preprocessing comprises pre-emphasis, Hamming windowing, framing, and the like.
Step 42: extract the LPCC parameters and fundamental frequency parameters of both speakers respectively;
Step 43: based on the LPCC parameters, obtain the source speaker's and the target speaker's GMMs from the UBM respectively;
In an embodiment of the present invention, the source and target speakers' GMMs are obtained from the UBM by MAP adaptation.
Each speaker's GMM is described by mean vectors, covariance matrices and mixture weights, expressed as $\lambda = \{\omega_i, \mu_i, \Sigma_i\},\ i = 1, 2, \dots, M$, with $\sum_{i=1}^{M} \omega_i = 1$, where $\omega_i$ denotes the mixture weight, $\mu_i$ the mean vector, and $\Sigma_i$ the covariance matrix of the $i$-th component; $M$ is the order of the GMM, and the probability density function of an $M$-order GMM is the weighted sum of $M$ Gaussian probability density functions.
Suppose a UBM $\lambda = \{\omega_i, \mu_i, \Sigma_i\}$ has been trained, and a speaker's feature vectors are $X = \{x_1, \dots, x_t, \dots, x_T\}$. The concrete steps of obtaining the source and target speakers' GMMs from the UBM by MAP adaptation are as follows:
First, compute the posterior weight of each Gaussian component of the model:
$$\Pr(i \mid x_t) = \frac{\omega_i\, p_i(x_t \mid \mu_i, \Sigma_i)}{\sum_{j=1}^{M} \omega_j\, p_j(x_t \mid \mu_j, \Sigma_j)},$$
where $x_t$ denotes the feature vector at frame $t$, $\omega_i$ the weight of Gaussian component $i$, $p_i(x_t \mid \mu_i, \Sigma_i)$ the Gaussian probability density of component $i$ with mean vector $\mu_i$ and covariance matrix $\Sigma_i$, and likewise $\omega_j$, $p_j$ for component $j$.
Then, use the weights $\Pr(i \mid x_t)$ and the feature vectors $x_t$ to compute the statistic $n_i$ used to update the means and variances:
$$n_i = \sum_{t=1}^{T} \Pr(i \mid x_t),$$
where $T$ is the length of the training sequence (i.e., the number of frames).
Finally, use the statistic $n_i$ and the old UBM parameters to update the mean and variance of each Gaussian component $i$, thereby obtaining the source and target speakers' GMMs. The update formulas are:
$$\hat{\mu}_i = \frac{\sum_{t=1}^{T} \Pr(i \mid x_t)\, x_t + \tau \mu_i}{n_i + \tau},$$
$$\hat{\sigma}_i^2 = \frac{\tau \sigma_i^2 + \tau (\hat{\mu}_i - \mu_i)(\hat{\mu}_i - \mu_i)^{\mathrm{T}} + \sum_{t=1}^{T} \Pr(i \mid x_t)(x_t - \hat{\mu}_i)(x_t - \hat{\mu}_i)^{\mathrm{T}}}{n_i + \tau},$$
where $\hat{\mu}_i$ is the updated mean of Gaussian component $i$, $\hat{\sigma}_i^2$ its updated variance, $\sigma_i^2$ its original variance, and $\tau$ a fixed relevance factor; in an embodiment of the present invention, $\tau = 20$.
From the above, the speaker-dependent GMMs are obtained from the trained UBM by adaptation. Since all of these models are adapted from the same UBM baseline, each of their components is consistent with the corresponding component of the UBM, so the components of the different models are automatically aligned in order. The conversion between models thus reduces to conversions between corresponding Gaussian components, from which the conversion functions for the spectral parameters and fundamental frequency can be derived.
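The MAP adaptation above can be sketched as follows for a diagonal-covariance scikit-learn UBM; it implements the $\Pr(i \mid x_t)$, $n_i$, mean and variance updates with the embodiment's relevance factor τ = 20. The vectorized second-order statistic is an implementation choice assumed here.

```python
import copy
import numpy as np

def map_adapt(ubm, X, tau=20.0):
    """MAP-adapt the UBM means and variances to one speaker's features X
    (shape: n_frames x dim). Weights are kept from the UBM, so every adapted
    component stays aligned with the same UBM component (diag covariances)."""
    post = ubm.predict_proba(X)            # Pr(i | x_t), shape (T, M)
    n_i = post.sum(axis=0)                 # occupancy statistic n_i, shape (M,)
    sum_x = post.T @ X                     # sum_t Pr(i|x_t) x_t, shape (M, dim)
    mu_old = ubm.means_
    var_old = ubm.covariances_             # diagonal variances, shape (M, dim)

    # Mean update: (sum_t Pr(i|x_t) x_t + tau * mu_i) / (n_i + tau)
    mu_new = (sum_x + tau * mu_old) / (n_i[:, None] + tau)

    # sum_t Pr(i|x_t) (x_t - mu_new)^2, expanded for vectorization
    sum_sq = post.T @ (X ** 2) - 2 * mu_new * sum_x + n_i[:, None] * mu_new ** 2
    var_new = (tau * var_old + tau * (mu_new - mu_old) ** 2 + sum_sq) \
              / (n_i[:, None] + tau)

    gmm = copy.deepcopy(ubm)               # keep UBM weights and alignment
    gmm.means_ = mu_new
    gmm.covariances_ = var_new
    gmm.precisions_cholesky_ = 1.0 / np.sqrt(var_new)  # keep sklearn consistent
    return gmm
```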
Step 44: compute the mean and variance of the fundamental frequency parameters, and model them with a single Gaussian;
Step 45: from the GMMs obtained in step 43 and the fundamental frequency model obtained in step 44, obtain the conversion functions for the spectral parameters and fundamental frequency parameters.
In this step, the conversion function of the spectral parameters is derived as follows:
Fig. 2 depicts the relation between two corresponding Gaussian components of the source and target speakers' personal characteristics, denoted $(\mu_s, \sigma_s)$ and $(\mu_t, \sigma_t)$ respectively; $X$ denotes the source speaker's spectral parameter to be converted, and $Y$ the converted target-speaker spectral parameter. From Fig. 2 one can derive
$$\frac{Y - \mu_t}{X - \mu_s} = \frac{\sigma_t}{\sigma_s},$$
and therefore
$$Y = \mu_t + \frac{\sigma_t}{\sigma_s}\,(X - \mu_s).$$
Taking the weighted sum over all Gaussian components, the conversion function of the spectral parameters is expressed as
$$F(X) = \sum_{i=1}^{Q} p_i(X)\left[\mu_i^{T} + \Sigma_i^{T}\left(\Sigma_i^{S}\right)^{-1}\left(X - \mu_i^{S}\right)\right],$$
where $p_i(X)$ is the posterior probability of the $i$-th Gaussian component of the source speaker's GMM, $Q$ is the number of Gaussian components, $\mu_i^{S}$ and $\Sigma_i^{S}$ are the mean and covariance matrix of the $i$-th component of the source speaker's GMM, and $\mu_i^{T}$ and $\Sigma_i^{T}$ those of the target speaker's GMM.
In this step, a Gaussian model transformation method is used to obtain the conversion function of the fundamental frequency parameters. The method assumes that both the source speaker's and the target speaker's fundamental frequencies follow normal distributions, and the conversion function of the fundamental frequency parameters is expressed as
$$F0_T = \mu_T + \frac{\sigma_T}{\sigma_S}\left(F0_S - \mu_S\right),$$
where $\mu_S$ and $\mu_T$ are the means of the source and target speakers' fundamental frequencies, $\sigma_S$ and $\sigma_T$ their standard deviations, and $F0_S$ the fundamental frequency of the source speaker's speech.
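The F0 conversion is a single-Gaussian normalization and can be sketched directly from the formula. Passing unvoiced frames (conventionally coded as 0) through unchanged is an assumption here, since the patent does not address voiced/unvoiced handling:

```python
import numpy as np

def convert_f0(f0_src, mu_s, sigma_s, mu_t, sigma_t):
    """Map source F0 to the target speaker's F0 distribution:
    F0_T = mu_T + (sigma_T / sigma_S) * (F0_S - mu_S).
    Unvoiced frames (F0 == 0) are left at zero."""
    f0_src = np.asarray(f0_src, dtype=float)
    voiced = f0_src > 0
    out = np.zeros_like(f0_src)
    out[voiced] = mu_t + (sigma_t / sigma_s) * (f0_src[voiced] - mu_s)
    return out
```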
Step 5: feed the speech feature parameters of the speech to be converted into the conversion functions obtained in step 4 to obtain the converted target-speaker feature parameters.
This step further comprises the following steps:
Step 51: extract the short-time spectrum and fundamental frequency F0 of the source speaker's speech to be converted;
In an embodiment of the present invention, STRAIGHT is used to extract the short-time spectrum and fundamental frequency F0 of the source speaker's speech to be converted.
Step 52: extract the LPCC parameters from the short-time spectral envelope;
Step 53: convert the source speaker's LPCC parameters and fundamental frequency F0 using the spectral-parameter and fundamental-frequency conversion functions respectively, obtaining the target speaker's LPCC parameters and fundamental frequency parameters.
Step 6: synthesize the converted target-speaker feature parameters to obtain the target speech.
This step further comprises the following steps:
Step 61: re-estimate the target speaker's short-time spectral envelope from the converted LPCC parameters;
Step 62: combine the short-time spectral envelope with the converted fundamental frequency F0 to obtain speech with the target speaker's characteristics.
In step 62, the short-time spectral envelope and the converted fundamental frequency F0 are synthesized by the STRAIGHT platform.
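STRAIGHT itself is an external analysis/synthesis tool, but the LPCC-to-envelope re-estimation of step 61 can be sketched by inverting the LPC-to-LPCC recursion above and evaluating the all-pole spectrum. This is an assumed implementation, since the patent does not spell the recursion out:

```python
import numpy as np

def lpcc_to_lpc(c, p):
    """Invert the LPC->LPCC recursion to recover p LPC coefficients:
    a_n = c_n - (1/n) * sum_{k=1}^{n-1} k * c_k * a_{n-k}."""
    cc = np.concatenate(([0.0], c))        # 1-indexed cepstra
    a = np.zeros(p)
    for n in range(1, p + 1):
        acc = cc[n]
        for k in range(1, n):
            acc -= (k / n) * cc[k] * a[n - k - 1]
        a[n - 1] = acc
    return a

def lpc_to_envelope(a, n_fft=1024):
    """All-pole spectral envelope |1 / A(e^jw)| from LPC coefficients,
    with A(z) = 1 - sum_k a_k z^-k."""
    A = np.fft.rfft(np.concatenate(([1.0], -a)), n=n_fft)
    return 1.0 / np.abs(A)
```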
The specific embodiments described above further explain the object, technical solutions and beneficial effects of the present invention. It should be understood that the above are only specific embodiments of the present invention and are not intended to limit it; any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims (10)

1. A voice conversion method based on adaptive non-parallel training, characterized in that the method comprises the following steps:
Step 1: detect valid speech signals in the collected speech samples, and preprocess the valid speech signals;
Step 2: extract speech feature parameters from the preprocessed valid speech signals;
Step 3: perform UBM training on the speech feature parameters to obtain a speaker-independent UBM;
Step 4: based on the UBM, obtain the speaker-dependent independent speaker models, and from these models obtain the conversion functions for the spectral parameters and fundamental frequency parameters;
Step 5: feed the speech feature parameters of the speech to be converted into the conversion functions obtained in step 4 to obtain the converted target-speaker feature parameters;
Step 6: synthesize the converted target-speaker feature parameters to obtain the target speech.
2. The method according to claim 1, characterized in that the preprocessing includes but is not limited to pre-emphasis, Hamming windowing and framing.
3. The method according to claim 1, characterized in that the speech feature parameters include but are not limited to the fundamental frequency, linear prediction cepstral coefficients (LPCC), Mel-frequency cepstral coefficients (MFCC) and line spectral pairs (LSP).
4. The method according to claim 1, characterized in that, in step 2, the fundamental frequency F0 and short-time spectral parameters of each frame of valid speech signal are first obtained; the LPC coefficients of each frame are then computed from the short-time spectral parameters; and the LPC coefficients are then converted into LPCC coefficients.
5. The method according to claim 1, characterized in that, in step 3, when performing UBM training, the inter-speaker differences and the sizes of each speaker's training corpus are first balanced; the speech feature parameters of all training speakers are then merged, and the UBM is obtained by training with the EM algorithm.
6. The method according to claim 1, characterized in that step 4 further comprises the following steps:
Step 41: preprocess the source speaker's and target speaker's training speech respectively;
Step 42: extract the LPCC parameters and fundamental frequency parameters of both speakers respectively;
Step 43: based on the LPCC parameters, obtain the source speaker's and the target speaker's GMMs from the UBM respectively;
Step 44: compute the mean and variance of the fundamental frequency parameters, and model them with a single Gaussian;
Step 45: from the GMMs obtained in step 43 and the fundamental frequency model obtained in step 44, obtain the conversion functions for the spectral parameters and fundamental frequency parameters.
7. The method according to claim 6, characterized in that the source speaker's and target speaker's GMMs are obtained from the UBM by MAP adaptation.
8. The method according to claim 6, characterized in that the conversion function of the spectral parameters is expressed as
$$F(X) = \sum_{i=1}^{Q} p_i(X)\left[\mu_i^{T} + \Sigma_i^{T}\left(\Sigma_i^{S}\right)^{-1}\left(X - \mu_i^{S}\right)\right],$$
where $p_i(X)$ is the posterior probability of the $i$-th Gaussian component of the source speaker's GMM, $Q$ is the number of Gaussian components, $\mu_i^{S}$ and $\Sigma_i^{S}$ are the mean and covariance matrix of the $i$-th component of the source speaker's GMM, and $\mu_i^{T}$ and $\Sigma_i^{T}$ those of the target speaker's GMM;
and the conversion function of the fundamental frequency parameters is expressed as
$$F0_T = \mu_T + \frac{\sigma_T}{\sigma_S}\left(F0_S - \mu_S\right),$$
where $\mu_S$ and $\mu_T$ are the means of the source and target speakers' fundamental frequencies, $\sigma_S$ and $\sigma_T$ their standard deviations, and $F0_S$ the fundamental frequency of the source speaker's speech.
9. The method according to claim 1, characterized in that step 5 further comprises the following steps:
Step 51: extract the short-time spectrum and fundamental frequency F0 of the source speaker's speech to be converted;
Step 52: extract the LPCC parameters from the short-time spectral envelope;
Step 53: convert the source speaker's LPCC parameters and fundamental frequency F0 using the spectral-parameter and fundamental-frequency conversion functions respectively, obtaining the target speaker's LPCC parameters and fundamental frequency parameters.
10. The method according to claim 1, characterized in that step 6 further comprises the following steps:
Step 61: re-estimate the target speaker's short-time spectral envelope from the converted LPCC parameters;
Step 62: combine the short-time spectral envelope with the converted fundamental frequency F0 to obtain speech with the target speaker's characteristics.
CN201410377091.0A 2014-08-01 2014-08-01 Self-adaptive non-parallel training based voice conversion method Pending CN104123933A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410377091.0A CN104123933A (en) 2014-08-01 2014-08-01 Self-adaptive non-parallel training based voice conversion method


Publications (1)

Publication Number Publication Date
CN104123933A true CN104123933A (en) 2014-10-29

Family

ID=51769323

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410377091.0A Pending CN104123933A (en) 2014-08-01 2014-08-01 Self-adaptive non-parallel training based voice conversion method

Country Status (1)

Country Link
CN (1) CN104123933A (en)



Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101751921A (en) * 2009-12-16 2010-06-23 南京邮电大学 Real-time voice conversion method under conditions of minimal amount of training data
CN102063899A (en) * 2010-10-27 2011-05-18 南京邮电大学 Method for voice conversion under unparallel text condition
CN103021418A (en) * 2012-12-13 2013-04-03 南京邮电大学 Voice conversion method facing to multi-time scale prosodic features
CN103280224A (en) * 2013-04-24 2013-09-04 东南大学 Voice conversion method under asymmetric corpus condition on basis of adaptive algorithm

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
朱春雷 (Zhu Chunlei): "Research on optimized adaptive non-parallel training voice conversion algorithms" (优化自适应非平行训练语音转换算法研究), China Master's Theses Full-Text Database, Information Science and Technology *

Cited By (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104464744A (en) * 2014-11-19 2015-03-25 河海大学常州校区 Cluster voice transforming method and system based on mixture Gaussian random process
CN105390141A (en) * 2015-10-14 2016-03-09 科大讯飞股份有限公司 Sound conversion method and sound conversion device
CN105390141B (en) * 2015-10-14 2019-10-18 科大讯飞股份有限公司 Sound converting method and device
CN105895080A (en) * 2016-03-30 2016-08-24 乐视控股(北京)有限公司 Voice recognition model training method, speaker type recognition method and device
CN106448673A (en) * 2016-09-18 2017-02-22 广东顺德中山大学卡内基梅隆大学国际联合研究院 Chinese electrolarynx speech conversion method
CN106448673B * 2016-09-18 2019-12-10 广东顺德中山大学卡内基梅隆大学国际联合研究院 Chinese electrolarynx speech conversion method
WO2018223796A1 (en) * 2017-06-07 2018-12-13 腾讯科技(深圳)有限公司 Speech recognition method, storage medium, and speech recognition device
CN107507619A (en) * 2017-09-11 2017-12-22 厦门美图之家科技有限公司 Phonetics transfer method, device, electronic equipment and readable storage medium storing program for executing
CN107507619B (en) * 2017-09-11 2021-08-20 厦门美图之家科技有限公司 Voice conversion method and device, electronic equipment and readable storage medium
CN108108357A (en) * 2018-01-12 2018-06-01 京东方科技集团股份有限公司 Accent conversion method and device, electronic equipment
CN108108357B (en) * 2018-01-12 2022-08-09 京东方科技集团股份有限公司 Accent conversion method and device and electronic equipment
CN108777140A (en) * 2018-04-27 2018-11-09 南京邮电大学 Phonetics transfer method based on VAE under a kind of training of non-parallel corpus
CN108766465A (en) * 2018-06-06 2018-11-06 华中师范大学 A kind of digital audio based on ENF universal background models distorts blind checking method
CN108766465B (en) * 2018-06-06 2020-07-28 华中师范大学 Digital audio tampering blind detection method based on ENF general background model
CN109326283A (en) * 2018-11-23 2019-02-12 南京邮电大学 Multi-to-multi phonetics transfer method under non-parallel text condition based on text decoder
CN109326283B (en) * 2018-11-23 2021-01-26 南京邮电大学 Many-to-many voice conversion method based on text encoder under non-parallel text condition
CN109377986A (en) * 2018-11-29 2019-02-22 四川长虹电器股份有限公司 A kind of non-parallel corpus voice personalization conversion method
CN109377986B (en) * 2018-11-29 2022-02-01 四川长虹电器股份有限公司 Non-parallel corpus voice personalized conversion method
CN110164414B (en) * 2018-11-30 2023-02-14 腾讯科技(深圳)有限公司 Voice processing method and device and intelligent equipment
CN110164414A (en) * 2018-11-30 2019-08-23 腾讯科技(深圳)有限公司 Method of speech processing, device and smart machine
CN109584893A (en) * 2018-12-26 2019-04-05 南京邮电大学 Based on the multi-to-multi speech conversion system of VAE and i-vector under non-parallel text condition
CN109584893B (en) * 2018-12-26 2021-09-14 南京邮电大学 VAE and i-vector based many-to-many voice conversion system under non-parallel text condition
CN109599091A (en) * 2019-01-14 2019-04-09 南京邮电大学 Multi-to-multi voice conversion method based on STARWGAN-GP and x vector
CN109599091B (en) * 2019-01-14 2021-01-26 南京邮电大学 Star-WAN-GP and x-vector based many-to-many speaker conversion method
CN110060690A (en) * 2019-04-04 2019-07-26 南京邮电大学 Multi-to-multi voice conversion method based on STARGAN and ResNet

Similar Documents

Publication Publication Date Title
CN104123933A (en) Self-adaptive non-parallel training based voice conversion method
Dave Feature extraction methods LPC, PLP and MFCC in speech recognition
CN102509547B (en) Method and system for voiceprint recognition based on vector quantization
CN109767778B (en) Bi-LSTM and WaveNet fused voice conversion method
CN102779508B (en) Voice library generation apparatus and method therefor, and speech synthesis system and method thereof
CN102332263B (en) Close neighbor principle based speaker recognition method for synthesizing emotional model
Song et al. Noise invariant frame selection: a simple method to address the background noise problem for text-independent speaker verification
CN105593936B (en) System and method for text-to-speech performance evaluation
Kumar et al. Design of an automatic speaker recognition system using MFCC, vector quantization and LBG algorithm
CN102568476B (en) Voice conversion method based on self-organizing feature map network cluster and radial basis network
CN106653056A (en) Fundamental frequency extraction model based on LSTM recurrent neural network and training method thereof
CN102426834B (en) Method for testing rhythm level of spoken English
CN102810311B (en) Speaker estimation method and speaker estimation equipment
Das et al. Bangladeshi dialect recognition using Mel frequency cepstral coefficient, delta, delta-delta and Gaussian mixture model
CN105206257A (en) Voice conversion method and device
CN113506562B (en) End-to-end voice synthesis method and system based on fusion of acoustic features and text emotional features
CN102237083A (en) Portable interpretation system based on WinCE platform and language recognition method thereof
CN103456302A (en) Emotion speaker recognition method based on emotion GMM model weight synthesis
CN110047501A (en) Multi-to-multi phonetics transfer method based on beta-VAE
CN101419800B (en) Emotional speaker recognition method based on frequency spectrum translation
CN106297769B (en) A kind of distinctive feature extracting method applied to languages identification
Gamit et al. Isolated words recognition using MFCC, LPC and neural network
Garg et al. Survey on acoustic modeling and feature extraction for speech recognition
Kaur et al. Genetic algorithm for combined speaker and speech recognition using deep neural networks
Wu et al. The DKU-LENOVO Systems for the INTERSPEECH 2019 Computational Paralinguistic Challenge.

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication (application publication date: 20141029)