CN104123933A - Self-adaptive non-parallel training based voice conversion method - Google Patents
- Publication number: CN104123933A
- Application number: CN201410377091.0A
- Authority: CN (China)
- Legal status: Pending (an assumption; Google has not performed a legal analysis)
Abstract
The invention discloses a voice conversion method based on adaptive non-parallel training. The method includes the steps of: detecting the effective speech signal in the collected speech samples and preprocessing it; extracting speech characteristic parameters from the preprocessed effective speech signal; performing universal background model (UBM) training on the characteristic parameters to obtain a speaker-independent UBM; obtaining independent, speaker-dependent speech models from the UBM, and obtaining the transfer functions of the spectral and fundamental-frequency parameters from those models; inputting the characteristic parameters of the speech to be converted into the transfer functions to obtain the converted characteristic parameters of the target speaker; and synthesizing the converted characteristic parameters to obtain the target speech. The method not only achieves good conversion performance but also offers good system extensibility.
Description
Technical field
The present invention relates to the fields of speech signal analysis, speech signal processing, voice conversion, and speech synthesis, and specifically to a voice conversion method based on adaptive non-parallel training, belonging to the voice conversion branch of the speech signal processing field.
Background technology
Voice conversion changes a speaker's personal characteristics while keeping the semantic content unchanged, so that the source speaker's voice sounds, after conversion, as if it were spoken by the target speaker. Voice conversion is a deepening of speech synthesis and recognition technology, and as a new branch of speech signal processing it has both high theoretical research value and broad application prospects. It draws on knowledge from speech analysis and synthesis, speech recognition, speech coding, speech enhancement, and speaker verification and identification, all of which provide technical support for its development; research on voice conversion in turn promotes progress in these fields and offers valuable reference for their further study.
At present, voice conversion can be divided into two broad classes: conversion within the same language and conversion across languages. For conversion within the same language, depending on the corpus chosen in the training stage, it is further divided into parallel-corpus training and non-parallel-corpus training. For cross-lingual conversion a parallel corpus cannot be obtained, so only non-parallel training is possible. Through several generations of effort, voice conversion research has developed considerably, and many scholars have proposed different conversion methods, roughly falling into the following classes: vector quantization, multivariate linear regression, artificial neural networks, multi-speaker interpolation, Gaussian mixture models, and so on. However, these methods are all based on joint training over a parallel corpus, which raises several practical problems: (1) in many situations a parallel corpus is very difficult or impossible to obtain; (2) training on joint feature vectors is computationally expensive and places high accuracy demands on the alignment of phonetic units; (3) the joint speech model, trained jointly, makes system expansion inconvenient and inflexible.
To address these problems, researchers have in recent years studied voice conversion under non-parallel corpora, but most of these methods only remove the parallel-corpus restriction while still adopting joint training, and thus fail to solve the second and third problems. For example, Mouchtaris et al., in "Nonparallel training for voice conversion based on a parameter adaptation approach" (IEEE Transactions on Audio, Speech and Language Processing, vol. 14, no. 3, 2006), used a parameter adaptation method to convert the spectral envelope; Tao Jianhua et al., in "Supervisory Data Alignment for Text-Independent Voice Conversion" (IEEE Transactions on Audio, Speech and Language Processing, vol. 18, no. 5, 2010), proposed a supervised data-alignment method to realize conversion on non-parallel corpora; Ling-Hui Chen et al., in "Non-Parallel Training For Voice Conversion Based On FT-GMM" (IEEE International Conference on Acoustics, Speech and Signal Processing, 2011), studied non-parallel training with a feature-transformed Gaussian mixture model (FT-GMM); and Daojian Zeng et al., in "Voice Conversion Using Structured Gaussian Mixture Model" (2010 IEEE 10th International Conference on Signal Processing), used a structured Gaussian mixture model to achieve voice conversion based on independent speaker models.
Because parallel-corpus conversion methods are subject to the above constraints, voice conversion technology has found it difficult to move fully into practical application. If independent speaker models could be obtained by non-parallel training, the source speaker's personal characteristic parameters could be removed and the target speaker's added, realizing source-to-target conversion; this would be a great contribution to the development of the voice conversion field.
Summary of the invention
To overcome the above deficiencies of the prior art, the invention provides a new voice conversion method trained on a non-parallel corpus, solving the following problems of parallel-corpus joint-training methods: (1) traditional conversion systems require parallel-corpus training to obtain the transfer function, and parallel corpora are difficult to obtain; (2) traditional systems must jointly train the feature vectors; (3) traditional systems are inconvenient to extend.
The method of the invention first extracts the fundamental frequency and short-time spectrum of all speech signals and derives the corresponding LPCC parameters from the short-time spectrum; it then trains a universal background model (UBM) on all characteristic parameters, derives speaker-specific models by maximum a posteriori (MAP) adaptation, and finally obtains the corresponding transfer functions to perform the conversion.
Specifically, the adaptive non-parallel training voice conversion method proposed by the invention comprises the following steps:
Step 1: detect the effective speech signal in the collected speech samples and preprocess it;
Step 2: extract speech characteristic parameters from the preprocessed effective speech signal;
Step 3: perform UBM training on the characteristic parameters to obtain a speaker-independent UBM;
Step 4: from the UBM, obtain the independent, speaker-dependent speech models, and from these models obtain the transfer functions of the spectral and fundamental-frequency parameters;
Step 5: input the characteristic parameters of the speech to be converted into the transfer functions obtained in step 4 to get the converted characteristic parameters of the target speaker;
Step 6: synthesize the converted characteristic parameters of the target speaker to obtain the target speech.
Compared with the prior art, the invention has the following advantages:
Traditional conversion methods mostly use a parallel corpus to train a joint source-target speech model and derive the conversion function from it; but in practice a fully parallel corpus is hard to obtain, training the joint model consumes a large amount of computation, and the system is inconvenient to extend. The invention avoids the harsh corpus requirements of parallel training: it trains and converts on a non-parallel corpus, needs no joint training, and extends flexibly.
Accompanying drawing explanation
Fig. 1 is a flowchart of the adaptive non-parallel training voice conversion method of the invention;
Fig. 2 is a schematic diagram of the derivation of the spectral-parameter transfer function of the invention.
Embodiment
To make the object, technical solution, and advantages of the present invention clearer, the invention is described in more detail below with reference to specific embodiments and the accompanying drawings.
Fig. 1 is a flowchart of the adaptive non-parallel training voice conversion method adopted by the invention. As shown in Fig. 1, the method comprises the following steps.
Step 1: detect the effective speech signal in the collected speech samples and preprocess it.
In an embodiment of the invention, the preprocessing includes pre-emphasis, Hamming windowing, framing, and similar operations.
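The preprocessing chain of step 1 (pre-emphasis, framing, Hamming windowing) can be sketched as follows. This is an illustrative NumPy sketch, not the patented implementation; the function name, frame length (400 samples), and hop size (160 samples) are assumptions, chosen as typical values for a 16 kHz signal.

```python
import numpy as np

def preprocess(signal, frame_len=400, hop=160, alpha=0.97):
    """Pre-emphasis, framing, and Hamming windowing (step 1 preprocessing)."""
    # Pre-emphasis: boost high frequencies, y[n] = x[n] - alpha * x[n-1]
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    # Split into overlapping frames
    n_frames = 1 + (len(emphasized) - frame_len) // hop
    frames = np.stack([emphasized[i * hop: i * hop + frame_len]
                       for i in range(n_frames)])
    # Apply a Hamming window to each frame
    return frames * np.hamming(frame_len)
```

The output is a matrix of windowed frames ready for the feature extraction of step 2.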
Step 2: extract speech characteristic parameters from the preprocessed effective speech signal.
The characteristic parameters may be the fundamental frequency, linear prediction cepstral coefficients (LPCC), Mel-frequency cepstral coefficients (MFCC), line spectral pairs (LSP), or similar parameters.
In an embodiment of the invention, the per-frame fundamental frequency F0 and short-time spectral parameters of every effective speech signal are obtained with the STRAIGHT platform; from the short-time spectral parameters, the LPC coefficients of each frame are computed with the Levinson-Durbin algorithm and then converted to LPCC coefficients, yielding the characteristic parameters of every speaker participating in training. The F0 model used to obtain the fundamental frequency is described by a single Gaussian distribution.
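The LPC-to-LPCC conversion mentioned above follows the standard cepstral recursion $c_1 = a_1$, $c_n = a_n + \sum_{k=1}^{n-1} (k/n)\, c_k\, a_{n-k}$. A minimal sketch is below; the function name is hypothetical, and the LPC coefficients are assumed to come from the Levinson-Durbin step with the $1 - \sum_k a_k z^{-k}$ sign convention.

```python
import numpy as np

def lpc_to_lpcc(a, n_cep=None):
    """Convert LPC coefficients a[1..p] (prediction polynomial
    1 - sum_k a_k z^-k) to LPCC cepstral coefficients c[1..n_cep]."""
    p = len(a)
    n_cep = n_cep or p
    c = np.zeros(n_cep)
    for n in range(1, n_cep + 1):
        # Direct LPC term exists only up to order p
        acc = a[n - 1] if n <= p else 0.0
        # Recursive cepstral term: sum over previous coefficients
        for k in range(1, n):
            if n - k <= p:
                acc += (k / n) * c[k - 1] * a[n - k - 1]
        c[n - 1] = acc
    return c
```

For a single-pole model $1/(1 - a z^{-1})$ the cepstrum has the closed form $c_n = a^n / n$, which is a convenient sanity check on the recursion.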
Step 3: perform UBM training on the characteristic parameters to obtain a speaker-independent UBM.
In this step, before UBM training, the differences between speakers and the sizes of each speaker's training corpus are first balanced; the characteristic parameters of all training speech are then merged, and the UBM is trained with the EM algorithm. In the initial UBM, the weight of each component is initialized to 1/M, where M is the number of Gaussian components in the UBM.
The UBM (universal background model) is a global background model independent of any speaker. It is in essence a large Gaussian mixture model (GMM), generally trained on a large amount of speech from many speakers. The idea is to capture all speakers' information in the supervector formed by the mixture of Gaussian density functions; the model reflects the statistical average distribution of all speakers' vocal characteristics, thereby eliminating personal characteristics. As a reference model, the UBM covers multiple subspaces: each subspace corresponds to a cluster center, is described by a Gaussian probability density function, and represents a part of the feature space.
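The UBM training described above can be sketched as a diagonal-covariance GMM fitted with EM on the pooled feature frames. This is a minimal illustrative implementation, not the patent's: the weights start at 1/M as the description states, but the initialization of the means from random frames, the regularization constants, and the function name are assumptions.

```python
import numpy as np

def train_ubm(features, n_components=4, n_iter=50, seed=0):
    """Fit a diagonal-covariance GMM to pooled feature frames with EM
    (a minimal stand-in for UBM training; weights start at 1/M)."""
    rng = np.random.default_rng(seed)
    n, d = features.shape
    w = np.full(n_components, 1.0 / n_components)   # initial weights 1/M
    mu = features[rng.choice(n, n_components, replace=False)]
    var = np.tile(features.var(axis=0), (n_components, 1)) + 1e-6
    for _ in range(n_iter):
        # E-step: responsibilities from log Gaussian densities
        logp = (-0.5 * (((features[:, None, :] - mu) ** 2) / var
                        + np.log(2 * np.pi * var)).sum(-1)
                + np.log(w))
        logp -= logp.max(axis=1, keepdims=True)
        resp = np.exp(logp)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: update weights, means, diagonal variances
        nk = resp.sum(axis=0) + 1e-12
        w = nk / n
        mu = (resp.T @ features) / nk[:, None]
        var = (resp.T @ (features ** 2)) / nk[:, None] - mu ** 2 + 1e-6
    return w, mu, var
```

In practice a UBM uses many components (often 64 or more) and far more data; the tiny sizes here are only for illustration.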
Step 4: from the UBM, obtain the independent, speaker-dependent speech models, and from these models obtain the transfer functions of the spectral and fundamental-frequency parameters.
Step 4 further comprises the following steps:
Step 41: preprocess the source and target speakers' training speech separately;
In an embodiment of the invention, the preprocessing includes pre-emphasis, Hamming windowing, framing, and similar operations.
Step 42: extract both speakers' LPCC and fundamental-frequency parameters;
Step 43: based on the LPCC parameters, obtain the source and target speakers' GMMs from the UBM;
In an embodiment of the invention, the source and target speakers' GMMs are obtained from the UBM by MAP adaptation.
Each speaker's GMM is described by mean vectors, covariance matrices, and mixture weights:

$$p(x \mid \lambda) = \sum_{i=1}^{M} \omega_i\, p_i(x \mid \mu_i, \Sigma_i), \qquad \sum_{i=1}^{M} \omega_i = 1$$

where $\omega_i$ is the mixture weight of component $i$, $\mu_i$ its mean vector, $\Sigma_i$ its covariance matrix, and $M$ the order of the GMM; the probability density of an $M$-order GMM is the weighted sum of $M$ Gaussian probability densities.
Suppose training has produced a UBM $\lambda = \{\omega_i, \mu_i, \Sigma_i\}$, and a given speaker's feature vectors are $X = \{x_1, \dots, x_t, \dots, x_T\}$. The concrete steps for obtaining the source and target speakers' GMMs from the UBM by MAP adaptation are as follows.
First, compute the posterior weight of each Gaussian component of the GMM:

$$\Pr(i \mid x_t) = \frac{\omega_i\, p_i(x_t \mid \mu_i, \Sigma_i)}{\sum_{j=1}^{M} \omega_j\, p_j(x_t \mid \mu_j, \Sigma_j)}$$

where $x_t$ is the feature vector of frame $t$; $\omega_i$ and $\omega_j$ are the weights of Gaussian components $i$ and $j$; and $p_i(x_t \mid \mu_i, \Sigma_i)$ and $p_j(x_t \mid \mu_j, \Sigma_j)$ are the corresponding Gaussian densities with mean vectors $\mu_i, \mu_j$ and covariance matrices $\Sigma_i, \Sigma_j$.
Then, use the weights $\Pr(i \mid x_t)$ and the feature vectors $x_t$ to compute the statistics used to update the means and variances:

$$n_i = \sum_{t=1}^{T} \Pr(i \mid x_t), \qquad E_i(x) = \frac{1}{n_i} \sum_{t=1}^{T} \Pr(i \mid x_t)\, x_t, \qquad E_i(x^2) = \frac{1}{n_i} \sum_{t=1}^{T} \Pr(i \mid x_t)\, x_t^2$$

where $T$ is the length (number of frames) of the training vector sequence.
Then, use the statistics $n_i$ together with the old UBM parameters to update the mean and variance of each Gaussian component $i$, which yields the source and target speakers' GMMs.
The formulas that use $n_i$ and the old UBM parameters to update the mean and variance of each Gaussian component $i$ are:

$$\hat{\mu}_i = \alpha_i E_i(x) + (1 - \alpha_i)\, \mu_i, \qquad \hat{\sigma}_i^2 = \alpha_i E_i(x^2) + (1 - \alpha_i)\,(\sigma_i^2 + \mu_i^2) - \hat{\mu}_i^2, \qquad \alpha_i = \frac{n_i}{n_i + \tau}$$

where $\hat{\mu}_i$ is the updated mean of Gaussian component $i$, $\hat{\sigma}_i^2$ its updated variance, $\sigma_i^2$ the variance of the original component $i$, and $\tau$ a fixed relevance factor; in an embodiment of the invention, $\tau = 20$.
Thus, speaker-dependent GMMs are obtained from the trained UBM by adaptation. Because every model is adapted from the same UBM baseline, each component of each model corresponds to the same component of the UBM, so the components of these models are automatically aligned in order. Model conversion therefore reduces to conversion between corresponding Gaussian components, from which the transfer functions of the spectral parameters and the fundamental frequency can be derived.
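The MAP adaptation above (posteriors, statistics $n_i$, $E_i(x)$, $E_i(x^2)$, then interpolation with relevance factor $\tau = 20$) can be sketched for diagonal covariances as follows; the function name and numerical-stability constants are assumptions.

```python
import numpy as np

def map_adapt(w, mu, var, X, tau=20.0):
    """MAP-adapt UBM means and (diagonal) variances to one speaker's
    frames X, with relevance factor tau = 20 as in the description."""
    # Posterior Pr(i | x_t) of each UBM component for each frame
    logp = (-0.5 * (((X[:, None, :] - mu) ** 2) / var
                    + np.log(2 * np.pi * var)).sum(-1) + np.log(w))
    logp -= logp.max(axis=1, keepdims=True)
    post = np.exp(logp)
    post /= post.sum(axis=1, keepdims=True)
    # Sufficient statistics n_i, E_i[x], E_i[x^2]
    n_i = post.sum(axis=0) + 1e-12
    Ex = (post.T @ X) / n_i[:, None]
    Ex2 = (post.T @ (X ** 2)) / n_i[:, None]
    # Interpolate new statistics with the old UBM parameters
    alpha = (n_i / (n_i + tau))[:, None]
    mu_new = alpha * Ex + (1 - alpha) * mu
    var_new = alpha * Ex2 + (1 - alpha) * (var + mu ** 2) - mu_new ** 2
    return mu_new, np.maximum(var_new, 1e-6)
```

As $\tau$ grows the adapted model stays close to the UBM; as a component accumulates more data ($n_i \gg \tau$), its parameters move toward the speaker's statistics.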
Step 44: compute the mean and variance of the fundamental-frequency parameters and model them with a single Gaussian;
Step 45: from the GMMs obtained in step 43 and the fundamental-frequency model obtained in step 44, obtain the transfer functions of the spectral and fundamental-frequency parameters.
In this step, the transfer function of the spectral parameters is derived as follows.
Fig. 2 depicts the relation between a pair of Gaussian components of the source and target speakers' personal characteristics, denoted $(\mu^s, \sigma^s)$ and $(\mu^t, \sigma^t)$ respectively; $X$ is the source speaker's spectral parameter to be converted and $Y$ the converted spectral parameter of the target speaker. From Fig. 2 one can derive:

$$\frac{X - \mu^s}{\sigma^s} = \frac{Y - \mu^t}{\sigma^t}$$

and therefore:

$$Y = \mu^t + \frac{\sigma^t}{\sigma^s}\,(X - \mu^s)$$

Taking the weighted sum over all Gaussian components, the transfer function of the spectral parameters is expressed as:

$$F(X) = \sum_{i=1}^{Q} p_i(X)\left[\mu_i^t + (\Sigma_i^t)^{1/2}\,(\Sigma_i^s)^{-1/2}\,(X - \mu_i^s)\right]$$

where $p_i(X)$ is the posterior probability of the $i$-th Gaussian component of the source speaker's GMM, $Q$ is the number of Gaussian components, $\mu_i^s$ and $\Sigma_i^s$ are the mean and covariance matrix of the $i$-th component of the source speaker's GMM, and $\mu_i^t$ and $\Sigma_i^t$ those of the $i$-th component of the target speaker's GMM.
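The per-frame spectral transfer function, a posterior-weighted sum of per-component mean/variance normalizations, can be sketched as follows, assuming diagonal covariances so the covariance ratio reduces to an element-wise standard-deviation ratio; the function name is hypothetical.

```python
import numpy as np

def convert_spectrum(x, w_s, mu_s, var_s, mu_t, var_t):
    """Convert one spectral-feature frame x: weighted sum over components
    of mu_t_i + (sigma_t_i / sigma_s_i) * (x - mu_s_i), with weights
    p_i(x) taken as posteriors under the source GMM (diagonal covs)."""
    # Posterior p_i(x) of each source-GMM component
    logp = (-0.5 * (((x - mu_s) ** 2) / var_s
                    + np.log(2 * np.pi * var_s)).sum(-1) + np.log(w_s))
    logp -= logp.max()
    p = np.exp(logp)
    p /= p.sum()
    # Per-component normalization from source stats, denormalization to target
    scale = np.sqrt(var_t / var_s)
    return (p[:, None] * (mu_t + scale * (x - mu_s))).sum(axis=0)
```

Because the source and target components are MAP-adapted from the same UBM, component $i$ of the source model pairs with component $i$ of the target model with no explicit alignment step.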
In this step, the transfer function of the fundamental-frequency parameter is obtained by the Gaussian transformation method, which assumes that both the source and target speakers' fundamental frequencies are normally distributed. The transfer function is expressed as:

$$f_0^{conv} = \mu_t + \frac{\sigma_t}{\sigma_s}\,\bigl(f_0^{src} - \mu_s\bigr)$$

where $\mu_s$ and $\mu_t$ are the means of the source and target speakers' fundamental frequencies, $\sigma_s$ and $\sigma_t$ their standard deviations, and $f_0^{src}$ is the fundamental frequency of the source speaker's speech.
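The Gaussian F0 transformation can be sketched as below. As an assumption, it is applied to log-F0 (a common choice when modeling pitch with a single Gaussian; the patent text does not specify the domain), and unvoiced frames marked F0 = 0 are passed through unchanged.

```python
import numpy as np

def convert_f0(f0_src, mu_s, sigma_s, mu_t, sigma_t):
    """Gaussian-normalized F0 conversion, here in the log domain:
    normalize by the source mean/std, denormalize with the target's."""
    f0 = np.asarray(f0_src, dtype=float)
    out = np.zeros_like(f0)
    voiced = f0 > 0                  # unvoiced frames (F0 = 0) pass through
    z = (np.log(f0[voiced]) - mu_s) / sigma_s
    out[voiced] = np.exp(mu_t + sigma_t * z)
    return out
```

When the source and target statistics coincide the transform is the identity, which is a quick sanity check.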
Step 5: input the characteristic parameters of the speech to be converted into the transfer functions obtained in step 4 to get the converted characteristic parameters of the target speaker.
This step further comprises the following steps:
Step 51: extract the short-time spectrum and fundamental frequency F0 of the source speaker's speech to be converted;
In an embodiment of the invention, the short-time spectrum and F0 of the source speech are extracted with STRAIGHT.
Step 52: extract the LPCC parameters from the short-time spectral envelope;
Step 53: convert the source speaker's LPCC parameters and F0 with the spectral-parameter and fundamental-frequency transfer functions respectively, obtaining the target speaker's LPCC and fundamental-frequency parameters.
Step 6: synthesize the converted characteristic parameters of the target speaker to obtain the target speech.
This step further comprises the following steps:
Step 61: re-estimate the target speaker's short-time spectral envelope from the converted LPCC parameters;
Step 62: combine the short-time spectral envelope with the converted fundamental frequency F0 to obtain speech with the target speaker's characteristics.
In step 62, the short-time spectral envelope and the converted F0 are synthesized with the STRAIGHT platform.
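Since STRAIGHT is an external analysis/synthesis platform, a simple stand-in for the envelope re-estimation of step 61 is the minimum-phase cepstral expansion $\log|H(\omega)| = \sum_n c_n \cos(\omega n)$ (gain term omitted). This sketch is an assumption about one way to realize the step, not the patented synthesis:

```python
import numpy as np

def lpcc_to_envelope(c, n_fft=512):
    """Re-estimate a log-magnitude spectral envelope from LPCC
    coefficients c[1..N]: log|H(w)| = sum_n c_n cos(w n)."""
    omega = np.linspace(0, np.pi, n_fft // 2 + 1)   # frequency grid
    n = np.arange(1, len(c) + 1)
    return c @ np.cos(np.outer(n, omega))           # one value per bin
```

The resulting envelope, exponentiated and combined with the converted F0 excitation, would feed the final waveform synthesis.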
The specific embodiments described above further explain the object, technical solution, and beneficial effects of the present invention. It should be understood that the foregoing are only specific embodiments of the invention and do not limit it; any modification, equivalent replacement, or improvement made within the spirit and principles of the invention shall fall within its scope of protection.
Claims (10)
1. the phonetics transfer method based on the non-parallel training of self-adaptation, is characterized in that, the method comprises the following steps:
Step 1 detects effective voice signal from the speech samples collecting, and described efficient voice signal is carried out to pre-service;
Step 2, for the efficient voice signal extraction speech characteristic parameter obtaining after pre-service;
Step 3, carries out UBM training based on described speech characteristic parameter, obtain one with the irrelevant UBM model of speaker;
Step 4, based on described UBM model, obtains the independent speaker speech model relevant with speaker, based on described independent speaker's speech model, obtains the transfer function of frequency spectrum parameter and base frequency parameters;
Step 5, is input to the speech characteristic parameter of voice to be converted in the transfer function that described step 4 obtains the speech characteristic parameter of the target speaker after being changed;
Step 6, synthesizes the speech characteristic parameter of the target speaker after conversion, obtains target voice.
2. The method according to claim 1, characterized in that the preprocessing includes but is not limited to pre-emphasis, Hamming windowing, and framing.
3. The method according to claim 1, characterized in that the speech characteristic parameters include but are not limited to the fundamental frequency, linear prediction cepstral coefficients (LPCC), Mel-frequency cepstral coefficients (MFCC), and line spectral pairs (LSP).
4. The method according to claim 1, characterized in that in step 2 the fundamental frequency F0 and short-time spectral parameters of each frame of effective speech are first obtained; the LPC coefficients of each frame are then computed from the short-time spectral parameters; and the LPC coefficients are then converted to LPCC coefficients.
5. The method according to claim 1, characterized in that in step 3, before UBM training, the differences between speakers and the sizes of each speaker's training corpus are first balanced; the characteristic parameters of all training speech are then merged, and the UBM is trained with the EM algorithm.
6. The method according to claim 1, characterized in that step 4 further comprises the following steps:
Step 41: preprocess the source and target speakers' training speech separately;
Step 42: extract both speakers' LPCC and fundamental-frequency parameters;
Step 43: based on the LPCC parameters, obtain the source and target speakers' GMMs from the UBM;
Step 44: compute the mean and variance of the fundamental-frequency parameters and model them with a single Gaussian;
Step 45: from the GMMs obtained in step 43 and the fundamental-frequency model obtained in step 44, obtain the transfer functions of the spectral and fundamental-frequency parameters.
7. The method according to claim 6, characterized in that the source and target speakers' GMMs are obtained from the UBM by MAP adaptation.
8. The method according to claim 6, characterized in that the transfer function of the spectral parameters is expressed as:

$$F(X) = \sum_{i=1}^{Q} p_i(X)\left[\mu_i^t + (\Sigma_i^t)^{1/2}\,(\Sigma_i^s)^{-1/2}\,(X - \mu_i^s)\right]$$

where $p_i(X)$ is the posterior probability of the $i$-th Gaussian component of the source speaker's GMM, $Q$ is the number of Gaussian components, $\mu_i^s$ and $\Sigma_i^s$ are the mean and covariance matrix of the $i$-th component of the source speaker's GMM, and $\mu_i^t$ and $\Sigma_i^t$ those of the $i$-th component of the target speaker's GMM; and the transfer function of the fundamental-frequency parameter is expressed as:

$$f_0^{conv} = \mu_t + \frac{\sigma_t}{\sigma_s}\,\bigl(f_0^{src} - \mu_s\bigr)$$

where $\mu_s$ and $\mu_t$ are the means of the source and target speakers' fundamental frequencies, $\sigma_s$ and $\sigma_t$ their standard deviations, and $f_0^{src}$ is the fundamental frequency of the source speaker's speech.
9. The method according to claim 1, characterized in that step 5 further comprises the following steps:
Step 51: extract the short-time spectrum and fundamental frequency F0 of the source speaker's speech to be converted;
Step 52: extract the LPCC parameters from the short-time spectral envelope;
Step 53: convert the source speaker's LPCC parameters and F0 with the spectral-parameter and fundamental-frequency transfer functions respectively, obtaining the target speaker's LPCC and fundamental-frequency parameters.
10. The method according to claim 1, characterized in that step 6 further comprises the following steps:
Step 61: re-estimate the target speaker's short-time spectral envelope from the converted LPCC parameters;
Step 62: combine the short-time spectral envelope with the converted fundamental frequency F0 to obtain speech with the target speaker's characteristics.
Priority Applications (1)
- CN201410377091.0A — filed 2014-08-01 — Self-adaptive non-parallel training based voice conversion method
Publications (1)
- CN104123933A — published 2014-10-29 — status: Pending
Family ID: 51769323
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104464744A (en) * | 2014-11-19 | 2015-03-25 | 河海大学常州校区 | Cluster voice transforming method and system based on mixture Gaussian random process |
CN105390141A (en) * | 2015-10-14 | 2016-03-09 | 科大讯飞股份有限公司 | Sound conversion method and sound conversion device |
CN105895080A (en) * | 2016-03-30 | 2016-08-24 | 乐视控股(北京)有限公司 | Voice recognition model training method, speaker type recognition method and device |
CN106448673A (en) * | 2016-09-18 | 2017-02-22 | 广东顺德中山大学卡内基梅隆大学国际联合研究院 | Chinese electrolarynx speech conversion method |
CN107507619A (en) * | 2017-09-11 | 2017-12-22 | 厦门美图之家科技有限公司 | Phonetics transfer method, device, electronic equipment and readable storage medium storing program for executing |
CN108108357A (en) * | 2018-01-12 | 2018-06-01 | 京东方科技集团股份有限公司 | Accent conversion method and device, electronic equipment |
CN108766465A (en) * | 2018-06-06 | 2018-11-06 | 华中师范大学 | A kind of digital audio based on ENF universal background models distorts blind checking method |
CN108777140A (en) * | 2018-04-27 | 2018-11-09 | 南京邮电大学 | Phonetics transfer method based on VAE under a kind of training of non-parallel corpus |
WO2018223796A1 (en) * | 2017-06-07 | 2018-12-13 | 腾讯科技(深圳)有限公司 | Speech recognition method, storage medium, and speech recognition device |
CN109326283A (en) * | 2018-11-23 | 2019-02-12 | 南京邮电大学 | Multi-to-multi phonetics transfer method under non-parallel text condition based on text decoder |
CN109377986A (en) * | 2018-11-29 | 2019-02-22 | 四川长虹电器股份有限公司 | A kind of non-parallel corpus voice personalization conversion method |
CN109584893A (en) * | 2018-12-26 | 2019-04-05 | 南京邮电大学 | Based on the multi-to-multi speech conversion system of VAE and i-vector under non-parallel text condition |
CN109599091A (en) * | 2019-01-14 | 2019-04-09 | 南京邮电大学 | Multi-to-multi voice conversion method based on STARWGAN-GP and x vector |
CN110060690A (en) * | 2019-04-04 | 2019-07-26 | 南京邮电大学 | Multi-to-multi voice conversion method based on STARGAN and ResNet |
CN110164414A (en) * | 2018-11-30 | 2019-08-23 | 腾讯科技(深圳)有限公司 | Method of speech processing, device and smart machine |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101751921A (en) * | 2009-12-16 | 2010-06-23 | Nanjing University of Posts and Telecommunications | Real-time voice conversion method under conditions of minimal amount of training data |
CN102063899A (en) * | 2010-10-27 | 2011-05-18 | Nanjing University of Posts and Telecommunications | Method for voice conversion under non-parallel text conditions |
CN103021418A (en) * | 2012-12-13 | 2013-04-03 | Nanjing University of Posts and Telecommunications | Voice conversion method oriented to multi-time-scale prosodic features |
CN103280224A (en) * | 2013-04-24 | 2013-09-04 | Southeast University | Voice conversion method based on an adaptive algorithm under asymmetric corpus conditions |
- 2014-08-01: Application CN201410377091.0A filed in China (CN); published as CN104123933A; status: Pending
Non-Patent Citations (1)
Title |
---|
Zhu Chunlei: "Research on Optimized Adaptive Non-parallel Training Voice Conversion Algorithms", China Master's Theses Full-text Database, Information Science and Technology series * |
Cited By (25)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104464744A (en) * | 2014-11-19 | 2015-03-25 | Hohai University Changzhou Campus | Cluster voice conversion method and system based on mixture Gaussian random process |
CN105390141A (en) * | 2015-10-14 | 2016-03-09 | iFlytek Co., Ltd. | Voice conversion method and device |
CN105390141B (en) * | 2015-10-14 | 2019-10-18 | iFlytek Co., Ltd. | Voice conversion method and device |
CN105895080A (en) * | 2016-03-30 | 2016-08-24 | LeTV Holdings (Beijing) Co., Ltd. | Voice recognition model training method, speaker type recognition method and device |
CN106448673A (en) * | 2016-09-18 | 2017-02-22 | SYSU-CMU Shunde International Joint Research Institute, Guangdong | Chinese electrolarynx speech conversion method |
CN106448673B (en) * | 2016-09-18 | 2019-12-10 | SYSU-CMU Shunde International Joint Research Institute, Guangdong | Chinese electrolarynx speech conversion method |
WO2018223796A1 (en) * | 2017-06-07 | 2018-12-13 | Tencent Technology (Shenzhen) Co., Ltd. | Speech recognition method, storage medium, and speech recognition device |
CN107507619A (en) * | 2017-09-11 | 2017-12-22 | Xiamen Meituzhijia Technology Co., Ltd. | Voice conversion method and device, electronic equipment, and readable storage medium |
CN107507619B (en) * | 2017-09-11 | 2021-08-20 | Xiamen Meituzhijia Technology Co., Ltd. | Voice conversion method and device, electronic equipment, and readable storage medium |
CN108108357A (en) * | 2018-01-12 | 2018-06-01 | BOE Technology Group Co., Ltd. | Accent conversion method and device, and electronic equipment |
CN108108357B (en) * | 2018-01-12 | 2022-08-09 | BOE Technology Group Co., Ltd. | Accent conversion method and device, and electronic equipment |
CN108777140A (en) * | 2018-04-27 | 2018-11-09 | Nanjing University of Posts and Telecommunications | VAE-based voice conversion method under non-parallel corpus training |
CN108766465A (en) * | 2018-06-06 | 2018-11-06 | Central China Normal University | Blind detection method for digital audio tampering based on ENF universal background model |
CN108766465B (en) * | 2018-06-06 | 2020-07-28 | Central China Normal University | Blind detection method for digital audio tampering based on ENF universal background model |
CN109326283A (en) * | 2018-11-23 | 2019-02-12 | Nanjing University of Posts and Telecommunications | Many-to-many voice conversion method based on a text encoder under non-parallel text conditions |
CN109326283B (en) * | 2018-11-23 | 2021-01-26 | Nanjing University of Posts and Telecommunications | Many-to-many voice conversion method based on a text encoder under non-parallel text conditions |
CN109377986A (en) * | 2018-11-29 | 2019-02-22 | Sichuan Changhong Electric Co., Ltd. | Non-parallel corpus voice personalized conversion method |
CN109377986B (en) * | 2018-11-29 | 2022-02-01 | Sichuan Changhong Electric Co., Ltd. | Non-parallel corpus voice personalized conversion method |
CN110164414B (en) * | 2018-11-30 | 2023-02-14 | Tencent Technology (Shenzhen) Co., Ltd. | Speech processing method and device, and intelligent device |
CN110164414A (en) * | 2018-11-30 | 2019-08-23 | Tencent Technology (Shenzhen) Co., Ltd. | Speech processing method and device, and intelligent device |
CN109584893A (en) * | 2018-12-26 | 2019-04-05 | Nanjing University of Posts and Telecommunications | Many-to-many voice conversion system based on VAE and i-vector under non-parallel text conditions |
CN109584893B (en) * | 2018-12-26 | 2021-09-14 | Nanjing University of Posts and Telecommunications | Many-to-many voice conversion system based on VAE and i-vector under non-parallel text conditions |
CN109599091A (en) * | 2019-01-14 | 2019-04-09 | Nanjing University of Posts and Telecommunications | Many-to-many voice conversion method based on StarWGAN-GP and x-vector |
CN109599091B (en) * | 2019-01-14 | 2021-01-26 | Nanjing University of Posts and Telecommunications | Many-to-many voice conversion method based on StarWGAN-GP and x-vector |
CN110060690A (en) * | 2019-04-04 | 2019-07-26 | Nanjing University of Posts and Telecommunications | Many-to-many voice conversion method based on StarGAN and ResNet |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN104123933A (en) | Self-adaptive non-parallel training based voice conversion method | |
Dave | Feature extraction methods LPC, PLP and MFCC in speech recognition | |
CN102509547B (en) | Method and system for voiceprint recognition based on vector quantization | |
CN109767778B (en) | Voice conversion method fusing Bi-LSTM and WaveNet | |
CN102779508B (en) | Speech corpus generation apparatus and method, and speech synthesis system and method | |
CN102332263B (en) | Speaker recognition method synthesizing an emotion model based on the nearest-neighbor principle | |
Song et al. | Noise invariant frame selection: a simple method to address the background noise problem for text-independent speaker verification | |
CN105593936B (en) | System and method for text-to-speech performance evaluation | |
Kumar et al. | Design of an automatic speaker recognition system using MFCC, vector quantization and LBG algorithm | |
CN102568476B (en) | Voice conversion method based on self-organizing feature map network cluster and radial basis network | |
CN106653056A (en) | Fundamental frequency extraction model based on LSTM recurrent neural network and training method thereof | |
CN102426834B (en) | Method for testing rhythm level of spoken English | |
CN102810311B (en) | Speaker estimation method and speaker estimation equipment | |
Das et al. | Bangladeshi dialect recognition using Mel frequency cepstral coefficient, delta, delta-delta and Gaussian mixture model | |
CN105206257A (en) | Voice conversion method and device | |
CN113506562B (en) | End-to-end voice synthesis method and system based on fusion of acoustic features and text emotional features | |
CN102237083A (en) | Portable interpretation system based on WinCE platform and language recognition method thereof | |
CN103456302A (en) | Emotion speaker recognition method based on emotion GMM model weight synthesis | |
CN110047501A (en) | Multi-to-multi phonetics transfer method based on beta-VAE | |
CN101419800B (en) | Emotional speaker recognition method based on frequency spectrum translation | |
CN106297769B (en) | A discriminative feature extraction method applied to language identification | |
Gamit et al. | Isolated words recognition using MFCC, LPC and neural network |
Garg et al. | Survey on acoustic modeling and feature extraction for speech recognition | |
Kaur et al. | Genetic algorithm for combined speaker and speech recognition using deep neural networks | |
Wu et al. | The DKU-LENOVO Systems for the INTERSPEECH 2019 Computational Paralinguistic Challenge. |
Legal Events
Code | Title | Description
---|---|---
C06 | Publication |
PB01 | Publication |
C10 | Entry into substantive examination |
SE01 | Entry into force of request for substantive examination |
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 2014-10-29