CN102306492A - Voice conversion method based on convolutive nonnegative matrix factorization - Google Patents
- Publication number
- CN102306492A CN102306492A CN201110267425A CN201110267425A CN102306492A CN 102306492 A CN102306492 A CN 102306492A CN 201110267425 A CN201110267425 A CN 201110267425A CN 201110267425 A CN201110267425 A CN 201110267425A CN 102306492 A CN102306492 A CN 102306492A
- Authority
- CN
- China
- Prior art keywords
- voice
- straight
- conversion
- phoneme
- point
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Abstract
The invention discloses a voice conversion method based on convolutive non-negative matrix factorization. The method comprises the following steps: (1) training a transformation model with training data: carrying out time alignment and parameter decomposition of the training voice data, analyzing the STRAIGHT spectra with a convolutive non-negative matrix factorization method, and analyzing the pitch frequencies of the source voice and the target voice; (2) converting newly input voice with the trained model: carrying out parameter decomposition of the source voice data A[c] to be converted with the STRAIGHT model, converting the vocal-tract spectrum parameters based on convolutive non-negative matrix factorization, converting the pitch frequency based on the means and variances obtained in the training phase, and synthesizing the converted voice from the converted STRAIGHT spectrum S[Bc], the converted pitch frequency f[Bc] and the original aperiodic component ap[Ac]. The invention improves the training effect of voice conversion and the quality of the converted voice.
Description
Technical field
The invention belongs to the field of speech processing technology, and in particular relates to a voice conversion method based on convolutive non-negative matrix factorization.
Background technology
Voice conversion is a technique that changes the personal characteristic information in a source speaker's voice signal so that it carries the target speaker's personal characteristic information. Voice conversion has wide application prospects in personalized human-computer interaction, military affairs, information security and multimedia entertainment. For example, combined with a text-to-speech synthesis system it can realize personalized speech synthesis; by voice conversion, an enemy commander's voice can be forged to send false information or orders and disrupt the enemy's operational command; and the speeches of historical figures can be reproduced through voice conversion.
Voice conversion (Voice Conversion/Transformation) has been studied for more than twenty years (Li Bo, Wang Chengyou, Cai Xuanping, et al., "Survey of voice conversion and related techniques [J]", Journal on Communications, 2004(05):109-118); the earliest method was proposed by Abe et al. in 1988. Existing voice conversion methods mainly include: the method based on vector-quantization codebook mapping (1. M. Abe, S. Nakamura, K. Shikano, and H. Kuwabara, "Voice conversion through vector quantization," ICASSP-88, 1988, pp. 655-658), the method based on Gaussian mixture models (2. Y. Stylianou, O. Cappe and E. Moulines, "Continuous probabilistic transform for voice conversion," Speech and Audio Processing, IEEE Transactions on, vol. 6, pp. 131-142, 1998), the method based on hidden Markov models (3. E. K. Kim, S. Lee and Y. H. Oh, "Hidden Markov Model Based Voice Conversion Using Dynamic Characteristics of Speaker," in Proc. Eurospeech, Rhodes, Greece, 1997, pp. 2519-2522), the method based on frequency warping (4. D. Erro and A. Moreno, "Weighted Frequency Warping for Voice Conversion," in InterSpeech 2007 - EuroSpeech, Antwerp, Belgium, 2007), and the method based on artificial neural networks (5. S. Desai, A. Black, B. Yegnanarayana, and K. Prahallad, "Spectral Mapping Using Artificial Neural Networks for Voice Conversion," Audio, Speech, and Language Processing, IEEE Transactions on, vol. PP, p. 1-1, 2010).
Although many voice conversion methods have been proposed, the conversion effect is still far from practical requirements. The main problems of existing voice conversion methods are:
1. Many voice conversion methods are built on a framework in which the voice signal is first divided into frames and each frame is then processed independently. Under this framework the inter-frame correlation of the voice is ignored, so the converted voice exhibits discontinuities and its quality is reduced. Examples are the method based on vector-quantization codebook mapping, the method based on Gaussian mixture models, and the method based on artificial neural networks;
2. The goal of voice conversion is to correctly convert the speaker's personal characteristic information in the voice, yet existing methods do not separate that information from the voice signal before conversion, but process the signal directly. This not only makes the conversion effect unsatisfactory, but also changes other components of the signal, degrading the quality of the converted voice.
Convolutive Non-negative Matrix Factorization (CNMF) is a non-negative matrix factorization method proposed for speech signal processing. It replaces the original one-dimensional basis vectors with two-dimensional time-frequency bases, so that the temporal correlation of the voice signal is better carried while the non-negativity of the decomposition result is preserved. The method has been applied successfully to the separation of multi-speaker speech (6. S. Paris, "Convolutive Speech Bases and Their Application to Supervised Speech Separation," Audio, Speech, and Language Processing, IEEE Transactions on, vol. 15, pp. 1-12, 2007). The method decomposes a voice signal into a set of non-negative time-frequency bases and the encoding matrix of that set of bases. The time-frequency bases obtained from the decomposition can be regarded as subspaces carrying the speaker's characteristics, and the encoding matrix is the projection of the voice onto each subspace. This separation therefore largely realizes the function of separating the speaker characteristic information from the voice signal. In addition, compared with conventional non-negative matrix factorization, CNMF better accounts for the temporal correlation of the voice signal, thereby ensuring the continuity of the reconstructed voice.
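As a concrete illustration of the decomposition just described, the CNMF model can be written V ≈ Σ_t W_t · H→t, where each W_t is one time slice of the time-frequency bases and H→t is the encoding matrix shifted right by t columns. A minimal numpy sketch of this reconstruction follows; the function and variable names (`shift_right`, `cnmf_reconstruct`, `W`, `H`) are chosen here for illustration and do not come from the patent:

```python
import numpy as np

def shift_right(H, t):
    """Shift the columns of H right by t positions, zero-filling vacated columns."""
    if t == 0:
        return H.copy()
    out = np.zeros_like(H)
    out[:, t:] = H[:, :-t]
    return out

def cnmf_reconstruct(W, H):
    """Reconstruct V ≈ sum_t W[t] @ shift_right(H, t).

    W: array of shape (T, F, R) -- T time slices of R frequency-domain bases
    H: array of shape (R, N)    -- encoding matrix over N frames
    """
    T = W.shape[0]
    V = np.zeros((W.shape[1], H.shape[1]))
    for t in range(T):
        V += W[t] @ shift_right(H, t)
    return V
```

With T = 1 the model reduces to ordinary NMF, V ≈ W_0 H, which is why CNMF strictly generalizes the one-dimensional basis-vector case mentioned above.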
This method, however, suffers from non-uniqueness of the decomposition result: the basis matrices obtained for the same speech data under different initial conditions are not unique. Although this can be regarded merely as different representations of the feature space, it limits the application of the method to voice conversion.
Summary of the invention
It is an object of the invention to provide a voice conversion method based on convolutive non-negative matrix factorization. Convolutive non-negative matrix factorization is used to separate the personal characteristic information in the vocal-tract spectrum, effectively preserving the temporal correlation of the voice during the separation. On the premise that the CNMF results of the source speaker's and target speaker's voice spectra are consistent, the conversion of the vocal-tract spectrum is completed by replacing the time-frequency bases. Voice conversion is realized on this basis, so that the converted voice has higher quality and a stronger similarity to the target speaker in personal voice characteristics.
The technical solution realizing the object of the invention is a voice conversion method based on convolutive non-negative matrix factorization with the following steps:
First, train the transformation model with training data:
The first step: time alignment and parameter decomposition of the training speech data. The training data are parallel speech data, i.e. pairs of source-speaker and target-speaker utterances with identical content. First the pitch period envelopes of both utterances are extracted with the STRAIGHT model; then the pitch mark points used for pitch-synchronous overlap-add (PSOLA) processing are computed from the pitch period envelopes and the original speech signals. According to the phoneme segmentation information, the pitch mark points of corresponding phonemes in the two utterances are matched; then, with the phoneme as the basic unit, the two utterances are time-aligned by PSOLA based on the matched pitch mark points. The time-aligned utterances are analyzed with the STRAIGHT model to obtain three groups of parameters:
The second step: analyze the STRAIGHT spectra with the convolutive non-negative matrix factorization method, i.e. first decompose the source speaker's STRAIGHT spectrum by CNMF to obtain its time-frequency bases and encoding matrix, then decompose the target speaker's STRAIGHT spectrum by CNMF with the encoding matrix fixed to the one just obtained, which yields the target speaker's time-frequency bases;
The third step: analyze the fundamental frequencies of the source and target voices, i.e. analyze the fundamental frequency contours of both utterances to obtain their means and variances;
Secondly, convert newly input voice with the trained model:
The first step: decompose the source speech data to be converted with the STRAIGHT model to obtain three groups of parameters: its STRAIGHT spectrum, fundamental frequency and aperiodic component;
The second step: convert the vocal-tract spectrum parameters based on convolutive non-negative matrix factorization, i.e. decompose the STRAIGHT spectrum by CNMF with the time-frequency bases fixed to the source speaker's bases, obtain the corresponding encoding matrix, and then obtain the converted STRAIGHT spectrum by the following formula:
The third step: convert the fundamental frequency based on the means and variances obtained in the training stage:
The fourth step: synthesize the converted voice from the converted STRAIGHT spectrum, the converted fundamental frequency and the original aperiodic component.
Compared with the prior art, the invention has the following remarkable advantages: (1) in the training stage, the source speaker's and target speaker's voices are matched with the pitch-synchronous overlap-add method based on phoneme information, so that the matched voices have higher time-alignment precision and voice quality, improving the training effect of voice conversion; (2) the personal characteristic information in the vocal-tract spectrum is effectively separated by the convolutive non-negative matrix factorization method, and the conversion acts on the personal characteristic information, improving the conversion effect. In addition, the CNMF method effectively preserves the temporal correlation of the vocal-tract spectrum parameters, giving the reconstructed voice better continuity and improving the quality of the converted voice.
The present invention is described in further detail below in conjunction with the accompanying drawings.
Brief description of the drawings
Fig. 1 is a schematic diagram of the voice conversion method of the invention based on convolutive non-negative matrix factorization.
Fig. 2 is a schematic diagram of the phoneme-based time alignment of the training voices.
Fig. 3 is a schematic diagram of the voice pitch mark points.
Fig. 4 is a schematic diagram of the calculation of the training-voice time-frequency bases by convolutive non-negative matrix factorization.
Fig. 5 is a schematic diagram of STRAIGHT-spectrum time-frequency bases composed of 40 sub-bases.
Fig. 6 is a schematic diagram of the spectrum conversion process based on convolutive non-negative matrix factorization.
Embodiment
With reference to Fig. 1, the steps of the voice conversion method of the invention based on convolutive non-negative matrix factorization are as follows:
Training stage: train the transformation model with training data.
The first step: time alignment and parameter decomposition of the training speech data:
(1) Time alignment of the speech data, as shown in Fig. 2. First the source speaker's voice and the target speaker's voice in the training data set are analyzed with the STRAIGHT model to obtain the pitch period at each sampled point of both, i.e. the pitch period envelopes:
where the envelope lengths equal the numbers of sampled points contained in the source speaker's voice and the target speaker's voice respectively.
The pitch period here is expressed as a number of sampled points, with the fractional part rounded. Since unvoiced and silent segments have no obvious pitch period, their pitch period is fixed to a default value derived from the speech sampling frequency, where the floor notation denotes the largest integer not exceeding its argument. Based on the pitch contour and with the pitch period length as the frame length, the voices are divided into frames. Taking one voice as an example, the framing steps are as follows:
Starting from the 1st sampled point of the voice, the first frame is determined with the pitch period length at that point as the frame length. The sampled point immediately following the first frame is then taken as the start of the second frame, which is determined with the pitch period at its start as the frame length. By analogy, the start of each subsequent frame is obtained from the previous framing result, and its length is the pitch period length at its starting point. This process is repeated until the end of the voice, yielding a number of frames of voice.
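The framing loop above can be sketched as follows, assuming the per-sample pitch period is already available as an array of sample counts (the names `pitch_sync_frames`, `speech`, `pitch_period` are illustrative, not from the patent):

```python
import numpy as np

def pitch_sync_frames(speech, pitch_period):
    """Divide speech into frames whose lengths track the local pitch period.

    speech: 1-D sample array; pitch_period: per-sample pitch period (in samples).
    Returns a list of (start, length) pairs, following the framing step above:
    each frame starts where the previous one ended, and its length is the
    pitch period at its starting point.
    """
    frames = []
    start = 0
    n = len(speech)
    while start < n:
        length = int(pitch_period[start])
        frames.append((start, min(length, n - start)))  # last frame may be short
        start += length
    return frames
```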
After framing is completed, a speech data matrix is built whose columns are the frames, each column centred on its frame's centre point and of length equal to the longest frame length; every column is windowed with a Hanning window. When building the matrix, frames of insufficient length at the beginning and end of the voice are padded using the voice start and end points respectively.
The matrix is searched column by column, determining one point per column, so that the points form a pitch-mark track through all columns whose sum of point values is maximal. During the search the row-position difference of the selected points between adjacent columns is limited to at most 6 rows. This method yields the pitch mark points for PSOLA processing; in voiced segments these mark points lie at amplitude maxima. Fig. 3 shows the pitch mark points obtained for a segment of voice by the above method. The same method yields the pitch mark points of the other voice.
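The column-by-column search above can be solved exactly with dynamic programming. A sketch follows, assuming the windowed speech data matrix `M` from the previous step; the 6-row constraint appears as `max_jump`, and all names are chosen here for illustration:

```python
import numpy as np

def pitch_mark_track(M, max_jump=6):
    """Dynamic-programming search for the pitch-mark track.

    Chooses one row index per column of M so that the sum of the selected
    values is maximal while the row positions of adjacent columns differ by
    at most max_jump rows. Returns the list of row indices, one per column.
    """
    R, C = M.shape
    score = M[:, 0].astype(float).copy()       # best path score ending at each row
    back = np.zeros((R, C), dtype=int)         # backpointers for path recovery
    for c in range(1, C):
        new = np.full(R, -np.inf)
        for r in range(R):
            lo, hi = max(0, r - max_jump), min(R, r + max_jump + 1)
            prev = int(np.argmax(score[lo:hi])) + lo
            new[r] = score[prev] + M[r, c]
            back[r, c] = prev
        score = new
    track = [int(np.argmax(score))]
    for c in range(C - 1, 0, -1):              # trace the path back to column 0
        track.append(int(back[track[-1], c]))
    return track[::-1]
```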
According to the phoneme segmentation information, the matching correspondence between pitch mark points in the source speaker's and target speaker's phonemes is established, where the two sides denote the pitch mark point information contained in the corresponding phoneme of the source and target voices respectively, in the following concrete form:
Here the elements are the pitch mark points within the corresponding phoneme of the source speaker's and target speaker's voices, and the counts are the numbers of pitch mark points both contain in that phoneme.
Based on the pitch mark point information of the matched phonemes in the training voices, the durations of corresponding source-speaker and target-speaker phonemes are aligned with the PSOLA method. The frame length of the PSOLA processing is taken as three times the pitch period corresponding to the current pitch mark point. In the alignment, the phoneme with the shorter duration in a matched pair serves as the reference, and the other phoneme is compressed by PSOLA to realize the alignment. Since the PSOLA method adjusts duration in units of pitch periods, the adjustment precision is only guaranteed within one pitch period; the residual difference of the current matched phoneme is therefore carried over into the duration alignment of the next matched phoneme. The silent segments between phonemes in the speech data are aligned by truncation.
After the phonemes and the inter-phoneme silent segments of the voices have been processed by the above steps, the time-aligned source speaker's voice and target speaker's voice are obtained.
(2) Speech parameter decomposition. The time-aligned training voices are decomposed with the STRAIGHT model, yielding three groups of parameters for the source speaker's voice and the target speaker's voice respectively:
a) The STRAIGHT spectrum characterizing the vocal-tract spectral features, a two-dimensional matrix:
Each row represents the STRAIGHT spectrum of one frame of voice and contains a fixed number of spectral points. The whole voice is divided into frames for analysis, with frame centres 10 ms apart. The source speaker's and target speaker's voice STRAIGHT spectra are denoted separately;
b) The fundamental frequency of the training voice, a sequence with one value per frame, each corresponding to the matching row of the STRAIGHT spectrum. The source speaker's and target speaker's voice fundamental frequencies are denoted separately;
c) The aperiodic component, a matrix characterizing the aperiodic information of the voiced parts. Its influence on the voice in the conversion is small, so it is not converted.
The second step: analyze the voice STRAIGHT spectra by the convolutive non-negative matrix factorization method to obtain the time-frequency bases of the source speaker's and target speaker's STRAIGHT spectra, as shown in Fig. 4. The analysis steps are as follows:
(1) The source speaker's STRAIGHT spectrum is analyzed with the convolutive non-negative matrix factorization method, giving the following decomposition result:
where the time-frequency bases form a matrix whose column vectors are frequency-domain basis vectors of the STRAIGHT spectrum; a fixed number of such basis vectors constitute one time-frequency base, and a fixed number of such time-frequency bases are obtained. The right-shift operator denotes shifting the encoding matrix to the right, in column-vector form, by the given number of columns, in the following concrete form:
If
where the symbols denote the column vectors of the encoding matrix, which contains a fixed total number of columns; then, for a non-zero shift:
c) The time-frequency bases are updated based on the current encoding matrix; the update is computed successively for each time slice of the bases:
where the first operator denotes element-wise multiplication of two matrices, and the other symbol denotes a matrix whose elements are all 1.
After the time-frequency bases have been updated, the encoding matrix is updated by the following formula:
d) Judge whether the number of iterations has reached the maximum of 300, or the speech reconstruction error has fallen below 10^-5; the reconstruction error is determined by the following formula:
If neither condition is satisfied, return to step b) and continue iterating; otherwise terminate the iteration loop and proceed to the next step e).
Fig. 5 shows a schematic diagram of the STRAIGHT-spectrum time-frequency bases obtained by decomposing a segment of voice with the above method.
The time-frequency bases obtained from the decomposition can be regarded as a feature subspace of the source speaker's STRAIGHT spectrum, carrying the personal characteristic information of the source speaker's vocal-tract spectrum, while the encoding matrix is the projection of the spectrum onto that subspace, carrying the variation of the time-frequency bases over time. Since the source speaker's and target speaker's voices in the training set have been accurately time-aligned, they can be considered to differ only in speaker characteristic information; therefore, after the non-negative matrix factorization, the encoding matrices of the two are identical.
(2) The target speaker's STRAIGHT spectrum is analyzed with the convolutive non-negative matrix factorization method in the same way as in (1), but with the encoding matrix fixed to the one obtained in (1); this yields the target speaker's time-frequency bases.
The third step: analyze the fundamental frequency parameters of the source and target voices. The first- and second-order statistics of the fundamental frequencies in the source speaker's and target speaker's training voices, i.e. the means and variances of both, are obtained from the STRAIGHT analysis.
Conversion stage: convert newly input voice with the trained model.
The first step: decompose the source speech data to be converted with the STRAIGHT model (the parameter decomposition method is the same as in the training stage) to obtain its three groups of parameters: the STRAIGHT spectrum, the fundamental frequency and the aperiodic component;
The second step: convert the vocal-tract spectrum parameters based on convolutive non-negative matrix factorization, as shown in Fig. 6. First the STRAIGHT spectrum is analyzed by convolutive non-negative matrix factorization; the analysis method is the same as in (1) of the second step of the training stage, but with the time-frequency bases fixed to those obtained in training, which yields the corresponding encoding matrix. As analyzed above, the speaker's personal characteristic information is carried in the time-frequency bases, so the conversion is realized by replacing the source speaker's bases with the target speaker's bases, which are then convolved with the encoding matrix to give the converted STRAIGHT spectrum, as shown in the following formula:
The third step: convert the fundamental frequency. The fundamental frequency to be converted is transformed according to the following formula, using the means and variances of the source speaker's and target speaker's fundamental frequencies obtained in the training stage:
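The patent's conversion formula is given as an image; the sketch below uses a common mean-variance transform consistent with the description (shift and scale the source F0 statistics to the target's, leaving unvoiced frames with F0 = 0 untouched). This form is an assumption, and all names are illustrative:

```python
import numpy as np

def convert_f0(f0_src, mu_A, var_A, mu_B, var_B):
    """Mean-variance F0 conversion: scale deviations from the source mean by
    the ratio of standard deviations, then shift to the target mean.
    Unvoiced frames (f0 == 0) pass through unchanged."""
    f0_src = np.asarray(f0_src, dtype=float)
    voiced = f0_src > 0
    out = np.zeros_like(f0_src)
    out[voiced] = np.sqrt(var_B / var_A) * (f0_src[voiced] - mu_A) + mu_B
    return out
```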
The fourth step: synthesize the converted voice. Using the converted STRAIGHT spectrum, the converted fundamental frequency, and the aperiodic component obtained during signal decomposition, the converted speech data is obtained by the STRAIGHT speech synthesis algorithm:
Claims (4)
1. A voice conversion method based on convolutive non-negative matrix factorization, characterized by the following steps:
First, train the transformation model with training data:
The first step: time alignment and parameter decomposition of the training speech data. The training data are parallel speech data, i.e. pairs of source-speaker and target-speaker utterances with identical content. First the pitch period envelopes of both utterances are extracted with the STRAIGHT model; then the pitch mark points used for pitch-synchronous overlap-add processing are computed from the pitch period envelopes and the original speech signals; according to the phoneme segmentation information, the pitch mark points of corresponding phonemes in the two utterances are matched; then, with the phoneme as the basic unit, the two utterances are time-aligned by pitch-synchronous overlap-add based on the matched pitch mark points; the time-aligned utterances are analyzed with the STRAIGHT model to obtain three groups of parameters:
(1) the STRAIGHT spectra; (2) the fundamental frequencies; (3) the aperiodic components;
The second step: analyze the STRAIGHT spectra with the convolutive non-negative matrix factorization method, i.e. first decompose the source speaker's STRAIGHT spectrum by the method to obtain its time-frequency bases and encoding matrix, then decompose the target speaker's STRAIGHT spectrum by convolutive non-negative matrix factorization with the encoding matrix fixed to the one just obtained, yielding the target speaker's time-frequency bases;
The third step: analyze the fundamental frequencies of the source and target voices, i.e. analyze the fundamental frequency contours of both to obtain their means and variances;
Secondly, convert newly input voice with the trained model:
The first step: decompose the source speech data to be converted with the STRAIGHT model to obtain three groups of parameters: its STRAIGHT spectrum, fundamental frequency and aperiodic component;
The second step: convert the vocal-tract spectrum parameters based on convolutive non-negative matrix factorization, i.e. decompose the STRAIGHT spectrum by the method with the time-frequency bases fixed to the source speaker's bases, obtain the corresponding encoding matrix, and then obtain the converted STRAIGHT spectrum by the following formula:
The third step: convert the fundamental frequency based on the means and variances obtained in the training stage:
2. The voice conversion method based on convolutive non-negative matrix factorization according to claim 1, characterized in that, based on the pitch contour and with the pitch period length as the frame length, the time alignment of the voices is carried out as follows:
1) Framing stage
Starting from the 1st sampled point of the voice, the first frame is determined with the pitch period length at that point as the frame length; the sampled point immediately following the first frame is then taken as the start of the second frame, which is determined with the pitch period at its start as the frame length; by analogy, the start of each subsequent frame is obtained from the previous framing result, and its length is the pitch period at its starting point; this process is repeated until the end of the voice, yielding a number of frames;
After framing is completed, a speech data matrix is built whose columns are the frames, each column centred on its frame's centre point and of length equal to the longest frame length, and every column is windowed with a Hanning window; when building the matrix, frames of insufficient length at the beginning and end of the voice are padded using the voice start and end points;
The matrix is searched column by column, determining one point per column, so that the points form a pitch-mark track through all columns whose sum of point values is maximal; during the search the row-position difference of the selected points between adjacent columns is limited to at most 6 rows; this method yields the pitch mark points for pitch-synchronous overlap-add processing, and in voiced segments these mark points lie at amplitude maxima; the same method yields the pitch mark points of the other voice;
2) Matching stage
According to the phoneme segmentation information, the matching correspondence between pitch mark points in the source speaker's and target speaker's phonemes is established, where the two sides denote the pitch mark point information contained in the corresponding phoneme of the source and target voices respectively, in the following concrete form:
Here the elements are the pitch mark points within the corresponding phoneme of the source speaker's and target speaker's voices, and the counts are the numbers of pitch mark points both contain in that phoneme;
3) Alignment stage
Based on the pitch mark point information of the matched phonemes in the training voices, the durations of corresponding source-speaker and target-speaker phonemes are aligned with the pitch-synchronous overlap-add method, the frame length of the pitch-synchronous overlap-add processing being taken as three times the pitch period corresponding to the current pitch mark point; in the alignment, the phoneme with the shorter duration in a matched pair serves as the reference, and the other phoneme is compressed by pitch-synchronous overlap-add to realize the alignment; since the pitch-synchronous overlap-add method adjusts duration in units of pitch periods, the adjustment precision is only guaranteed within one pitch period, so the residual difference of the current matched phoneme is carried over into the duration alignment of the next matched phoneme, and the silent segments between phonemes in the speech data are aligned by truncation;
3. The voice conversion method based on convolutive non-negative matrix factorization according to claim 1, characterized in that in the speech parameter decomposition of the training stage, the time-aligned training voices are decomposed with the STRAIGHT model, yielding three groups of parameters for the source speaker's voice and the target speaker's voice respectively:
1) The STRAIGHT spectrum characterizing the vocal-tract spectral features, a two-dimensional matrix:
Each row represents the STRAIGHT spectrum of one frame of voice and contains a fixed number of spectral points; the whole voice is divided into frames for analysis, with frame centres 10 ms apart; the source speaker's and target speaker's voice STRAIGHT spectra are denoted separately;
2) The fundamental frequency of the training voice, a sequence with one value per frame, each corresponding to the matching row of the STRAIGHT spectrum; the source speaker's and target speaker's voice fundamental frequencies are denoted separately;
4. The voice conversion method based on convolutive non-negative matrix factorization according to claim 1, characterized in that the analysis steps of the training stage are as follows:
1) The source speaker's STRAIGHT spectrum is analyzed with the convolutive non-negative matrix factorization method, giving the following decomposition result:
where the time-frequency bases form a matrix whose column vectors are frequency-domain basis vectors of the STRAIGHT spectrum; a fixed number of such basis vectors constitute one time-frequency base, and a fixed number of such time-frequency bases are obtained; the right-shift operator denotes shifting the encoding matrix to the right, in column-vector form, by the given number of columns, in the following concrete form:
If
where the symbols denote the column vectors of the encoding matrix, which contains a fixed total number of columns; then, for a non-zero shift:
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201110267425A CN102306492B (en) | 2011-09-09 | 2011-09-09 | Voice conversion method based on convolutive nonnegative matrix factorization |
Publications (2)
Publication Number | Publication Date |
---|---|
CN102306492A true CN102306492A (en) | 2012-01-04 |
CN102306492B CN102306492B (en) | 2012-09-12 |
Family
ID=45380342
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201110267425A Expired - Fee Related CN102306492B (en) | 2011-09-09 | 2011-09-09 | Voice conversion method based on convolutive nonnegative matrix factorization |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102306492B (en) |
2011-09-09: Application CN201110267425A filed in China, granted as patent CN102306492B; status: not active (Expired - Fee Related)
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101278282A (en) * | 2004-12-23 | 2008-10-01 | 剑桥显示技术公司 | Digital signal processing methods and apparatus |
CN101441872A (en) * | 2007-11-19 | 2009-05-27 | 三菱电机株式会社 | Denoising acoustic signals using constrained non-negative matrix factorization |
Non-Patent Citations (3)
Title |
---|
2009 Fifth International Conference on Natural Computation, 2009-12-31, Zhang Ye et al., "Blind Separation of Convolutive Mixed Source Signals by Using Robust Nonnegative Matrix Factorization" * |
Signal Processing (《信号处理》), Vol. 23, No. 4A, 2007-08-31, Min Gang et al., "Research on Speech Segmentation Algorithms in Segment Vocoders" * |
Computer Engineering and Design (《计算机工程与设计》), No. 1, 2011-01-31, Liu Boquan et al., "Blind Speech Separation Using Non-negative Matrix Factorization" * |
Cited By (20)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102610236A (en) * | 2012-02-29 | 2012-07-25 | 山东大学 | Method for improving voice quality of throat microphone |
CN102855884A (en) * | 2012-09-11 | 2013-01-02 | 中国人民解放军理工大学 | Speech time scale modification method based on short-term continuous nonnegative matrix decomposition |
CN102855884B (en) * | 2012-09-11 | 2014-08-13 | 中国人民解放军理工大学 | Speech time scale modification method based on short-term continuous nonnegative matrix decomposition |
CN103020017A (en) * | 2012-12-05 | 2013-04-03 | 湖州师范学院 | Non-negative matrix factorization method of popular regularization and authentication information maximization |
US10055479B2 (en) | 2015-01-12 | 2018-08-21 | Xerox Corporation | Joint approach to feature and document labeling |
CN105206257A (en) * | 2015-10-14 | 2015-12-30 | 科大讯飞股份有限公司 | Voice conversion method and device |
CN105206257B (en) * | 2015-10-14 | 2019-01-18 | 科大讯飞股份有限公司 | A kind of sound converting method and device |
CN105206259A (en) * | 2015-11-03 | 2015-12-30 | 常州工学院 | Voice conversion method |
CN105930308A (en) * | 2016-04-14 | 2016-09-07 | 中国科学院西安光学精密机械研究所 | Nonnegative matrix factorization method based on low-rank recovery |
CN105930308B (en) * | 2016-04-14 | 2019-01-15 | 中国科学院西安光学精密机械研究所 | The non-negative matrix factorization method restored based on low-rank |
CN107221321A (en) * | 2017-03-27 | 2017-09-29 | 杭州电子科技大学 | A kind of phonetics transfer method being used between any source and target voice |
CN107464569A (en) * | 2017-07-04 | 2017-12-12 | 清华大学 | Vocoder |
CN107785030A (en) * | 2017-10-18 | 2018-03-09 | 杭州电子科技大学 | A kind of phonetics transfer method |
CN107785030B (en) * | 2017-10-18 | 2021-04-30 | 杭州电子科技大学 | Voice conversion method |
CN109712634A (en) * | 2018-12-24 | 2019-05-03 | 东北大学 | A kind of automatic sound conversion method |
CN109767778A (en) * | 2018-12-27 | 2019-05-17 | 中国人民解放军陆军工程大学 | A kind of phonetics transfer method merging Bi-LSTM and WaveNet |
CN109767778B (en) * | 2018-12-27 | 2020-07-31 | 中国人民解放军陆军工程大学 | Bi-LSTM and WaveNet fused voice conversion method |
CN110148424A (en) * | 2019-05-08 | 2019-08-20 | 北京达佳互联信息技术有限公司 | Method of speech processing, device, electronic equipment and storage medium |
CN111899716A (en) * | 2020-08-03 | 2020-11-06 | 北京帝派智能科技有限公司 | Speech synthesis method and system |
CN112735434A (en) * | 2020-12-09 | 2021-04-30 | 中国人民解放军陆军工程大学 | Voice communication method and system with voiceprint cloning function |
Also Published As
Publication number | Publication date |
---|---|
CN102306492B (en) | 2012-09-12 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN102306492A (en) | Voice conversion method based on convolutive nonnegative matrix factorization | |
Han et al. | Semantic-preserved communication system for highly efficient speech transmission | |
CN109767778B (en) | Bi-LSTM and WaveNet fused voice conversion method | |
CN109147758A (en) | A kind of speaker's sound converting method and device | |
CN108777140A (en) | Phonetics transfer method based on VAE under a kind of training of non-parallel corpus | |
CN108847249A (en) | Sound converts optimization method and system | |
CN105654939B (en) | A kind of phoneme synthesizing method based on sound vector text feature | |
CN105869624A (en) | Method and apparatus for constructing speech decoding network in digital speech recognition | |
CN109272992A (en) | A kind of spoken language assessment method, device and a kind of device for generating spoken appraisal model | |
CN103035236B (en) | High-quality voice conversion method based on modeling of signal timing characteristics | |
CN104272382A (en) | Method and system for template-based personalized singing synthesis | |
CN110322900A (en) | A kind of method of phonic signal character fusion | |
CN111128211B (en) | Voice separation method and device | |
Cooper et al. | Can speaker augmentation improve multi-speaker end-to-end TTS? | |
CN106782599A (en) | The phonetics transfer method of post filtering is exported based on Gaussian process | |
Denisov et al. | End-to-end multi-speaker speech recognition using speaker embeddings and transfer learning | |
CN103413548B (en) | A kind of sound converting method of the joint spectrum modeling based on limited Boltzmann machine | |
Oura et al. | Deep neural network based real-time speech vocoder with periodic and aperiodic inputs | |
CN111488486B (en) | Electronic music classification method and system based on multi-sound-source separation | |
CN103886859B (en) | Phonetics transfer method based on one-to-many codebook mapping | |
Chen et al. | An investigation of implementation and performance analysis of DNN based speech synthesis system | |
CN107785030A (en) | A kind of phonetics transfer method | |
Raju et al. | Application of prosody modification for speech recognition in different emotion conditions | |
Asakawa et al. | Automatic recognition of connected vowels only using speaker-invariant representation of speech dynamics | |
Quillen | Autoregressive HMM speech synthesis |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee |
Granted publication date: 2012-09-12. Termination date: 2014-09-09 |
|
EXPY | Termination of patent right or utility model |