CN102306492A - Voice conversion method based on convolutive nonnegative matrix factorization - Google Patents


Info

Publication number
CN102306492A
Authority
CN
China
Prior art keywords
voice
straight
conversion
phoneme
point
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201110267425A
Other languages
Chinese (zh)
Other versions
CN102306492B (en)
Inventor
张雄伟
孙健
曹铁勇
孙新建
黄建军
杨吉斌
邹霞
贾冲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
PLA University of Science and Technology
Original Assignee
PLA University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by PLA University of Science and Technology filed Critical PLA University of Science and Technology
Priority to CN201110267425A priority Critical patent/CN102306492B/en
Publication of CN102306492A publication Critical patent/CN102306492A/en
Application granted granted Critical
Publication of CN102306492B publication Critical patent/CN102306492B/en
Expired - Fee Related
Anticipated expiration


Abstract

The invention discloses a voice conversion method based on convolutive non-negative matrix factorization. The method comprises the following steps: (1) training a transformation model from training data: carrying out time alignment and parameter decomposition of the training speech data, analyzing the STRAIGHT spectra with the convolutive non-negative matrix factorization method, and analyzing the fundamental frequencies of the source and target speech; (2) converting newly input speech based on the trained model: carrying out parameter decomposition of the source speech data $A_c$ to be converted with the STRAIGHT model, realizing the conversion of the vocal tract spectrum parameters based on convolutive non-negative matrix factorization, realizing the conversion of the fundamental frequency based on the means and variances obtained in the training stage, and synthesizing the converted speech from the converted STRAIGHT spectrum $S_{B_c}$, the converted fundamental frequency $f_{B_c}$ and the original aperiodic component $ap_{A_c}$. The invention improves the training effect of voice conversion and the speech quality of the converted speech.

Description

Voice conversion method based on convolutive non-negative matrix factorization
Technical field
The invention belongs to the field of speech processing technology, and particularly relates to a voice conversion method based on convolutive non-negative matrix factorization.
Background technology
Voice conversion is a technology that changes the personal characteristic information in a source speaker's speech signal, replacing it with the target speaker's personal voice characteristic information. Voice conversion has wide application prospects in personalized human-computer interaction, military affairs, information security and multimedia entertainment. For example, combined with a text-to-speech synthesis system, it can realize personalized speech synthesis; through voice conversion, an enemy commander's voice can be forged to send false information or orders and disrupt the enemy's operational command; voice conversion can also be used to reproduce the speeches of historical figures, and so on.
Voice conversion (Voice Conversion/Transformation) has a research history of more than 20 years (Li Bo, Wang Chengyou, Cai Xuanping, et al. A survey of voice conversion and related technologies [J]. Journal on Communications, 2004(05):109-118); the earliest method was proposed by Abe et al. in 1988. Existing voice conversion methods mainly include: methods based on vector quantization codebook mapping (1. M. Abe, S. Nakamura, K. Shikano, and H. Kuwabara, "Voice conversion through vector quantization," ICASSP-88, 1988, pp. 655-658.), methods based on Gaussian mixture models (2. Y. Stylianou, O. Cappé and E. Moulines, "Continuous probabilistic transform for voice conversion," IEEE Transactions on Speech and Audio Processing, vol. 6, pp. 131-142, 1998.), methods based on hidden Markov models (3. E. K. Kim, S. Lee and Y. H. Oh, "Hidden Markov Model Based Voice Conversion Using Dynamic Characteristics of Speaker," in Proc. Eurospeech, Rhodes, Greece, 1997, pp. 2519-2522.), methods based on frequency warping (4. D. Erro and A. Moreno, "Weighted Frequency Warping for Voice Conversion," in Interspeech 2007 - Eurospeech, Antwerp, Belgium, 2007.), and methods based on artificial neural networks (5. S. Desai, A. Black, B. Yegnanarayana, and K. Prahallad, "Spectral Mapping Using Artificial Neural Networks for Voice Conversion," IEEE Transactions on Audio, Speech, and Language Processing, 2010.).
Although a variety of voice conversion methods have been proposed, the effect of voice conversion is still far from practical requirements. The main problems of existing voice conversion methods are:
1. Many voice conversion methods are built on a framework in which the speech signal is divided into frames and each frame is processed independently. Under this framework the inter-frame correlation of speech is ignored, which causes discontinuities in the converted speech and reduces its quality; examples are the methods based on vector quantization codebook mapping, on Gaussian mixture models and on artificial neural networks.
2. The goal of voice conversion is to correctly convert the speaker's personal characteristic information in the speech; however, existing voice conversion methods do not separate the speaker's personal characteristic information from the speech signal before conversion, but process the speech signal directly. This not only makes the conversion effect unsatisfactory, but also changes other components of the speech signal, degrading the quality of the converted speech.
Convolutive non-negative matrix factorization (Convolutive Nonnegative Matrix Factorization) is a non-negative matrix factorization method proposed for speech signal processing. It uses two-dimensional time-frequency bases instead of the original one-dimensional basis vectors, so that, while guaranteeing the non-negativity of the decomposition result, it better captures the temporal correlation of the speech signal. The method has been applied successfully to the separation of multi-speaker speech (6. P. Smaragdis, "Convolutive Speech Bases and Their Application to Supervised Speech Separation," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, pp. 1-12, 2007.). It decomposes a speech signal into a set of non-negative time-frequency bases and the encoding matrix of this set of bases. The time-frequency bases obtained by the decomposition can be regarded as subspaces carrying the speaker characteristics, while the encoding matrix is the projection of the speech onto each subspace; this separation therefore largely realizes the function of separating the speaker characteristic information from the speech signal. In addition, compared with traditional non-negative matrix factorization, convolutive non-negative matrix factorization better takes the temporal correlation of the speech signal into account, thereby guaranteeing the continuity of the reconstructed speech.
However, the decomposition result of this method is not unique: the basis matrices obtained for the same speech data under different initial conditions differ. Although this can simply be regarded as different representations of the feature space, it limits the method's application in voice conversion.
Contents of the invention
The object of the invention is to provide a voice conversion method based on convolutive non-negative matrix factorization. Convolutive non-negative matrix factorization is used to separate the personal characteristic information in the speech vocal tract spectrum, and the temporal correlation of the speech is effectively preserved in the separation process; on the premise that the consistency of the convolutive non-negative matrix factorization results of the source speaker's and target speaker's spectra is guaranteed, the conversion of the vocal tract spectrum is completed by replacing the time-frequency basis. On this basis, voice conversion is realized, so that the converted speech has higher quality and a stronger similarity to the target speaker in personal voice characteristics.
The technical solution realizing the object of the invention is a voice conversion method based on convolutive non-negative matrix factorization, whose steps are as follows:
First, a transformation model is trained from the training data:
First step: time alignment and parameter decomposition of the training speech data. The training uses parallel speech data, i.e. pairs of utterances of identical content from the source and target speakers, where the source speaker's speech is denoted $A$ and the target speaker's speech is denoted $B$. First the pitch period envelopes $p_A$ and $p_B$ of both are extracted with the STRAIGHT model; then the pitch mark points $m_A$ and $m_B$ used for pitch-synchronous overlap-add (PSOLA) processing are computed from the pitch period envelopes and the original speech signals. According to the phoneme segmentation information, pitch mark matching is carried out between corresponding phonemes of $A$ and $B$; afterwards, taking the phoneme as the elementary unit, the time alignment of $A$ and $B$ is realized in a pitch-synchronous overlap-add manner based on the matched pitch marks, giving the time-aligned speech $\tilde{A}$ and $\tilde{B}$. The STRAIGHT model is then used to analyze $\tilde{A}$ and $\tilde{B}$, obtaining three groups of parameters:
(1) the STRAIGHT spectra $S_A$ and $S_B$ characterizing the vocal tract features;
(2) the fundamental frequencies $f_A$ and $f_B$;
(3) the aperiodic components $ap_A$ and $ap_B$;
Second step: the STRAIGHT spectra are analyzed with the convolutive non-negative matrix factorization method, i.e. first the STRAIGHT spectrum $S_A$ of $\tilde{A}$ is analyzed with the convolutive non-negative matrix factorization method, obtaining its time-frequency basis $W_A$ and encoding matrix $H$; afterwards the STRAIGHT spectrum $S_B$ of $\tilde{B}$ is analyzed by convolutive non-negative matrix factorization with its encoding matrix fixed to $H$, which yields its time-frequency basis $W_B$;
Third step: the fundamental frequencies of the source and target speech are analyzed, i.e. the fundamental frequency information $f_A$ and $f_B$ of $\tilde{A}$ and $\tilde{B}$ is analyzed, obtaining the means and variances of both: $\mu_A$, $\sigma_A^2$ and $\mu_B$, $\sigma_B^2$.
Secondly, newly input speech is converted based on the trained model:
First step: the source speech data $A_c$ to be converted is decomposed into parameters with the STRAIGHT model, obtaining three groups of parameters: its STRAIGHT spectrum $S_{A_c}$, fundamental frequency $f_{A_c}$ and aperiodic component $ap_{A_c}$;
Second step: the conversion of the vocal tract spectrum parameters is realized based on convolutive non-negative matrix factorization, i.e. $S_{A_c}$ is analyzed by convolutive non-negative matrix factorization with its time-frequency basis fixed to $W_A$, obtaining the corresponding encoding matrix $H_c$, after which the converted STRAIGHT spectrum is obtained by the following formula:
$$S_{B_c} = W_B \otimes H_c$$
where $S_{B_c}$ denotes the converted STRAIGHT spectrum and "$\otimes$" is the convolution operation;
Third step: based on the means and variances of the fundamental frequencies obtained in the training stage, the conversion of the fundamental frequency is realized:
$$f_{B_c} = \mu_B + \frac{\sigma_B}{\sigma_A}\,(f_{A_c} - \mu_A)$$
where $f_{B_c}$ denotes the converted fundamental frequency;
Fourth step: the converted speech is synthesized, i.e. the converted speech is synthesized from the converted STRAIGHT spectrum $S_{B_c}$, the converted fundamental frequency $f_{B_c}$ and the original aperiodic component $ap_{A_c}$.
Compared with the prior art, the present invention has the following remarkable advantages: (1) In the training stage, based on the phoneme information, the matching of the source speaker's speech and the target speaker's speech is realized with the pitch-synchronous overlap-add method, so that the matched speech has higher time-matching precision and speech quality, which improves the training effect of voice conversion; (2) the effective separation of the personal characteristic information in the vocal tract spectrum is realized by the convolutive non-negative matrix factorization method, so that the conversion process acts specifically on the personal characteristic information, improving the conversion effect. In addition, the convolutive non-negative matrix factorization method effectively preserves the temporal correlation of the vocal tract spectrum parameters, giving the reconstructed speech better continuity and improving the speech quality of the converted speech.
The present invention is described in further detail below in conjunction with the accompanying drawings.
Brief description of the drawings
Fig. 1 is a schematic diagram of the voice conversion method based on convolutive non-negative matrix factorization of the present invention.
Fig. 2 is a schematic diagram of the phoneme-based time alignment processing of the training speech.
Fig. 3 is a schematic diagram of speech pitch mark points.
Fig. 4 is a schematic diagram of the computation flow of the training-speech time-frequency bases based on convolutive non-negative matrix factorization.
Fig. 5 is a schematic diagram of a STRAIGHT-spectrum time-frequency basis composed of 40 sub-bases.
Fig. 6 is a schematic diagram of the spectrum conversion flow based on convolutive non-negative matrix factorization.
Embodiment
With reference to Fig. 1, the steps of the voice conversion method based on convolutive non-negative matrix factorization of the invention are as follows:
Training stage: a transformation model is trained from the training data.
First step: time alignment and parameter decomposition of the training speech data:
(1) Time alignment of the speech data, as shown in Fig. 2. First the source speaker's speech $A$ and the target speaker's speech $B$ in the training data set are analyzed with the STRAIGHT model to obtain the pitch period of each of their sampled points, i.e. the pitch period envelopes
$$p_A = [\,p_A(1),\, p_A(2),\, \ldots,\, p_A(N_A)\,]$$
$$p_B = [\,p_B(1),\, p_B(2),\, \ldots,\, p_B(N_B)\,]$$
where $N_A$ and $N_B$ denote the numbers of sampled points contained in the source speaker's speech $A$ and the target speaker's speech $B$ respectively.
The pitch period here is expressed as a number of sampled points, with the fractional part rounded. Since unvoiced and silent segments have no distinct pitch period, their pitch period is fixed to $\lfloor f_s / f_c \rfloor$ for a preset constant $f_c$, where $f_s$ is the speech sampling frequency and "$\lfloor x \rfloor$" denotes the largest integer not exceeding $x$. Based on the pitch period envelope, the speech is divided into frames, taking the pitch period length as the frame length, as sketched below.
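For illustration, the per-sample pitch-period computation just described might be sketched as follows in Python (all code in this document is an illustrative sketch, not the patent's implementation); the STRAIGHT analysis is assumed to have produced a per-sample F0 envelope f0_env in Hz with 0 marking unvoiced/silent samples, and unvoiced_f0 is a hypothetical stand-in for the fixed constant, whose exact value is not recoverable from the text:

    import numpy as np

    def pitch_periods_in_samples(f0_env, fs, unvoiced_f0=100.0):
        """Convert a per-sample F0 envelope (Hz, 0 where unvoiced/silent)
        into per-sample pitch periods expressed in whole samples.
        unvoiced_f0 is a hypothetical stand-in for the fixed constant the
        patent assigns to unvoiced and silent segments."""
        f0 = np.where(f0_env > 0, f0_env, unvoiced_f0)
        # pitch period = fs / f0, truncated to an integer number of samples
        return np.floor(fs / f0).astype(int)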
Taking the speech $A$ as an example, the framing steps are as follows. Starting from the 1st sampled point $A(1)$ of the speech, the first frame
$$F_1 = \{\,A(1),\, \ldots,\, A(L_1)\,\}, \qquad L_1 = p_A(1)$$
is determined, taking the pitch period length corresponding to the starting point as the frame length. Afterwards, taking the sampled point $A(L_1+1)$ of the speech as the starting position of the second frame, the second frame
$$F_2 = \{\,A(L_1+1),\, \ldots,\, A(L_1+L_2)\,\}, \qquad L_2 = p_A(L_1+1)$$
is determined, taking the pitch period corresponding to its starting point as the frame length. By analogy, for the $k$-th frame, the starting point $A(s_k)$ of the current frame, with $s_k = L_1 + \cdots + L_{k-1} + 1$, is obtained from the previous framing results, and the framing result of the current frame
$$F_k = \{\,A(s_k),\, \ldots,\, A(s_k + L_k - 1)\,\}, \qquad L_k = p_A(s_k)$$
is obtained, taking the pitch period length corresponding to its starting point as the frame length. This process is repeated until the end of the speech; suppose $M_A$ frames of speech are obtained.
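A minimal numpy sketch of this pitch-synchronous framing, assuming the per-sample periods from the previous snippet (the names starts/lengths are illustrative, not the patent's):

    import numpy as np

    def pitch_synchronous_frames(x, periods):
        """Split signal x into consecutive frames whose lengths follow the
        per-sample pitch-period envelope: each frame starts where the
        previous one ended, and its length is the period at its start."""
        starts, lengths = [], []
        s = 0
        while s < len(x):
            L = max(1, int(periods[s]))
            starts.append(s)
            lengths.append(min(L, len(x) - s))
            s += L
        return starts, lengths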
After the framing is completed, a speech data matrix $X$ of size $L_{\max} \times M_A$ is built, taking the longest frame length $L_{\max}$ as the column length and centering each frame on its center point; each column of $X$ is one frame of speech, and every column is windowed with a Hanning window. When building the matrix, frames at the start and end of the speech whose length is insufficient are padded with the speech start and end points respectively.
The matrix $X$ is searched column by column, determining one point in each column, so that the points form a pitch mark trajectory $m_A$ through all columns that maximizes the sum of the point values along the trajectory; during the search, the row difference between the points selected in adjacent columns is limited to at most 6 rows. By this method the pitch marks used for PSOLA processing are obtained; in voiced segments these marks lie at the amplitude maxima. Fig. 3 shows the pitch marks obtained for one speech segment by the above method. The same method yields the pitch marks $m_B$ of the speech $B$.
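The patent states only the objective (maximize the path sum under the 6-row constraint); one natural reading is a dynamic-programming search, sketched below as an assumption — the original may equally use a greedy column-by-column search:

    import numpy as np

    def pitch_mark_track(X, max_jump=6):
        """Dynamic-programming search for the trajectory (one row index per
        column of X) that maximizes the sum of selected values, with the row
        difference between adjacent columns limited to max_jump."""
        n_rows, n_cols = X.shape
        score = X[:, 0].astype(float).copy()
        back = np.zeros((n_rows, n_cols), dtype=int)
        for j in range(1, n_cols):
            new_score = np.full(n_rows, -np.inf)
            for r in range(n_rows):
                lo, hi = max(0, r - max_jump), min(n_rows, r + max_jump + 1)
                prev = lo + int(np.argmax(score[lo:hi]))
                new_score[r] = score[prev] + X[r, j]
                back[r, j] = prev
            score = new_score
        # backtrack from the best final row
        track = np.zeros(n_cols, dtype=int)
        track[-1] = int(np.argmax(score))
        for j in range(n_cols - 1, 0, -1):
            track[j - 1] = back[track[j], j]
        return track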
According to the phoneme segmentation information, the matching correspondence of the pitch marks in the source speaker's and target speaker's phonemes is established:
$$\{\,(P_A^{\,i},\, P_B^{\,i})\,\}$$
where $P_A^{\,i}$ and $P_B^{\,i}$ denote the pitch mark information contained in the $i$-th phoneme of the source speaker's and the target speaker's speech respectively, in the concrete form
$$P_A^{\,i} = [\,m_A^{\,i}(1),\, m_A^{\,i}(2),\, \ldots,\, m_A^{\,i}(n_A^{\,i})\,]$$
$$P_B^{\,i} = [\,m_B^{\,i}(1),\, m_B^{\,i}(2),\, \ldots,\, m_B^{\,i}(n_B^{\,i})\,]$$
Here $m_A^{\,i}(k)$ and $m_B^{\,i}(l)$ are respectively the $k$-th and $l$-th pitch marks in the $i$-th phoneme of the source speaker's and target speaker's speech, and $n_A^{\,i}$ and $n_B^{\,i}$ are respectively the numbers of pitch marks both contain in the $i$-th phoneme.
Based on the pitch mark information of the matched phonemes in the training speech $A$, the duration alignment of the corresponding source and target speaker phonemes is realized with the PSOLA method. The frame length for the PSOLA processing is taken as three times the pitch period corresponding to the current pitch mark. In the alignment procedure, the phoneme with the shorter duration in a matched pair is taken as the reference, and the other phoneme is compressed by the PSOLA method to realize the alignment. Since the PSOLA method adjusts duration in units of pitch periods, the adjustment precision can only be guaranteed to within one pitch period length; the residual difference left after adjusting the current matched phoneme pair is therefore carried into the duration alignment of the next matched phoneme pair. The silent segments between phonemes in the speech data are then aligned by truncation.
After each phoneme and the silent segments between phonemes in the speech $A$ and $B$ have been processed by the above steps, the time-aligned source speaker speech $\tilde{A}$ and target speaker speech $\tilde{B}$ are obtained.
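A toy sketch of PSOLA-style duration compression under the stated three-period frame length; boundary handling, voicing decisions and the carry-over of the residual duration error into the next phoneme are deliberately omitted, so this only illustrates the overlap-add idea, not the patent's full procedure:

    import numpy as np

    def psola_compress(x, marks, n_out):
        """Resample the analysis pitch marks down to n_out synthesis marks
        and overlap-add Hanning-windowed grains of three local pitch
        periods centered on each kept mark."""
        marks = np.asarray(marks, dtype=int)
        local_T = np.diff(marks, append=2 * marks[-1] - marks[-2])
        idx = np.linspace(0, len(marks) - 1, n_out).round().astype(int)
        keep, T = marks[idx], np.maximum(local_T[idx], 1)
        out_marks = np.cumsum(T) + 2 * T.max()   # period-spaced synthesis marks
        y = np.zeros(int(out_marks[-1] + 3 * T.max()))
        for m_in, m_out, Ti in zip(keep, out_marks, T):
            half = int(1.5 * Ti)                 # grain = 3 pitch periods
            lo, hi = max(0, m_in - half), min(len(x), m_in + half)
            o_lo = m_out - (m_in - lo)
            y[o_lo:o_lo + (hi - lo)] += x[lo:hi] * np.hanning(hi - lo)
        return y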
(2) Speech parameter decomposition. The time-aligned training speech is decomposed into parameters with the STRAIGHT model. The decomposition yields three groups of parameters for each of the source speaker speech $\tilde{A}$ and the target speaker speech $\tilde{B}$:
a) The STRAIGHT spectrum characterizing the vocal tract spectral features, a two-dimensional $K \times M$ matrix each column of which represents the STRAIGHT spectrum of one speech frame and contains $K$ spectrum points ($K$ is fixed by the STRAIGHT analysis settings); the whole speech segment is divided into $M$ frames for analysis, with adjacent frame centers 10 ms apart. Here $S_A$ denotes the source speaker's STRAIGHT spectrum and $S_B$ the target speaker's STRAIGHT spectrum;
b) The fundamental frequency of the training speech, $f = [\,f(1),\, \ldots,\, f(M)\,]$, where $f(m)$ is the fundamental frequency of the $m$-th frame of the speech, corresponding to the $m$-th column of the STRAIGHT spectrum. Here $f_A$ denotes the source speaker's fundamental frequency and $f_B$ the target speaker's fundamental frequency;
c) The aperiodic component, a matrix characterizing the aperiodic information of the speech; its influence on the converted speech is small, so it is not subjected to conversion processing.
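STRAIGHT itself is distributed as a MATLAB toolkit; as an illustrative stand-in (an assumption, not the patent's implementation), the same three-way decomposition can be sketched with the WORLD vocoder's Python bindings (pyworld), a close successor of STRAIGHT; "source.wav" is a hypothetical input file:

    import numpy as np
    import pyworld as pw        # WORLD vocoder bindings, a STRAIGHT successor
    import soundfile as sf

    x, fs = sf.read("source.wav")                # hypothetical input file
    x = np.ascontiguousarray(x, dtype=np.float64)
    f0, t = pw.dio(x, fs, frame_period=10.0)     # raw F0, 10 ms frame shift
    f0 = pw.stonemask(x, f0, t, fs)              # F0 refinement
    S = pw.cheaptrick(x, f0, t, fs)              # spectral envelope (frames x bins)
    ap = pw.d4c(x, f0, t, fs)                    # aperiodicity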
Second step: the speech STRAIGHT spectra are analyzed with the convolutive non-negative matrix factorization method to obtain the time-frequency bases of the source and target speakers' STRAIGHT spectra, as shown in Fig. 4. The concrete steps of the analysis are as follows:
(1) The source speaker's STRAIGHT spectrum $S_A$ is analyzed with the convolutive non-negative matrix factorization method, giving the decomposition
$$S_A \approx \sum_{t=0}^{T-1} W_A(t)\; \overrightarrow{H}^{\,t}$$
where $W_A$ is the time-frequency basis of $S_A$; concretely, each $W_A(t)$ is a $K \times R$ matrix whose column vectors are frequency-domain basis vectors of the STRAIGHT spectrum; $T$ such basis vectors (one per shift $t$) constitute one time-frequency basis, and $R$ such time-frequency bases are obtained ($T$ and $R$ are taken as fixed constants of the analysis). $\overrightarrow{H}^{\,t}$ denotes shifting the encoding matrix $H$ to the right by $t$ units in column-vector form, concretely as follows:
Let
$$H = [\,h_1,\, h_2,\, \ldots,\, h_M\,]$$
where $h_j$ is the $j$-th column vector of the encoding matrix $H$, and $H$ contains $M$ column vectors in total. Then, when $t < M$:
$$\overrightarrow{H}^{\,t} = [\,\underbrace{0,\ldots,0}_{t},\; h_1,\, h_2,\, \ldots,\, h_{M-t}\,]$$
and when $t \ge M$:
$$\overrightarrow{H}^{\,t} = [\,0,\, 0,\, \ldots,\, 0\,]$$
where "$0$" is the all-zero column vector.
$W_A$ and $H$ are computed by the following iterative process (the update and error formulas below are reconstructed in the standard KL-divergence form of the cited convolutive-NMF literature):
a) randomly initialize $W_A$ and $H$;
b) compute the reconstruction of $S_A$ by the following formula:
$$\Lambda = \sum_{t=0}^{T-1} W_A(t)\; \overrightarrow{H}^{\,t}$$
c) update the time-frequency basis based on $\Lambda$; the update is computed successively for $t = 0, 1, \ldots, T-1$:
$$W_A(t) \leftarrow W_A(t) \odot \frac{\bigl[S_A / \Lambda\bigr]\,\bigl(\overrightarrow{H}^{\,t}\bigr)^{\top}}{\mathbf{1}\,\bigl(\overrightarrow{H}^{\,t}\bigr)^{\top}}$$
where "$\odot$" denotes element-wise multiplication between two matrices, "$/$" element-wise division, and $\mathbf{1}$ a $K \times M$ matrix whose elements are all 1.
After the update of the time-frequency basis is completed, the encoding matrix is updated by the following formula:
$$H \leftarrow H \odot \frac{\sum_{t=0}^{T-1} W_A(t)^{\top}\, \overleftarrow{\bigl[S_A/\Lambda\bigr]}^{\,t}}{\sum_{t=0}^{T-1} W_A(t)^{\top}\, \mathbf{1}}$$
where $\overleftarrow{\,\cdot\,}^{\,t}$ denotes the corresponding left shift by $t$ columns.
d) judge whether the number of iterations has reached the maximum of 300 iterations, or the speech reconstruction error has fallen below $10^{-5}$; the reconstruction error is determined by the following formula:
$$D\bigl(S_A \,\|\, \Lambda\bigr) = \sum_{k,m} \Bigl( S_A(k,m)\,\ln\frac{S_A(k,m)}{\Lambda(k,m)} - S_A(k,m) + \Lambda(k,m) \Bigr)$$
When neither of the two conditions is satisfied, return to step b) and continue the iteration; otherwise terminate the iteration loop and enter the next step e).
e) obtain the final decomposition result: $W_A$ and $H$.
Fig. 5 is a schematic diagram of the time-frequency basis obtained by decomposing one segment of STRAIGHT spectrum with the above method.
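A compact numpy sketch of this iterative decomposition, reusing shift() from the earlier snippet; it follows the KL-divergence multiplicative updates of the cited convolutive-NMF formulation, and the update_W/update_H flags anticipate the fixed-matrix analyses used below (an illustrative implementation, not the patent's code):

    import numpy as np

    def cnmf(S, R, T, n_iter=300, tol=1e-5, W=None, H=None,
             update_W=True, update_H=True, seed=0):
        """Convolutive NMF: S (K x M, nonnegative) ~ sum_t W[t] @ shift(H, t),
        with W of shape (T, K, R) and H of shape (R, M)."""
        rng = np.random.default_rng(seed)
        K, M = S.shape
        W = rng.random((T, K, R)) + 1e-3 if W is None else W
        H = rng.random((R, M)) + 1e-3 if H is None else H
        ones = np.ones((K, M))
        for _ in range(n_iter):
            Lam = sum(W[t] @ shift(H, t) for t in range(T)) + 1e-12
            Q = S / Lam
            if update_W:
                for t in range(T):
                    Ht = shift(H, t)
                    W[t] *= (Q @ Ht.T) / (ones @ Ht.T + 1e-12)
                Lam = sum(W[t] @ shift(H, t) for t in range(T)) + 1e-12
                Q = S / Lam
            if update_H:
                num = sum(W[t].T @ shift(Q, -t) for t in range(T))
                den = sum(W[t].T @ ones for t in range(T)) + 1e-12
                H *= num / den
            err = np.sum(S * np.log((S + 1e-12) / Lam) - S + Lam)  # KL error
            if err < tol:
                break
        return W, H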
The time-frequency basis $W_A$ obtained after the decomposition can be regarded as a feature subspace of the source speaker's STRAIGHT spectrum, carrying the personal characteristic information of the source speaker's vocal tract spectrum, while the encoding matrix $H$ is the projection of the spectrum onto the subspace $W_A$, carrying the variation of the time-frequency basis over time. Since the source speaker's and target speaker's speech in the training set has been accurately time-aligned, it can be considered that the two differ only in the speaker characteristic information; therefore, after the non-negative matrix factorization, the encoding matrices of both are identical.
(2) The target speaker's STRAIGHT spectrum $S_B$ is analyzed with the convolutive non-negative matrix factorization method; the analysis method is the same as in (1), but the encoding matrix is now fixed to $H$, which yields the time-frequency basis $W_B$ of $S_B$.
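Under these assumptions, the two training-stage decompositions chain together as below; R and T are illustrative sizes, not values fixed by the text (Fig. 5's 40 sub-bases only suggest the order of magnitude):

    # 1) free decomposition of the aligned source spectrum S_A
    W_A, H = cnmf(S_A, R=40, T=8)
    # 2) decomposition of the aligned target spectrum S_B with H frozen,
    #    so that only the speaker-dependent basis differs
    W_B, _ = cnmf(S_B, R=40, T=8, H=H, update_H=False)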
Third step: analyze the fundamental frequency parameters of the source and target speech. The first- and second-order statistics of the fundamental frequencies $f_A$ and $f_B$ in the source and target speakers' training speech, obtained through the STRAIGHT model, are analyzed, i.e. the means and variances $\mu_A$, $\sigma_A^2$ and $\mu_B$, $\sigma_B^2$.
Conversion stage: newly input speech is converted based on the trained model.
First step: the source speech data $A_c$ to be converted is decomposed into parameters with the STRAIGHT model (the parameter decomposition method is the same as in the training stage), obtaining three groups of parameters: its STRAIGHT spectrum $S_{A_c}$, fundamental frequency $f_{A_c}$ and aperiodic component $ap_{A_c}$;
Second step: the conversion of the vocal tract spectrum parameters is realized based on convolutive non-negative matrix factorization, as shown in Fig. 6. First $S_{A_c}$ is analyzed by convolutive non-negative matrix factorization. The analysis method is the same as method (1) of the second step of the training stage, but the time-frequency basis is now fixed to the $W_A$ obtained in the training stage, so the corresponding encoding matrix $H_c$ is obtained. From the foregoing analysis, the speaker's personal characteristic information is carried in the time-frequency basis; therefore, to realize the conversion, $W_B$ is used to replace $W_A$ and is then convolved with the encoding matrix $H_c$, giving the converted STRAIGHT spectrum, as shown in the following formula:
$$S_{B_c} = W_B \otimes H_c = \sum_{t=0}^{T-1} W_B(t)\; \overrightarrow{H_c}^{\,t}$$
where $S_{B_c}$ denotes the converted STRAIGHT spectrum and "$\otimes$" is the convolution operation.
Third step: realize the conversion of the fundamental frequency. The fundamental frequency to be converted is converted according to the following formula, using the means and variances of the source and target speakers' fundamental frequencies obtained in the training stage:
$$f_{B_c} = \mu_B + \frac{\sigma_B}{\sigma_A}\,\bigl(f_{A_c} - \mu_A\bigr)$$
where $f_{B_c}$ denotes the converted fundamental frequency.
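A minimal sketch of this mean/variance matching, assuming f0 = 0 marks unvoiced frames; the linear-domain form matches the formula above, though log-domain statistics are an equally common reading of such conversions:

    import numpy as np

    def convert_f0(f0_src, mu_A, var_A, mu_B, var_B):
        """Mean/variance matching of F0, applied to voiced frames only
        (unvoiced frames, f0 = 0, are kept as-is)."""
        f0 = np.asarray(f0_src, dtype=float).copy()
        voiced = f0 > 0
        f0[voiced] = mu_B + np.sqrt(var_B / var_A) * (f0[voiced] - mu_A)
        return np.maximum(f0, 0.0)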
Fourth step: synthesize the converted speech. Using the converted STRAIGHT spectrum $S_{B_c}$, the converted fundamental frequency $f_{B_c}$, and the aperiodic component $ap_{A_c}$ obtained during the signal decomposition, the converted speech data is obtained according to the STRAIGHT speech synthesis algorithm:
$$B_c = \mathrm{STRAIGHT}^{-1}\bigl(S_{B_c},\, f_{B_c},\, ap_{A_c}\bigr)$$
where $\mathrm{STRAIGHT}^{-1}(\cdot)$ denotes the STRAIGHT speech synthesis algorithm and $B_c$ is the converted speech data.
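With the WORLD stand-in from the analysis sketch (again an assumption, not the patent's synthesizer), resynthesis closes the loop; S_Bc is kept frequency-major above, while pyworld expects (frames x bins) float64 arrays:

    import numpy as np
    import pyworld as pw
    import soundfile as sf

    sp = np.ascontiguousarray(S_Bc.T, dtype=np.float64)
    y = pw.synthesize(f0_Bc.astype(np.float64), sp, ap_Ac, fs, frame_period=10.0)
    sf.write("converted.wav", y, fs)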

Claims (4)

1. A voice conversion method based on convolutive non-negative matrix factorization, characterized in that the steps are as follows:
First, a transformation model is trained from the training data:
First step: time alignment and parameter decomposition of the training speech data. The training uses parallel speech data, i.e. pairs of utterances of identical content from the source and target speakers, where the source speaker's speech is denoted $A$ and the target speaker's speech is denoted $B$; first the pitch period envelopes $p_A$ and $p_B$ of both are extracted with the STRAIGHT model, after which the pitch mark points $m_A$ and $m_B$ used for pitch-synchronous overlap-add processing are computed from the pitch period envelopes and the original speech signals; according to the phoneme segmentation information, pitch mark matching is carried out between corresponding phonemes of $A$ and $B$, after which, taking the phoneme as the elementary unit, the time alignment of $A$ and $B$ is realized in a pitch-synchronous overlap-add manner based on the matched pitch marks, giving the time-aligned speech $\tilde{A}$ and $\tilde{B}$; the STRAIGHT model is used to analyze $\tilde{A}$ and $\tilde{B}$, obtaining three groups of parameters:
(1) the STRAIGHT spectra $S_A$ and $S_B$ characterizing the vocal tract features;
(2) the fundamental frequencies $f_A$ and $f_B$;
(3) the aperiodic components $ap_A$ and $ap_B$;
Second step: the STRAIGHT spectra are analyzed with the convolutive non-negative matrix factorization method, i.e. first the STRAIGHT spectrum $S_A$ of $\tilde{A}$ is analyzed with the convolutive non-negative matrix factorization method, obtaining its time-frequency basis $W_A$ and encoding matrix $H$; afterwards the STRAIGHT spectrum $S_B$ of $\tilde{B}$ is analyzed by convolutive non-negative matrix factorization with its encoding matrix fixed to $H$, obtaining its time-frequency basis $W_B$;
Third step: the fundamental frequencies of the source and target speech are analyzed, i.e. the fundamental frequency information $f_A$ and $f_B$ of $\tilde{A}$ and $\tilde{B}$ is analyzed, obtaining the means and variances of both: $\mu_A$, $\sigma_A^2$ and $\mu_B$, $\sigma_B^2$;
Secondly, newly input speech is converted based on the trained model:
First step: the source speech data $A_c$ to be converted is decomposed into parameters with the STRAIGHT model, obtaining three groups of parameters: its STRAIGHT spectrum $S_{A_c}$, fundamental frequency $f_{A_c}$ and aperiodic component $ap_{A_c}$;
Second step: the conversion of the vocal tract spectrum parameters is realized based on convolutive non-negative matrix factorization, i.e. $S_{A_c}$ is analyzed by convolutive non-negative matrix factorization with its time-frequency basis fixed to $W_A$, obtaining the corresponding encoding matrix $H_c$, after which the converted STRAIGHT spectrum is obtained by the following formula:
$$S_{B_c} = W_B \otimes H_c$$
where $S_{B_c}$ denotes the converted STRAIGHT spectrum and "$\otimes$" is the convolution operation;
Third step: based on the means and variances of the fundamental frequencies obtained in the training stage, the conversion of the fundamental frequency is realized:
$$f_{B_c} = \mu_B + \frac{\sigma_B}{\sigma_A}\,(f_{A_c} - \mu_A)$$
where $f_{B_c}$ denotes the converted fundamental frequency;
Fourth step: the converted speech is synthesized, i.e. the converted speech is synthesized from the converted STRAIGHT spectrum $S_{B_c}$, the fundamental frequency $f_{B_c}$ and the original aperiodic component $ap_{A_c}$.
2. The voice conversion method based on convolutive non-negative matrix factorization according to claim 1, characterized in that, based on the pitch period envelope and taking the pitch period length as the frame length, the time alignment of the speech $A$ is carried out as follows:
1) Framing stage
Starting from the 1st sampled point $A(1)$ of the speech, the first frame $F_1 = \{A(1), \ldots, A(L_1)\}$ is determined, taking the corresponding pitch period length $L_1 = p_A(1)$ as the frame length; afterwards, taking the sampled point $A(L_1+1)$ of the speech as the starting position of the second frame, the second frame $F_2 = \{A(L_1+1), \ldots, A(L_1+L_2)\}$ is determined, taking the pitch period $L_2 = p_A(L_1+1)$ corresponding to it as the frame length; by analogy, for the $k$-th frame, the starting point $A(s_k)$ of the current frame, with $s_k = L_1 + \cdots + L_{k-1} + 1$, is obtained from the previous framing results, and the framing result $F_k = \{A(s_k), \ldots, A(s_k+L_k-1)\}$ of the current frame is obtained, taking the corresponding pitch period length $L_k = p_A(s_k)$ as the frame length; this process is repeated until the end of the speech, and suppose $M_A$ frames of speech are obtained;
After the framing is completed, a speech data matrix $X$ of size $L_{\max} \times M_A$ is built, taking the longest frame length $L_{\max}$ as the column length and centering each frame on its center point; each column of $X$ is one frame of speech, and every column is windowed with a Hanning window; when building the matrix, frames at the start and end of the speech whose length is insufficient are padded with the speech start and end points respectively;
The matrix $X$ is searched column by column, determining one point in each column so that the points form a pitch mark trajectory $m_A$ through all columns that maximizes the sum of the point values along the trajectory; during the search, the row difference between the points selected in adjacent columns is limited to at most 6 rows; by this method the pitch marks used for pitch-synchronous overlap-add processing are obtained, and in voiced segments these marks lie at the amplitude maxima; the same method yields the pitch marks $m_B$ of the speech $B$;
2) Matching stage
According to the phoneme segmentation information, the matching correspondence of the pitch marks in the source speaker's and target speaker's phonemes is established: $\{(P_A^{\,i}, P_B^{\,i})\}$, where $P_A^{\,i}$ and $P_B^{\,i}$ denote the pitch mark information contained in the $i$-th phoneme of the source speaker's and target speaker's speech respectively, in the concrete form
$$P_A^{\,i} = [\,m_A^{\,i}(1),\, \ldots,\, m_A^{\,i}(n_A^{\,i})\,]$$
$$P_B^{\,i} = [\,m_B^{\,i}(1),\, \ldots,\, m_B^{\,i}(n_B^{\,i})\,]$$
Here $m_A^{\,i}(k)$ and $m_B^{\,i}(l)$ are respectively the $k$-th and $l$-th pitch marks in the $i$-th phoneme of the source speaker's and target speaker's speech, and $n_A^{\,i}$ and $n_B^{\,i}$ are respectively the numbers of pitch marks both contain in the $i$-th phoneme;
3) Alignment stage
Based on the pitch mark information of the matched phonemes in the training speech $A$, the duration alignment of the corresponding source and target speaker phonemes is realized with the pitch-synchronous overlap-add method; the frame length of the pitch-synchronous overlap-add processing is taken as three times the pitch period corresponding to the current pitch mark; in the alignment procedure, the phoneme with the shorter duration in a matched pair is taken as the reference, and the other phoneme is compressed by the pitch-synchronous overlap-add method to realize the alignment; since the PSOLA method adjusts duration in units of pitch periods, the adjustment precision can only be guaranteed to within one pitch period length, so the residual difference left after adjusting the current matched phoneme pair is carried into the duration alignment of the next matched phoneme pair; the silent segments between phonemes in the speech data are then aligned by truncation;
After each phoneme and the silent segments between phonemes in the speech $A$ and $B$ have been processed by the above steps, the time-aligned source speaker speech $\tilde{A}$ and target speaker speech $\tilde{B}$ are obtained.
3. The voice conversion method based on convolutive non-negative matrix factorization according to claim 1, characterized in that, in the speech parameter decomposition of the training stage, the time-aligned training speech is decomposed into parameters with the STRAIGHT model, the decomposition yielding three groups of parameters for each of the source speaker speech $\tilde{A}$ and the target speaker speech $\tilde{B}$:
1) The STRAIGHT spectrum characterizing the vocal tract spectral features, a two-dimensional $K \times M$ matrix each column of which represents the STRAIGHT spectrum of one speech frame and contains $K$ spectrum points; the whole speech segment is divided into $M$ frames for analysis, with adjacent frame centers 10 ms apart; here $S_A$ denotes the source speaker's STRAIGHT spectrum and $S_B$ the target speaker's STRAIGHT spectrum;
2) The fundamental frequency of the training speech, $f = [\,f(1),\, \ldots,\, f(M)\,]$, where $f(m)$ is the fundamental frequency of the $m$-th frame of the speech, corresponding to the $m$-th column of the STRAIGHT spectrum; here $f_A$ denotes the source speaker's fundamental frequency and $f_B$ the target speaker's fundamental frequency;
3) The aperiodic component $ap$, a matrix characterizing the aperiodic information of the speech; its influence on the converted speech is small, so it is not subjected to conversion processing.
4. The voice conversion method based on convolutive non-negative matrix factorization according to claim 1, characterized in that the analysis steps of the training stage are as follows:
1) The source speaker's STRAIGHT spectrum $S_A$ is analyzed with the convolutive non-negative matrix factorization method, obtaining the following decomposition result:
$$S_A \approx \sum_{t=0}^{T-1} W_A(t)\; \overrightarrow{H}^{\,t}$$
where $W_A$ is the time-frequency basis of $S_A$; concretely, each $W_A(t)$ is a $K \times R$ matrix whose column vectors are frequency-domain basis vectors of the STRAIGHT spectrum; $T$ such basis vectors constitute one time-frequency basis, and $R$ such time-frequency bases are obtained, $T$ and $R$ being taken as fixed constants; $\overrightarrow{H}^{\,t}$ denotes shifting the encoding matrix $H$ to the right by $t$ units in column-vector form, concretely as follows:
Let
$$H = [\,h_1,\, h_2,\, \ldots,\, h_M\,]$$
where $h_j$ is the $j$-th column vector of the encoding matrix $H$, and $H$ contains $M$ column vectors in total; then, when $t < M$:
$$\overrightarrow{H}^{\,t} = [\,\underbrace{0,\ldots,0}_{t},\; h_1,\, \ldots,\, h_{M-t}\,]$$
and when $t \ge M$:
$$\overrightarrow{H}^{\,t} = [\,0,\, 0,\, \ldots,\, 0\,]$$
where "$0$" is the all-zero column vector;
2) The target speaker's STRAIGHT spectrum $S_B$ is analyzed with the convolutive non-negative matrix factorization method; the analysis method is the same as in 1), but the encoding matrix is now fixed to $H$, which yields the time-frequency basis $W_B$ of $S_B$.
CN201110267425A 2011-09-09 2011-09-09 Voice conversion method based on convolutive nonnegative matrix factorization Expired - Fee Related CN102306492B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201110267425A CN102306492B (en) 2011-09-09 2011-09-09 Voice conversion method based on convolutive nonnegative matrix factorization


Publications (2)

Publication Number Publication Date
CN102306492A true CN102306492A (en) 2012-01-04
CN102306492B CN102306492B (en) 2012-09-12

Family

ID=45380342

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201110267425A Expired - Fee Related CN102306492B (en) 2011-09-09 2011-09-09 Voice conversion method based on convolutive nonnegative matrix factorization

Country Status (1)

Country Link
CN (1) CN102306492B (en)



Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101278282A (en) * 2004-12-23 2008-10-01 剑桥显示技术公司 Digital signal processing methods and apparatus
CN101441872A (en) * 2007-11-19 2009-05-27 三菱电机株式会社 Denoising acoustic signals using constrained non-negative matrix factorization

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Zhang Ye et al., "Blind Separation of Convolutive Mixed Source Signals by Using Robust Nonnegative Matrix Factorization," 2009 Fifth International Conference on Natural Computation, Dec. 2009. *
Min Gang et al., "Research on speech segmentation algorithms in segmented vocoders" [分段声码器中的语音分段算法研究], Signal Processing (《信号处理》), vol. 23, no. 4A, Aug. 2007. *
Liu Boquan et al., "Blind separation of speech using non-negative matrix factorization" [采用非负矩阵分解的语音盲分离], Computer Engineering and Design (《计算机工程与设计》), no. 1, Jan. 2011. *

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102610236A (en) * 2012-02-29 2012-07-25 山东大学 Method for improving voice quality of throat microphone
CN102855884A (en) * 2012-09-11 2013-01-02 中国人民解放军理工大学 Speech time scale modification method based on short-term continuous nonnegative matrix decomposition
CN102855884B (en) * 2012-09-11 2014-08-13 中国人民解放军理工大学 Speech time scale modification method based on short-term continuous nonnegative matrix decomposition
CN103020017A (en) * 2012-12-05 2013-04-03 湖州师范学院 Non-negative matrix factorization method of popular regularization and authentication information maximization
US10055479B2 (en) 2015-01-12 2018-08-21 Xerox Corporation Joint approach to feature and document labeling
CN105206257A (en) * 2015-10-14 2015-12-30 科大讯飞股份有限公司 Voice conversion method and device
CN105206257B (en) * 2015-10-14 2019-01-18 科大讯飞股份有限公司 A kind of sound converting method and device
CN105206259A (en) * 2015-11-03 2015-12-30 常州工学院 Voice conversion method
CN105930308A (en) * 2016-04-14 2016-09-07 中国科学院西安光学精密机械研究所 Nonnegative matrix factorization method based on low-rank recovery
CN105930308B (en) * 2016-04-14 2019-01-15 中国科学院西安光学精密机械研究所 The non-negative matrix factorization method restored based on low-rank
CN107221321A (en) * 2017-03-27 2017-09-29 杭州电子科技大学 A kind of phonetics transfer method being used between any source and target voice
CN107464569A (en) * 2017-07-04 2017-12-12 清华大学 Vocoder
CN107785030A (en) * 2017-10-18 2018-03-09 杭州电子科技大学 A kind of phonetics transfer method
CN107785030B (en) * 2017-10-18 2021-04-30 杭州电子科技大学 Voice conversion method
CN109712634A (en) * 2018-12-24 2019-05-03 东北大学 A kind of automatic sound conversion method
CN109767778A (en) * 2018-12-27 2019-05-17 中国人民解放军陆军工程大学 A kind of phonetics transfer method merging Bi-LSTM and WaveNet
CN109767778B * 2018-12-27 2020-07-31 中国人民解放军陆军工程大学 Bi-LSTM and WaveNet fused voice conversion method
CN110148424A (en) * 2019-05-08 2019-08-20 北京达佳互联信息技术有限公司 Method of speech processing, device, electronic equipment and storage medium
CN111899716A (en) * 2020-08-03 2020-11-06 北京帝派智能科技有限公司 Speech synthesis method and system
CN112735434A (en) * 2020-12-09 2021-04-30 中国人民解放军陆军工程大学 Voice communication method and system with voiceprint cloning function

Also Published As

Publication number Publication date
CN102306492B (en) 2012-09-12


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20120912

Termination date: 20140909

EXPY Termination of patent right or utility model