CN102306492A

CN102306492A - Voice conversion method based on convolutive nonnegative matrix factorization

Info

Publication number: CN102306492A
Application number: CN201110267425A
Authority: CN
Inventors: 张雄伟; 孙健; 曹铁勇; 孙新建; 黄建军; 杨吉斌; 邹霞; 贾冲
Original assignee: PLA University of Science and Technology
Current assignee: PLA University of Science and Technology
Priority date: 2011-09-09
Filing date: 2011-09-09
Publication date: 2012-01-04
Anticipated expiration: 2031-09-09
Also published as: CN102306492B

Abstract

The invention discloses a voice conversion method based on convolutive nonnegative matrix factorization. The method comprises the following steps: (1) training a transformation model through training data: carrying out time calibration and parameter decomposition of training voice data, analyzing STRAIGHT spectrum by using a convolutive nonnegative matrix factorization method, and analyzing pitch frequency of source voice and object voice; (2) converting new input voice based on a training model: carrying out parameter decomposition on source voice data A[c] to be converted by employing a STRAIGHT model, realizing sound channel frequency spectrum parameter conversion based on convolutive nonnegative matrix factorization, realizing conversion of the pitch frequency based on obtained mean value and variance in a training phase, and synthesizing voice after conversion, wherein the voice is voice after synthesis and conversion of the STRAIGHT spectrum S[Bc] which is obtained through conversion, the pitch frequency f[Bc] and original aperiodic component ap[Ac]. According to the invention, training effect of voice conversion is improved, and voice quality of conversion voice is improved.

Description

Phonetics transfer method based on convolution Non-negative Matrix Factorization

Technical field

The invention belongs to voice process technology field, particularly a kind of phonetics transfer method based on convolution Non-negative Matrix Factorization.

Background technology

Voice conversion is a kind of personal characteristic information by changing in the speaker's voice signal of source, with the technology of target speaker's vocie personal characteristic information.Voice conversion suffers from wide application prospect in personalized man-machine interaction, military struggle, information security and multimedia recreation field.For example, being combined by digitizing the speech into speech synthesis system, you can realize that personalized speech is synthesized；Changed by voice, enemy commander's sound can be forged to send disinformation or order, enemy's operational commanding is upset；Given a lecture by the reproducible historical personage of voice conversion etc..

Voice is changed（Voice Conversion/Transformation）The technical research history of 20 years so far（Li Bo, Wang Chengyou, Cai Xuanping wait voices to change and correlation technique summary [J] communication journals, 2004 (05):109-118.）, its earliest method is to be proposed by Abe et al. in 1988.Existing phonetics transfer method mainly includes：The method mapped based on vector quantization yardage（1. M. Abe, S. Nakamura, K. Shikano, and H. Kuwabara, "Voice conversion through vector quantization," ICASSP-88., 1988, pp. 655-658.）, method based on gauss hybrid models（2. Y. Stylianou, O. Cappe and E. Moulines, "Continuous probabilistic transform for voice conversion," Speech and Audio Processing, IEEE Transactions on, vol. 6, pp. 131-142, 1998.）, method based on HMM（3. E. K. Kim, S. Lee and Y. H. Oh, "Hidden Markov Model Based Voice Conversion Using Dynamic Characteristics of Speaker," in Proc. Eurospeech, Rhodes, Greece, 1997, pp. 2519-2522.）, method based on frequency warping（4. D. Erro and A. Moreno, "Weighted Frequency Warping for Voice Conversion," in InterSpeech 2007 - EuroSpeech Antwerp, Belgium, 2007.）With the method based on artificial neural network（5. S. Desai, A. Black, B. Yegnanarayana, and K. Prahallad, "Spectral Mapping Using Artificial Neural Networks for Voice Conversion," Audio, Speech, and Language Processing, IEEE Transactions on, vol. PP, p. 1-1, 2010.）.

Although proposing that the effect of voice conversion also reaches far away practical requirement for the existing a variety of methods of voice conversion.The problem of existing voice conversion method is present mainly has：

1. many phonetics transfer methods are built upon to after voice signal framing, under the framework of each frame independent process.Under this framework, the correlation of voice interframe is often ignored, so as to cause voice after conversion discontinuity occur and show, reduces the quality of voice after conversion.Method, the method based on gauss hybrid models and the method based on artificial neural network for example mapped based on vector quantization yardage；

2. the target of voice conversion is the correct speaker's personal characteristic information changed in voice, and existing voice conversion method is not separated the personal characteristic information of speaker's voice before conversion process is carried out from voice signal, but directly voice signal is handled.It so not only result in that conversion effect is unsatisfactory, simultaneously because changing other compositions in voice signal, cause the decline of voice quality after conversion.

Convolution Non-negative Matrix Factorization（Convolutive Nonnegative Matrix Factorization）A kind of non-negative matrix factorization method proposed for Speech processing, this method has used two-dimentional time-frequency base to replace original one-dimensional base vector, so as to preferably carry the timing dependence of voice signal on the premise of decomposition result nonnegativity is ensured.This method has more successful application in the separation for speaking human speech sound more（6. S. Paris, "Convolutive Speech Bases and Their Application to Supervised Speech Separation," Audio, Speech, and Language Processing, IEEE Transactions on, vol. 15, pp. 1-12, 2007-01-01 2007.）.Voice signal can be decomposed into the encoder matrix of one group of non-negative time-frequency base and this group of base by this method.The sub-spaces that obtained time-frequency base is regarded as carrying speaker characteristic are decomposed, and encoder matrix is then projection of the voice in each sub-spaces.Therefore the function that the speaker characteristic information in voice signal is separated from voice signal has been largely fulfilled by this isolation.Further, since convolution Non-negative Matrix Factorization can preferably take the timing dependence of voice signal into account relative to traditional Non-negative Matrix Factorization, so as to ensure the continuity of reconstructed voice.

But this method has the problem of decomposition result is not unique, i.e., obtained basic matrix is decomposed to same speech data under different primary condition not unique.Although this different expression form that not can be regarded as feature space uniquely, its application in voice conversion is limited.

The content of the invention

It is an object of the invention to provide a kind of phonetics transfer method based on convolution Non-negative Matrix Factorization.The separation to personal characteristic information in speech channel frequency spectrum is realized using convolution Non-negative Matrix Factorization, and the correlation of voice sequential is effectively saved in separation process, on the premise of source speaker's voice and target speaker's voice spectrum convolution Non-negative Matrix Factorization result uniformity is ensured, the conversion for completing speech channel frequency spectrum is replaced by time-frequency base.And furthermore achieved that voice is changed on this basis, make the voice after conversion that there is higher quality and there is stronger similarity with target speaker in vocie personal feature.

The technical solution for realizing the object of the invention is：A kind of phonetics transfer method based on convolution Non-negative Matrix Factorization, step is as follows：

First, transformation model is trained by training data：

The first step：The time alignment and parameter decomposition of speech data are trained, for the parallel speech data used in training, i.e. source speaker and the voice pair of the identical content of target speaker, wherein source speaker voice is represented by

Figure 2011102674255100002DEST_PATH_IMAGE002

, target speaker's voice is represented by

Figure 2011102674255100002DEST_PATH_IMAGE004

, pass through the pitch period envelope of both STRAIGHT model extractions first

Figure 2011102674255100002DEST_PATH_IMAGE006

With

Figure 2011102674255100002DEST_PATH_IMAGE008

, calculated afterwards by pitch period envelope and primary speech signal for realizing that the fundamental tone that pitch synchronous splicing adding is handled marks point

Figure 2011102674255100002DEST_PATH_IMAGE010

With

Figure 2011102674255100002DEST_PATH_IMAGE012

；According to phoneme division information, with voice

、

Corresponding phoneme carry out fundamental tone mark Point matching for unit, afterwards again using phoneme as elementary cell, voice is realized using pitch synchronous splicing adding mode based on matching fundamental tone mark point

With

Time alignment, obtain the voice after time alignment

Figure 2011102674255100002DEST_PATH_IMAGE014

With, use STRAIGHT models pair

Figure 2011102674255100002DEST_PATH_IMAGE018

With

Analyzed, obtain three groups of parameters：

（1）Characterize the STRAIGHT spectrums of tract characteristics

Figure 2011102674255100002DEST_PATH_IMAGE020

、

Figure 2011102674255100002DEST_PATH_IMAGE022

；

（2）Fundamental frequency

Figure 2011102674255100002DEST_PATH_IMAGE024

、；

（3）Aperiodic component

Figure 2011102674255100002DEST_PATH_IMAGE028

、

Figure 2011102674255100002DEST_PATH_IMAGE030

；

Second step：STRAIGHT spectrums are analyzed using convolution non-negative matrix factorization method, i.e., it is right first

STRAIGHT spectrum

Analyzed using convolution non-negative matrix factorization method, obtain its time-frequency base

Figure 2011102674255100002DEST_PATH_IMAGE032

And encoder matrix

Figure 2011102674255100002DEST_PATH_IMAGE034

, pass through convolution Non-negative Matrix Factorization mode pair again afterwardsSTRAIGHT spectrum

Analyzed, now fixing its encoder matrix is

, then can obtain its time-frequency base

Figure 2011102674255100002DEST_PATH_IMAGE036

；

3rd step：The fundamental frequency of analysis source voice and target voice, i.e., by pairWith

Fundamental frequency information

With

Analyzed, obtain the average and variance of both：

Figure 2011102674255100002DEST_PATH_IMAGE038

、With

Figure 2011102674255100002DEST_PATH_IMAGE042

、

Figure 2011102674255100002DEST_PATH_IMAGE044

；

Secondly, new input voice is changed based on training pattern：

The first step：For source speech data to be converted

Figure 2011102674255100002DEST_PATH_IMAGE046

Parameter decomposition is carried out using STRAIGHT models, its STRAIGHT spectrums are obtained

Figure 2011102674255100002DEST_PATH_IMAGE048

, fundamental frequency

Figure 2011102674255100002DEST_PATH_IMAGE050

And aperiodic componentThree groups of parameters；

Second step：Pair the conversion of vocal tract spectrum parameter is realized based on convolution Non-negative Matrix Factorization, i.e.,

Analyzed using convolution Non-negative Matrix Factorization, now fixing its time-frequency base is

Figure 2011102674255100002DEST_PATH_IMAGE054

, obtain corresponding encoder matrix

Figure 2011102674255100002DEST_PATH_IMAGE056

, and then the STRAIGHT spectrums after being changed by equation below：

Figure 2011102674255100002DEST_PATH_IMAGE058

Wherein

Figure 2011102674255100002DEST_PATH_IMAGE060

The STRAIGHT spectrums after conversion are represented, "

Figure 2011102674255100002DEST_PATH_IMAGE062

" it is convolution algorithm；

3rd step：The average and variance of the fundamental frequency obtained based on the training stage, realize the conversion of fundamental frequency：

Wherein

Figure 2011102674255100002DEST_PATH_IMAGE066

Represent the fundamental frequency after conversion；

4th step：Voice after synthesis conversion, i.e., composed by the STRAIGHT being converted to, fundamental frequency

And original aperiodic component

Voice after synthesis conversion.

The present invention compared with prior art, its remarkable advantage：（1）In the training stage, based on phoneme information, the matching of source speaker's voice and target speaker's voice is realized using pitch synchronous splicing adding method, makes the voice after matching that there is higher time match precision and voice quality, the training effect of voice conversion is improved；（2）Efficiently separating for personal characteristic information in vocal tract spectrum is realized by convolution non-negative matrix factorization method, transfer process is implemented for personal characteristic information, so as to improve the conversion effect of voice.In addition, convolution non-negative matrix factorization method effectively saves the relativity of time domain of vocal tract spectrum parameter, make reconstructed voice that there is more preferable continuity, improve the voice quality of conversion voice.

The present invention is described in further detail below in conjunction with the accompanying drawings.

Brief description of the drawings

Fig. 1 is the phonetics transfer method schematic diagram of the invention based on convolution Non-negative Matrix Factorization.

Fig. 2 is to carry out the time alignment based on phoneme for training voice to handle schematic diagram.

Fig. 3 is voice fundamental mark point schematic diagram.

Fig. 4 is the training voice time-frequency base calculation process schematic diagram based on convolution Non-negative Matrix Factorization.

Fig. 5 is the STRAIGHT time spectrum frequency base schematic diagrames being made up of 40 subbases.

Fig. 6 is the frequency spectrum flow path switch schematic diagram based on convolution Non-negative Matrix Factorization.

Embodiment

With reference to Fig. 1, the phonetics transfer method of the invention based on convolution Non-negative Matrix Factorization, step is as follows：

Training stage：Transformation model is trained by training data.

The first step, train the time alignment and parameter decomposition of speech data：

（1）The time alignment of speech data, as shown in Figure 2.The source speaker's voice concentrated first to training data

With target speaker's voice

, analyzed by STRAIGHT models, obtain the pitch of both each sampled points, i.e. pitch period envelope

With

：

Figure 2011102674255100002DEST_PATH_IMAGE068

Figure 2011102674255100002DEST_PATH_IMAGE070

Wherein

Figure 2011102674255100002DEST_PATH_IMAGE072

With

Figure 2011102674255100002DEST_PATH_IMAGE074

Source speaker's voice is represented respectively

With target speaker's voiceIn the sampled point number that includes.

Pitch period herein is represented in sampled point number form, and round is carried out to fractional part.Because voiceless sound section and unvoiced segments are without obvious pitch period, therefore its pitch period is fixed as

Figure 2011102674255100002DEST_PATH_IMAGE076

, whereinFor speech sample frequency, "

Figure 2011102674255100002DEST_PATH_IMAGE080

" represent to take no more than

Maximum integer.Based on pitch contour, using pitch period length as frame length, to voice

、Carry out framing.With voice

Exemplified by, framing step is as follows：

From the 1st sampled point of voiceStart, with the pitch period length corresponding to it

Figure 2011102674255100002DEST_PATH_IMAGE086

The first frame is determined for frame length

Figure 2011102674255100002DEST_PATH_IMAGE088

, wherein

Figure 2011102674255100002DEST_PATH_IMAGE090

.Afterwards with voice

Figure 2011102674255100002DEST_PATH_IMAGE092

Sampled point

Figure 2011102674255100002DEST_PATH_IMAGE094

For the second frame start position, the second frame is determined by frame length of the pitch period corresponding to it

Figure 2011102674255100002DEST_PATH_IMAGE096

, wherein

Figure 2011102674255100002DEST_PATH_IMAGE098

.By that analogy, to

Frame, the starting point of current speech is obtained based on upper framing result

Figure 2011102674255100002DEST_PATH_IMAGE102

, and with the pitch period length corresponding to it

Figure 2011102674255100002DEST_PATH_IMAGE104

The framing result of current speech is obtained for frame length

Figure 2011102674255100002DEST_PATH_IMAGE106

, wherein

Figure 2011102674255100002DEST_PATH_IMAGE108

.This process is repeated, until voice ending, if being obtained

Figure 2011102674255100002DEST_PATH_IMAGE110

Frame voice.

Complete after framing, centered on each frame speech centre point, with most long frame length

Figure 2011102674255100002DEST_PATH_IMAGE112

Built for length

Figure 2011102674255100002DEST_PATH_IMAGE114

Speech data matrix

Figure 2011102674255100002DEST_PATH_IMAGE116

, its is each to be classified as a frame voice, and passes through Hanning windows to every column data and carry out windowing process.When building matrix, respectively using voice starting and ending point polishing at curtailment that voice is originated and ended up.

To matrix

Searched for by column, a point is determined in each row, so as to constitute the fundamental tone mark locus of points through each row

, make each point value sum on track maximum.The row position difference of selected point is not more than 6 rows between limitation adjacent column in search procedure.It can obtain marking point for the fundamental tone of PSOLA processing by the method, these mark points are in Amplitude maxima position in voice voiced segments.The fundamental tone mark point schematic diagram that one section of voice is obtained by the above method is given in Fig. 3.Same method can arrive voice

Fundamental tone mark point

。

According to phoneme division information, the matching corresponding relation of fundamental tone mark point in source speaker and target speaker's phoneme of speech sound is set up：

Figure 2011102674255100002DEST_PATH_IMAGE118

.WhereinWith

Figure 2011102674255100002DEST_PATH_IMAGE122

The is represented in source speaker and target speaker's voice respectivelyThe fundamental tone mark point information that individual phoneme is included, concrete form is as follows：

Figure 2011102674255100002DEST_PATH_IMAGE126

Figure 2011102674255100002DEST_PATH_IMAGE128

Here,

Figure 2011102674255100002DEST_PATH_IMAGE130

With

Figure 2011102674255100002DEST_PATH_IMAGE132

Respectively in source speaker and target speaker's voice

In individual phoneme

With

Individual fundamental tone marks point.

Figure 2011102674255100002DEST_PATH_IMAGE136

With

Figure 2011102674255100002DEST_PATH_IMAGE138

Respectively

The fundamental tone mark that both include in individual phoneme is counted out.

Based on training voice

、The fundamental tone mark point information of middle matching phoneme, the voice duration for realizing source speaker phoneme corresponding with target speaker using PSOLA methods aligns.The frame length of PSOLA processing is taken as the triple-length of the corresponding pitch period of current pitch mark point.In alignment procedure, on the basis of matching in phoneme duration compared with minor element, another phoneme is compressed by PSOLA methods and realizes alignment.Because PSOLA methods are the progress duration adjustment in units of pitch period, Adjustment precision only can guarantee that in a pitch period length, therefore the different information adjusted to current matching phoneme will be included in next matching phoneme duration alignment and handle.And then realize alignment by truncating mode for the unvoiced segments between phoneme in speech data.

By above-mentioned steps to voice

、

In after unvoiced segments are handled between each phoneme and phoneme, obtained source speaker's voice of elapsed time alignment

With target speaker's voice

。

（2）Speech parameter is decomposed.Training voice after being aligned for the elapsed time, parameter decomposition is carried out using STRAIGHT models.Source speaker's voice can be respectively obtained by decomposition

With target speaker's voiceThree groups of parameters：

A) the STRAIGHT spectrums of vocal tract spectrum feature are characterized, it is two-dimensional matrix：

Figure 2011102674255100002DEST_PATH_IMAGE140

Each of which row represent the STRAIGHT spectrum of a frame voice, include altogether

Figure 2011102674255100002DEST_PATH_IMAGE142

Individual spectrum point, takes

Figure 2011102674255100002DEST_PATH_IMAGE144

.Whole section of voice is divided into

Figure 2011102674255100002DEST_PATH_IMAGE146

Frame is analyzed, at a distance of 10ms between each frame center's point.Here use

Expression source speaker's voice STRAIGHT is composed,

Represent target speaker's voice STRAIGHT spectrums；

B) fundamental frequency of voice is trained,, wherein

Figure 2011102674255100002DEST_PATH_IMAGE150

For voiceThe fundamental frequency of frame, its spectrum the with STRAIGHT

Row are corresponding.Here use

Expression source speaker's voice fundamental frequency,

Represent target speaker's voice fundamental frequency；

C) aperiodic component, it influences smaller to voice in the transfer to characterize the matrix of the aperiodic information characteristic of phonological component, thus without conversion processing.

Second step, voice STRAIGHT spectrums are analyzed by convolution non-negative matrix factorization method, the time-frequency base of source speaker and target speaker's voice STRAIGHT spectrums are obtained, as shown in Figure 4.Analysis is comprised the following steps that：

（1）Source speaker STRAIGHT is composed using convolution non-negative matrix factorization method

Analyzed, can obtain following decomposition result：

Figure 2011102674255100002DEST_PATH_IMAGE154

WhereinFor

Time-frequency base, be specifically

Figure 2011102674255100002DEST_PATH_IMAGE156

Matrix, wherein each column vector is the frequency domain base vector of STRAIGHT spectrum, by

Figure 2011102674255100002DEST_PATH_IMAGE158

Individual such base vector constitutes a time-frequency base, is obtained

Individual such time-frequency base.Take

Figure 2011102674255100002DEST_PATH_IMAGE162

,

Figure 2011102674255100002DEST_PATH_IMAGE164

。

Figure 2011102674255100002DEST_PATH_IMAGE166

Represent to encoder matrixMoved to right with column vector form

Individual unit, concrete form is as follows：

If

Figure 2011102674255100002DEST_PATH_IMAGE172

Wherein

For encoder matrix

Individual column vector, andIn include altogether

Individual column vector.Then when

When：

When

When：

Wherein "

" it is null value column vector.

With

Calculating obtained by following iterative process：

A) it is rightWith

Carry out random initializtion；

B) following formula calculating pair is passed through

Reconstruction result：

C) it is based on

To time-frequency baseIt is updated, renewal process is directed toCalculated successively：

Wherein "" represent two matrixes between element multiplication,It is all 1 to represent element

Matrix.

After the completion of time-frequency base updates, encoder matrix is updated by following formula：

D) judge whether iterations reaches maximum iteration 300 times, or speech reconstruction error is less than 10^-5, reconstructed error determines by following formula：

When above-mentioned two condition is all unsatisfactory for, returns to step b) and continue iteration, iterative cycles are otherwise terminated, into next step e).

E) final decomposition result is obtained：With

Fig. 5 is to decompose one section of obtained STRAIGHT time spectrum frequency base schematic diagram by the above method.

The time-frequency base obtained after decompositionA proper subspace of source speaker STRAIGHT spectrums is can be regarded as, the personal characteristic information of source speaker's vocal tract spectrum is carried, and encoder matrix

It is then to compose in subspace

On projection, carry the change of time-frequency base in time.The source speaker's voice and target speaker's voice concentrated due to training voice have passed through accurate time alignment, it is believed that both only have differences in speaker characteristic information, therefore after Non-negative Matrix Factorization, both encoder matrixs are identicals.

（2）Source speaker STRAIGHT is composed using convolution non-negative matrix factorization method

Analyzed, analysis method with（1）Middle analysis method is identical, but now regular coding matrix is

, then can obtain

Time-frequency base。

3rd step, analyze source voice and target voice pitch frequency parameter.Analyze and 1 rank of fundamental frequency and 2 rank statistics in source speaker and target speaker training voice are obtained by STRAIGHT models, i.e.,

With

Average and variance

、

With

、

。

The conversion stage,New input voice is changed based on training pattern.

The first step：To source speech data to be convertedParameter decomposition is carried out using STRAIGHT models（Parameter decomposition method is as the parameter decomposition method of training stage）, obtain its STRAIGHT spectrums

, fundamental frequencyAnd aperiodic component

Three groups of parameters；

Second step：The conversion of vocal tract spectrum parameter is realized based on convolution Non-negative Matrix Factorization, as shown in Figure 6.It is right first

Analyzed using convolution Non-negative Matrix Factorization.In analysis method and training stage second step（1）Methods described is identical, but now sets time-frequency base and obtain as the training stage, so as to can obtain corresponding encoder matrix

.Analyzed from before, speaker's personal characteristic information is carried in time-frequency base, thus when realizing conversion, usedReplace

, afterwards and encoder matrixConvolution algorithm is carried out, the STRAIGHT spectrums after being changed are shown below：

Wherein

The STRAIGHT spectrums after conversion are represented, "

" it is process of convolution.

3rd step：Realize the conversion of fundamental frequency.For fundamental frequency to be converted, the source speaker obtained by the training stage and the average and variance of target speaker's fundamental frequency are realized according to following formula and changed：

Wherein

Represent the fundamental frequency after conversion.

4th step：Voice after synthesis conversion.Use the STRAIGHT spectrums after conversion

, fundamental frequency after conversion

, and the aperiodic component obtained during signal decomposition, according to STRAIGHT Model voice composition algorithms, you can the speech data after being changed：

Wherein

STRAIGHT Speech Synthesis Algorithms are represented,

For the speech data after conversion.

Claims

1. a kind of phonetics transfer method based on convolution Non-negative Matrix Factorization, it is characterised in that step is as follows：

First, transformation model is trained by training data：

Figure 2011102674255100001DEST_PATH_IMAGE002

, target speaker's voice is represented by

Figure 2011102674255100001DEST_PATH_IMAGE004

Figure 2011102674255100001DEST_PATH_IMAGE006

With

Figure 2011102674255100001DEST_PATH_IMAGE008

Figure 2011102674255100001DEST_PATH_IMAGE010

With；According to phoneme division information, with voice

、Corresponding phoneme carry out fundamental tone mark Point matching for unit, afterwards again using phoneme as elementary cell, voice is realized using pitch synchronous splicing adding mode based on matching fundamental tone mark point

With

Time alignment, obtain the voice after time alignment

Figure 2011102674255100001DEST_PATH_IMAGE014

With

Figure 2011102674255100001DEST_PATH_IMAGE016

, use STRAIGHT models pair

Figure 2011102674255100001DEST_PATH_IMAGE018

With

Analyzed, obtain three groups of parameters：

（1）Characterize the STRAIGHT spectrums of tract characteristics

Figure 2011102674255100001DEST_PATH_IMAGE020

、

Figure 2011102674255100001DEST_PATH_IMAGE022

；

（2）Fundamental frequency、；

（3）Aperiodic component

、

；

STRAIGHT spectrum

And encoder matrix

, pass through convolution Non-negative Matrix Factorization mode pair again afterwards

STRAIGHT spectrum

Analyzed, now fixing its encoder matrix is

, then can obtain its time-frequency base

；

3rd step：The fundamental frequency of analysis source voice and target voice, i.e., by pair

With

Fundamental frequency information

With

Analyzed, obtain the average and variance of both：、

With、；

Secondly, new input voice is changed based on training pattern：

The first step：For source speech data to be convertedParameter decomposition is carried out using STRAIGHT models, its STRAIGHT spectrums are obtained, fundamental frequency

And aperiodic componentThree groups of parameters；

Second step：Pair the conversion of vocal tract spectrum parameter is realized based on convolution Non-negative Matrix Factorization, i.e.,Analyzed using convolution Non-negative Matrix Factorization, now fixing its time-frequency base is

, obtain corresponding encoder matrix

, and then the STRAIGHT spectrums after being changed by equation below：

Wherein

The STRAIGHT spectrums after conversion are represented, "" it is convolution algorithm；

Wherein

Represent the fundamental frequency after conversion；

4th step：Voice after synthesis conversion, i.e., composed by the STRAIGHT being converted to

, fundamental frequencyAnd original aperiodic componentVoice after synthesis conversion.

2. the phonetics transfer method according to claim 1 based on convolution Non-negative Matrix Factorization, it is characterised in that based on pitch contour, using pitch period length as frame length, to voice

、Carry out time alignment：

1）The framing stage

From the 1st sampled point of voice

Start, with the pitch period length corresponding to it

The first frame is determined for frame length

, wherein, afterwards with voice

Sampled point

, wherein

, by that analogy, to

, and with the pitch period length corresponding to it

The framing result of current speech is obtained for frame length

, wherein

, this process is repeated, until voice ending, if being obtained

Frame voice；

Built for length

Speech data matrix

, its is each to be classified as a frame voice, and carries out windowing process by Hanning windows to every column data, when building matrix, respectively using voice starting and ending point polishing at curtailment that voice is originated and ended up；

To matrix

Make each point value sum on track maximum, the row position difference of selected point is not more than 6 rows between limitation adjacent column in search procedure, obtain marking point for the fundamental tone of pitch synchronous splicing adding processing by the method, these mark points are in Amplitude maxima position in voice voiced segments, and same method can arrive voice

Fundamental tone mark point；

2）Matching stage

, wherein

With

The is represented in source speaker and target speaker's voice respectively

The fundamental tone mark point information that individual phoneme is included, concrete form is as follows：

Here,

With

Respectively in source speaker and target speaker's voice

In individual phoneme

WithIndividual fundamental tone marks point,

With

RespectivelyThe fundamental tone mark that both include in individual phoneme is counted out；

3）Alignment stage

Based on training voice、The fundamental tone mark point information of middle matching phoneme, the voice duration alignment of source speaker phoneme corresponding with target speaker is realized using pitch synchronous splicing adding method, the frame length of pitch synchronous splicing adding processing is taken as the triple-length of the corresponding pitch period of current pitch mark point；In alignment procedure, on the basis of matching in phoneme duration compared with minor element, another phoneme is compressed by pitch synchronous splicing adding method and realizes alignment；Because PSOLA methods are the progress duration adjustment in units of pitch period, Adjustment precision only can guarantee that in a pitch period length, therefore the different information that current matching phoneme is adjusted will be included in next matching phoneme duration alignment and handled, and then alignment is realized by truncating mode for the unvoiced segments between phoneme in speech data；

By above-mentioned steps to voice

、

With target speaker's voice

。

3. the phonetics transfer method according to claim 1 based on convolution Non-negative Matrix Factorization, it is characterized in that the speech parameter of training stage is decomposed, training voice after being aligned for the elapsed time, carries out parameter decomposition using STRAIGHT models, source speaker's voice is respectively obtained by decomposition

With target speaker's voice

Three groups of parameters：

1）The STRAIGHT spectrums of vocal tract spectrum feature are characterized, it is two-dimensional matrix：

Individual spectrum point, takes

, whole section of voice be divided into

Frame is analyzed, and at a distance of 10ms between each frame center's point, is used here

Expression source speaker's voice STRAIGHT is composed,

Represent target speaker's voice STRAIGHT spectrums；

2）The fundamental frequency of voice is trained,

, wherein

For voice

The fundamental frequency of frame, its spectrum the with STRAIGHT

Row are corresponding, use here

Expression source speaker's voice fundamental frequency,

Represent target speaker's voice fundamental frequency；

3）Aperiodic component

, it influences smaller to voice in the transfer to characterize the matrix of the aperiodic information characteristic of phonological component, thus without conversion processing.

4. the phonetics transfer method according to claim 1 based on convolution Non-negative Matrix Factorization, it is characterised in that the analytical procedure of training stage is as follows：

1）Source speaker STRAIGHT is composed using convolution non-negative matrix factorization method