CN107221321A - Voice conversion method between an arbitrary source and a target voice - Google Patents
Voice conversion method between an arbitrary source and a target voice
- Publication number
- CN107221321A CN107221321A CN201710186569.5A CN201710186569A CN107221321A CN 107221321 A CN107221321 A CN 107221321A CN 201710186569 A CN201710186569 A CN 201710186569A CN 107221321 A CN107221321 A CN 107221321A
- Authority
- CN
- China
- Prior art keywords
- voice
- dictionary
- target voice
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/10—Speech classification or search using distance or distortion measures between unknown speech and reference templates
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
- G10L2015/0631—Creating reference templates; Clustering
- G10L2015/0633—Creating reference templates; Clustering using lexical or orthographic knowledge sources
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
- G10L2015/0635—Training updating or merging of old and new templates; Mean values; Weighting
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Artificial Intelligence (AREA)
- Compression, Expansion, Code Conversion, And Decoders (AREA)
Abstract
The invention discloses a voice conversion method between an arbitrary source and a target voice, comprising the following steps: step 1, establishing a tensor dictionary of at least one target voice; step 2, constructing the voice dictionary of the target voice and the voice dictionary of the arbitrary source; step 3, reconstructing the speech content of the arbitrary source, realizing the conversion from the arbitrary source to the target voice. Using a single tensor dictionary, the invention constructs the respective voice dictionaries from the speech signal of the arbitrary source and the tensor dictionary corresponding to the target voice to be converted, and thereby realizes voice conversion. No large parallel corpus of the specific source and target speakers needs to be collected and trained before each conversion, so practicality and conversion efficiency are high.
Description
Technical field
The invention belongs to the field of voice conversion technology, and in particular relates to a voice conversion method between an arbitrary source and a target voice.
Background art
A speech signal is a signal carrying language: it bears linguistic meaning and contains rich information, such as the identity of the speaker, the speaker's affective state, and the speech content.
Voice conversion is a technique that replaces the identity information of a source speaker with that of a target speaker while keeping the speech content unchanged. Voice conversion serves many important applications: emotion recognition and conversion, text-to-speech (TTS) systems that turn text information into speech, spectrum restoration, audio bandwidth extension, and speech reconstruction for people with dysphonia.
At present there are many voice conversion methods; the two most common classical families are statistical methods and sparse-representation methods.
Among the voice conversion methods based on statistical parameters, Gaussian mixture models are the most widely used. In a Gaussian-mixture-model algorithm, a transfer function is needed to realize a weighted average, and the parameters of the transfer function are estimated under the minimum mean-square error criterion (Minimum Mean-Square Error, MMSE) or the maximum-likelihood criterion (maximum likelihood, ML). Although this conversion method is simple and intuitive and performs well, it has two shortcomings: first, it needs a large parallel corpus for training, otherwise over-fitting occurs; second, the converted speech spectrum is over-smoothed and not natural enough.
Among the voice conversion methods based on sparse representation, exemplar-based conversion has developed rapidly because sparse representation is widely used in signal processing. In 2001, D. Seung et al. proposed non-negative matrix factorization (Non-negative matrix factorization, NMF); the NMF voice conversion algorithm first represents the source speaker's speech sparsely, that is, as the product of a voice dictionary and an excitation matrix. In the conversion stage, the target speaker's voice dictionary replaces the source speaker's voice dictionary to realize the conversion. NMF-based methods can effectively alleviate the over-fitting of statistical-parameter methods, produce more natural speech, and have good noise robustness. However, they also have the following shortcoming: before each conversion, enough parallel speech of the specific source and target speakers must be collected for the dictionary-training stage, so once the source speaker's identity changes in the conversion stage, the conversion of the source speaker's voice to the target speaker's voice can no longer be completed. In practical applications it is impossible to collect a large parallel corpus for every source and target speaker pair to train the process; therefore NMF-based voice conversion algorithms are limited and cannot efficiently and quickly realize conversion between an arbitrary source and a target voice.
Content of the invention
The purpose of the invention is to solve the above problems by providing a voice conversion method between an arbitrary source and a target speaker, in which the dictionary-training process and the conversion process are separated, so that the voice dictionary does not need to be retrained when the identities of the source voice and the target voice change. On the basis of the NMF method, the invention introduces the concept of a tensor: one or more target voices are chosen from a corpus as the basic speech of a voice tensor dictionary, the parallel speech segments of these target voices are aligned by a multi-sequence dynamic time warping algorithm, and a tensor dictionary composed of one or more two-dimensional basic dictionaries is established. In the conversion stage, the source and target speaker voices can each be expressed as a linear combination of the basic dictionaries in the tensor dictionary, from which the respective voice dictionaries are constructed and the voice conversion is realized.
To achieve the above purpose, the invention adopts the following technical scheme: a voice conversion method between an arbitrary source and a target voice, comprising the following steps:
Step 1, establishing the tensor dictionary of at least one basic speaker's voice;
Step 2, constructing the voice dictionary of the target voice and the voice dictionary of the arbitrary source;
Step 3, reconstructing the speech content of the arbitrary source, realizing the conversion from the arbitrary source to the target voice.
Further, in step 2 the voice dictionary of the target voice and the voice dictionary of the arbitrary source are constructed from the tensor dictionary using a tensor algorithm.
Further, the process of establishing the tensor dictionary of at least one target voice in step 1 is as follows:
1) randomly select $N$ target voices from the corpus as the basic speech of the tensor dictionary, and from these $N$ target voices randomly select speech signals $x_1, x_2, \ldots, x_N$ ($N \ge 1$) with identical semantic content, where $x$ denotes a speech signal;
2) extract the characteristic parameter vector sequence $S_1, S_2, \ldots, S_N$ of each speech signal;
3) align the above characteristic parameter vector sequences $S_1, S_2, \ldots, S_N$ using the multi-sequence dynamic time warping algorithm; the aligned sequences are $S'_1, S'_2, \ldots, S'_N$;
4) from $S'_1, S'_2, \ldots, S'_N$, randomly select the speech-frame feature vectors $S''_1, S''_2, \ldots, S''_N$ at the same positions as part of the dictionary of each target voice;
5) while the size of each target voice's dictionary is smaller than its preset value, repeat steps 2) to 4); when it equals the preset value, stop, forming $N$ target voice dictionaries $A_1, A_2, \ldots, A_N$;
6) stack the $N$ target voice dictionaries $A_1, A_2, \ldots, A_N$ together to form the tensor dictionary $A$.
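Steps 1) to 6) above can be sketched with NumPy. This is only an illustration under assumptions: alignment is taken as already done, and the repeat-until-full loop of step 5) is collapsed into one random draw of frame positions shared by all speakers:

```python
import numpy as np

def build_tensor_dictionary(aligned_feats, dict_frames, rng=None):
    """Stack N aligned feature sequences into a tensor dictionary A.

    aligned_feats: list of N arrays of shape (dim, T), already aligned by
                   multi-sequence DTW so that frame t corresponds across speakers.
    dict_frames:   preset dictionary size (number of frames to keep).
    Returns A of shape (N, dim, dict_frames): N stacked 2-D basic dictionaries.
    """
    rng = rng or np.random.default_rng(0)
    dim, T = aligned_feats[0].shape
    # pick the SAME frame positions from every speaker (step 4 of the method)
    idx = rng.choice(T, size=dict_frames, replace=dict_frames > T)
    basic_dicts = [S[:, idx] for S in aligned_feats]   # dictionaries A_1 .. A_N
    return np.stack(basic_dicts, axis=0)               # tensor dictionary A

# toy example: 3 "speakers", 5-dim features, 40 aligned frames, 12-frame dictionary
feats = [np.abs(np.random.rand(5, 40)) for _ in range(3)]
A = build_tensor_dictionary(feats, dict_frames=12)
print(A.shape)   # (3, 5, 12)
```

Picking identical frame positions across speakers is what keeps the $N$ basic dictionaries parallel, which the conversion stage relies on.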
Further, the multi-sequence dynamic time warping algorithm in step 3) refers to an alternating, synchronized application of dynamic time warping in which each characteristic parameter vector sequence in $S_1, S_2, \ldots, S_N$ participates in at most two multi-sequence dynamic time warps.
Further, constructing the corresponding voice dictionary of the target voice in step 2 includes the following process:
1) for the tensor dictionary $A$, compute the weight coefficient of each target voice in $A$, $\alpha = [\alpha_1, \alpha_2, \ldots, \alpha_N]$, $0 < \alpha_n < 1$, $N \ge 1$;
2) form the weighted linear combination of the basic dictionaries of the target voices, $\sum_{n=1}^{N} \alpha_n A_n$;
3) multiply the weighted linear combination by the excitation matrix $H$ to obtain the speech representation $S$ of the target voice,
$$S = \Big(\sum_{n=1}^{N} \alpha_n A_n\Big) H,$$
where $A_n$ denotes the $n$-th basic dictionary in the tensor dictionary $A$ and $\alpha_n$ ($0 \le \alpha_n \le 1$) the weight coefficient of the $n$-th target voice's dictionary;
4) compute the cost function of the weight coefficients $\alpha$ and the excitation matrix $H$:
$$\min_{\alpha, H}\; D\Big(S \,\Big\|\, \big(\textstyle\sum_{n=1}^{N} \alpha_n A_n\big) H\Big) + \lambda \lVert H \rVert_1,$$
where $D(\cdot\|\cdot)$ is the KL divergence, $\lVert \cdot \rVert_1$ denotes the 1-norm, and $\lambda$ ($0 \le \lambda \le 1$) is the sparsity penalty factor ensuring the sparsity of the excitation matrix, subject to $H \ge 0$, $\alpha = [\alpha_1, \alpha_2, \ldots, \alpha_N]$ and $\sum_{n=1}^{N} \alpha_n = 1$;
5) keep the value of $A$ in step 4) unchanged and, using the multiplicative update rule of the non-negative matrix factorization algorithm, iterate the weight coefficients $\alpha$ and the excitation matrix $H$ until the cost function reaches its minimum, obtaining the iterative formulas (writing $W = \sum_{n=1}^{N} \alpha_n A_n$):
$$H \leftarrow H \otimes \frac{W^{T}\big(S/(W H)\big)}{W^{T}\mathbf{1} + \lambda}, \qquad \alpha_k \leftarrow \alpha_k \cdot \frac{\sum \big((A_k H) \otimes (S/(W H))\big)}{\sum (A_k H)},$$
where $\otimes$ denotes element-wise multiplication between matrices, $/$ denotes element-wise division, $T$ denotes matrix transposition, $\lambda$ is the sparsity penalty factor, $\mathbf{1}$ is the all-ones matrix, and $k$ indexes the $k$-th target voice;
6) using formulas (1), (3) and (4), compute the sparse representation of the target voice's feature vector sequence $S_{tgt}$ under the tensor dictionary $A$:
$$S_{tgt} \approx \Big(\sum_{n=1}^{N} \alpha_{tgt,n} A_n\Big) H_2,$$
where $\alpha_{tgt}$ denotes the tensor-dictionary weight coefficients of $S_{tgt}$ and $H_2$ the excitation matrix of $S_{tgt}$;
7) using formulas (1), (3) and (4), compute the sparse representation of the arbitrary source's feature vector sequence $S_{src}$ under the tensor dictionary $A$:
$$S_{src} \approx \Big(\sum_{n=1}^{N} \alpha_{src,n} A_n\Big) H_1,$$
where $\alpha_{src}$ denotes the tensor-dictionary weight coefficients of the arbitrary source and $H_1$ the excitation matrix of $S_{src}$.
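The sparse-coding iteration above can be sketched as follows. The original update formulas were figures and are not reproduced verbatim; this is an assumed reconstruction using the standard Lee-Seung KL-divergence multiplicative updates with an l1 penalty on the excitation matrix, and all function names are illustrative:

```python
import numpy as np

def sparse_code(S, A, lam=0.1, n_iter=100, eps=1e-9):
    """Estimate alpha and H so that S ≈ (sum_n alpha_n * A_n) @ H.

    S: (dim, T) non-negative feature sequence; A: (N, dim, K) tensor dictionary.
    """
    N, dim, K = A.shape
    T = S.shape[1]
    rng = np.random.default_rng(0)
    alpha = np.full(N, 1.0 / N)
    H = rng.random((K, T)) + eps
    for _ in range(n_iter):
        W = np.tensordot(alpha, A, axes=1)               # weighted dictionary, (dim, K)
        R = S / (W @ H + eps)                            # element-wise ratio S / (WH)
        H *= (W.T @ R) / (W.T @ np.ones_like(S) + lam)   # KL update with l1 penalty
        WH_n = [A[n] @ H for n in range(N)]
        R = S / (np.tensordot(alpha, A, axes=1) @ H + eps)
        # multiplicative weight update, then renormalise so sum(alpha) == 1
        alpha *= np.array([(WH_n[n] * R).sum() / (WH_n[n].sum() + eps)
                           for n in range(N)])
        alpha /= alpha.sum()
    return alpha, H

# toy check on random non-negative data
rng = np.random.default_rng(1)
A = rng.random((3, 6, 8)) + 0.1
S = rng.random((6, 20))
alpha, H = sparse_code(S, A)
print(round(alpha.sum(), 6))   # 1.0
```

Running it once for the target voice gives $(\alpha_{tgt}, H_2)$ and once for the source gives $(\alpha_{src}, H_1)$.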
Further, in the step of computing the sparse representations of the arbitrary source's feature vector sequence $S_{src}$ and the target voice's feature vector sequence $S_{tgt}$ under the tensor dictionary $A$: when $S_{src}$ and $S_{tgt}$ are features extracted from the same speech content uttered by the source and the target speaker, it follows from the non-negative matrix factorization algorithm that the excitation matrices $H_1$ and $H_2$ of the two sparse representations under the same tensor dictionary $A$ are approximately equal. For two different utterances of the same speaker, the tensor-dictionary weight coefficients of their sparse representations under $A$ are likewise approximately equal. This is easy to understand: the original purpose of the weight coefficients under the tensor dictionary is to select the voice dictionary belonging to the speaker, and when the speaker's identity is unchanged, the selected dictionary naturally stays the same.
Further, in step 3 the excitation matrix $H_1$ of the arbitrary source feature vector sequence $S_{src}$ to be converted is multiplied by the target voice dictionary, reconstructing the feature vector sequence of the converted target voice and thus the converted target voice:
$$\hat{S}_{tgt} = \Big(\sum_{n=1}^{N} \alpha_{tgt,n} A_n\Big) H_1,$$
where $\hat{S}_{tgt}$ denotes the feature vector sequence of the converted target voice.
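The conversion step is a single matrix product; a minimal sketch, assuming the tensor dictionary `A`, the target weights and the source excitation have already been estimated as above:

```python
import numpy as np

def convert(A, alpha_tgt, H_src):
    """Step 3: reconstruct target-voice features from the source excitation.

    The target voice dictionary is the alpha_tgt-weighted combination of the
    basic dictionaries in A; multiplying it by the source excitation H_src
    keeps the spoken content while swapping in the target speaker's identity.
    """
    W_tgt = np.tensordot(alpha_tgt, A, axes=1)   # (dim, K) target voice dictionary
    return W_tgt @ H_src                          # (dim, T) converted feature sequence

A = np.abs(np.random.rand(3, 5, 12))
alpha_tgt = np.array([0.5, 0.3, 0.2])
H_src = np.abs(np.random.rand(12, 20))
S_conv = convert(A, alpha_tgt, H_src)
print(S_conv.shape)   # (5, 20)
```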
Further, the tensor dictionary in step 1 is established by aligning the parallel speech segments of at least two target voices with the multi-sequence dynamic time warping algorithm, forming a tensor dictionary composed of at least two two-dimensional basic dictionaries.
Further, step 1 includes a constraint on the size of the established tensor dictionary, which is: number of basic speakers × feature dimension × number of speech frames. The choice of the number of basic speakers and of dictionary frames affects both the quality of the converted speech and the complexity of the conversion. If there are too few basic speakers, the voice dictionary of the target speaker represented by the linear combination deviates considerably from the actual voice dictionary; if there are too many, the computational complexity rises. Experiments show that choosing 10-20 basic speakers reconstructs the target voice dictionary well without incurring high computational complexity. The choice of the number of speech frames faces a similar trade-off; experiments show that with 3000-3500 frames, good conversion results are obtained without high computational complexity.
Further, step 3 specifically multiplies the excitation matrix of the arbitrary source by the voice dictionary of the target voice, obtained as a linear weighting of the basic dictionaries, realizing the conversion from the arbitrary source to the target voice.
Compared with the prior art, the beneficial effects of the invention are: 1. using a single tensor dictionary, the respective voice dictionaries are constructed from the speech signal of the arbitrary source and the tensor dictionary corresponding to the target voice to be converted, realizing voice conversion; 2. no large parallel corpus of the specific source and target speakers needs to be collected and trained before each conversion, so practicality and conversion efficiency are high: on the basis of the tensor dictionary, only the feature vector sequences of the arbitrary source and the target voice need to be computed before reconstructing the source's speech content and realizing the conversion; 3. no parallel speech of source and target speakers is needed to train the dictionary, so the training stage can be independent of the conversion stage; 4. when the identities of the source voice and the target voice to be converted change, the trained tensor dictionary stays unchanged, effectively realizing voice conversion between any source and target speakers.
Brief description of the drawings
Fig. 1 is a frame diagram of the tensor dictionary generation process in the training stage of the invention;
Fig. 2 is the first of the multi-sequence DTW schematic diagrams in the tensor dictionary generation process;
Fig. 3 is the second of the multi-sequence DTW schematic diagrams in the tensor dictionary generation process;
Fig. 4 is the third of the multi-sequence DTW schematic diagrams in the tensor dictionary generation process;
Fig. 5 is a schematic diagram of the conversion process in the voice conversion stage.
Detailed description of embodiments
The technical scheme is further described below through specific embodiments.
Embodiment 1
This embodiment discloses a voice conversion method between an arbitrary source and a target voice; the arbitrary source in this scheme refers to the different voices uttered by the different speech objects to be converted. The method comprises the following steps:
Step 1, establishing the tensor dictionary of at least one target voice; with reference to Fig. 1, specifically:
1) randomly select $N$ target voices from the corpus as the basic speech of the tensor dictionary, and from these $N$ target voices randomly select speech signals $x_1, x_2, \ldots, x_N$ ($N \ge 1$) with identical semantic content, where $x$ denotes a speech signal;
2) extract the characteristic parameter vector sequence $S_1, S_2, \ldots, S_N$ of each speech signal;
3) align the sequences $S_1, S_2, \ldots, S_N$ using the multi-sequence dynamic time warping algorithm; the aligned sequences are $S'_1, S'_2, \ldots, S'_N$;
4) from $S'_1, S'_2, \ldots, S'_N$, randomly select the speech-frame feature vectors $S''_1, S''_2, \ldots, S''_N$ at the same positions as part of the dictionary of each target voice;
5) while the size of each target voice's dictionary is smaller than its preset value, repeat steps 2) to 4); when it equals the preset value, stop, forming $N$ target voice dictionaries $A_1, A_2, \ldots, A_N$;
6) stack the $N$ target voice dictionaries $A_1, A_2, \ldots, A_N$ together to form the tensor dictionary $A$.
Step 2, constructing the voice dictionary of the target voice on the basis of the tensor dictionary; specifically:
1) for the tensor dictionary $A$, compute the weight coefficients $\alpha = [\alpha_1, \alpha_2, \ldots, \alpha_N]$, $0 < \alpha_n < 1$;
2) form the weighted linear combination of the basic dictionaries, $\sum_{n=1}^{N} \alpha_n A_n$;
3) multiply the weighted linear combination by the excitation matrix $H$ to obtain the speech representation $S$ of the target voice,
$$S = \Big(\sum_{n=1}^{N} \alpha_n A_n\Big) H,$$
where $A_n$ denotes the $n$-th basic dictionary in $A$ and $\alpha_n$ ($0 \le \alpha_n \le 1$) the weight coefficient of the $n$-th target voice's dictionary;
4) compute the cost function of $\alpha$ and $H$:
$$\min_{\alpha, H}\; D\Big(S \,\Big\|\, \big(\textstyle\sum_{n=1}^{N} \alpha_n A_n\big) H\Big) + \lambda \lVert H \rVert_1,$$
where $D(\cdot\|\cdot)$ is the KL divergence, $\lVert \cdot \rVert_1$ the 1-norm, $\lambda$ the sparsity penalty factor ensuring the sparsity of the excitation matrix, $H \ge 0$, and $\sum_{n=1}^{N} \alpha_n = 1$;
5) keep $A$ unchanged and vary $\alpha$ and $H$ until the cost function reaches its minimum, obtaining the iterative formulas (writing $W = \sum_{n=1}^{N} \alpha_n A_n$):
$$H \leftarrow H \otimes \frac{W^{T}\big(S/(W H)\big)}{W^{T}\mathbf{1} + \lambda}, \qquad \alpha_k \leftarrow \alpha_k \cdot \frac{\sum \big((A_k H) \otimes (S/(W H))\big)}{\sum (A_k H)},$$
where $\otimes$ denotes element-wise multiplication between matrices, $/$ denotes element-wise division, $T$ denotes matrix transposition, $\lambda$ is the sparsity penalty factor, $\mathbf{1}$ is the all-ones matrix, and $k$ indexes the $k$-th target voice;
6) using the above steps, compute the sparse representation of the target voice's feature vector sequence $S_{tgt}$ under the tensor dictionary $A$:
$$S_{tgt} \approx \Big(\sum_{n=1}^{N} \alpha_{tgt,n} A_n\Big) H_2,$$
where $\alpha_{tgt}$ denotes the tensor-dictionary weight coefficients of $S_{tgt}$ and $H_2$ the excitation matrix of $S_{tgt}$;
7) construct the voice dictionary of the arbitrary source on the basis of the tensor dictionary; using formulas (1), (3) and (4), compute the sparse representation of the source feature vector sequence $S_{src}$ under $A$:
$$S_{src} \approx \Big(\sum_{n=1}^{N} \alpha_{src,n} A_n\Big) H_1,$$
where $\alpha_{src}$ denotes the tensor-dictionary weight coefficients of the arbitrary source and $H_1$ the excitation matrix of $S_{src}$;
Step 3, reconstructing the speech content of the arbitrary source, realizing the conversion from the arbitrary source to the target voice: multiply the excitation matrix $H_1$ of the selected source sequence $S_{src}$ by the target voice dictionary, obtaining the converted target voice:
$$\hat{S}_{tgt} = \Big(\sum_{n=1}^{N} \alpha_{tgt,n} A_n\Big) H_1,$$
where $\hat{S}_{tgt}$ denotes the feature vector sequence of the converted target voice.
Steps 2 and 3 above are shown in Fig. 5.
The purpose of the voice conversion method of the invention is to convert, effectively and quickly, the voice of any person into a specific target voice. The principle of the conversion is:
In the dictionary-training stage, $N$ target voices are chosen as the basic speech of the tensor dictionary, and parallel speech segments are extracted from these $N$ target voices as the basic corpus for generating the tensor dictionary;
the feature vectors of each speech segment are extracted, and the characteristic parameter vectors are aligned using the multi-sequence dynamic time warping algorithm (Multi-Dynamic Time Warping, Multi-DTW);
enough frames of aligned feature vectors are collected and assembled into $N$ dictionaries, which are stacked together to constitute the tensor dictionary;
in the conversion stage, a voice conversion algorithm based on a unified tensor dictionary (Unified Tensor Dictionary, UTD) is used; the UTD algorithm automatically constructs the voice dictionaries of the source and target speakers from the same tensor dictionary, realizing voice conversion between any source and target speakers.
In the above method, the tensor dictionary in step 1 is established by aligning the parallel speech segments of the $N$ target voices with the multi-sequence dynamic time warping algorithm, forming a tensor dictionary composed of $N$ two-dimensional basic dictionaries.
In the above method, in the step of computing the sparse representations of the source sequence $S_{src}$ and the target sequence $S_{tgt}$ under the tensor dictionary $A$: when $S_{src}$ and $S_{tgt}$ are features extracted from the same speech content uttered by the source and the target speaker, it follows from the non-negative matrix factorization algorithm that the excitation matrices $H_1$ and $H_2$ of the two sparse representations under the same tensor dictionary $A$ are approximately equal. For two different utterances of the same speaker, the tensor-dictionary weight coefficients of their sparse representations under $A$ are likewise approximately equal. This is easy to understand: the original purpose of the weight coefficients under the tensor dictionary is to select the voice dictionary belonging to the speaker, and when the speaker's identity is unchanged, the selected dictionary naturally stays the same.
The tensor dictionary is established by aligning the parallel speech segments of at least two target voices with the multi-sequence dynamic time warping algorithm, forming a tensor dictionary composed of at least two two-dimensional basic dictionaries. The establishment of the tensor dictionary includes a constraint on the dictionary size, which is: number of basic speakers × feature dimension × number of speech frames. The choice of the number of basic speakers and of dictionary frames affects the quality of the converted speech and the complexity of the conversion. With too few basic speakers, the voice dictionary of the target speaker represented by the linear combination deviates considerably from the actual voice dictionary; with too many, the computational complexity rises. Experiments show that 10-20 basic speakers reconstruct the target voice dictionary well without incurring high computational complexity. The choice of the number of speech frames faces a similar trade-off; experiments show that with 3000-3500 frames, good conversion results are obtained without high computational complexity.
Embodiment 2
Differently from Embodiment 1, in the tensor dictionary generation process the characteristic parameter vectors are aligned using the multi-sequence dynamic time warping algorithm (Multi-Dynamic Time Warping, Multi-DTW); see Fig. 2, Fig. 3 and Fig. 4, the multi-sequence DTW schematic diagrams of the tensor dictionary generation process, which show the first step, the second step and the final step of aligning the $N$ sequences. Multi-sequence DTW is a dynamic time warping algorithm for aligning multiple parallel utterances. The familiar DTW between two speech sequences aligns two parallel utterances: assume that the two speech parameter vector sequences $S_1$ and $S_2$ are warped by DTW through the change functions $f_1(\cdot)$ and $f_2(\cdot)$. The speech characteristic parameter vectors after the change are then:
$$S'_1 = f_1(S_1) \quad (8)$$
$$S'_2 = f_2(S_2) \quad (9)$$
When multiple parallel speech vector sequences $S_1, S_2, \ldots, S_N$ need to be aligned, the multi-sequence DTW algorithm proposed here applies DTW in an alternating, synchronized fashion, with each characteristic parameter vector sequence participating in at most two DTWs. This avoids the over-lengthening and deformation of the synchronized sequences that occur when one randomly chosen sequence $S_i$ is used as the template against which the other $N-1$ sequences are synchronized one by one. The detailed process of the multi-sequence DTW algorithm is as follows:
(1) First apply DTW to $S_1, S_2$; assume the change functions after each DTW are denoted $f_1(\cdot)$ and $f_2(\cdot)$, so the vector sequences change as in formulas (8) and (9). As shown in Fig. 2, $S_1, S_2$ are unaligned before the change and aligned after it.
(2) Next, when DTW is applied to $S_2, S_3$, the speech parameter vector sequences change as:
$$S'_1 = f_1(S'_1),\; S'_2 = f_1(S'_2) \quad (10)$$
$$S'_3 = f_2(S_3) \quad (11)$$
As shown in Fig. 3, before DTW is applied to $S_2, S_3$, the sequences $S_1, S_2$ are already aligned, so when $S_2, S_3$ are warped, $S_1$ must undergo the same change as $S_2$. After the DTW change, $S_1, S_2, S_3$ are all aligned.
(3) Then apply DTW in order to the pairs $(S_3, S_4), (S_4, S_5), \ldots, (S_{N-1}, S_N)$. As in Fig. 3, the sequences that are already aligned must keep following the same subsequent changes.
(4) Finally, DTW is applied to $S_{N-1}, S_N$, and the speech parameter vector sequences change as:
$$S'_1 = f_1(S_1),\; S'_2 = f_1(S_2),\; \ldots,\; S'_{N-1} = f_1(S_{N-1}) \quad (12)$$
$$S'_N = f_2(S_N) \quad (13)$$
As shown in Fig. 4, before DTW is applied to $S_{N-1}, S_N$, the sequences $S_1, S_2, \ldots, S_{N-1}$ are already aligned, so they must follow the same change rule. Once the DTW between $S_{N-1}$ and $S_N$ is done, all the sequences $S_1, S_2, \ldots, S_N$ are aligned, and they have become the aligned speech characteristic vector sequences $S'_1, S'_2, \ldots, S'_N$.
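The chained alignment above can be sketched in a few lines. This is a simplified illustration, assuming Euclidean frame distance and a plain DTW; the warp of each new pairwise alignment is re-applied to every already-aligned sequence, so each sequence takes part in at most two DTWs:

```python
import numpy as np

def dtw_path(X, Y):
    """Classic DTW between (dim, Tx) and (dim, Ty); returns aligned index lists."""
    Tx, Ty = X.shape[1], Y.shape[1]
    D = np.full((Tx + 1, Ty + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, Tx + 1):
        for j in range(1, Ty + 1):
            cost = np.linalg.norm(X[:, i - 1] - Y[:, j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    i, j, path = Tx, Ty, []                       # backtrack the optimal path
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]])
        if step == 0: i, j = i - 1, j - 1
        elif step == 1: i -= 1
        else: j -= 1
    path.reverse()
    return [p[0] for p in path], [p[1] for p in path]

def multi_dtw(seqs):
    """Chain pairwise DTW over a list of (dim, T_n) sequences.

    After aligning (S1, S2), the warp applied to S2 when aligning (S2, S3)
    is re-applied to S1, and so on, mirroring steps (1)-(4) above.
    """
    aligned = [seqs[0]]
    for nxt in seqs[1:]:
        ix, iy = dtw_path(aligned[-1], nxt)
        aligned = [S[:, ix] for S in aligned]     # f1 applied to all aligned so far
        aligned.append(nxt[:, iy])                # f2 applied to the newcomer
    return aligned

rng = np.random.default_rng(0)
seqs = [rng.random((2, t)) for t in (5, 7, 6)]
out = multi_dtw(seqs)
print(len({S.shape[1] for S in out}))   # 1 -- all sequences share one length
```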
The above describes how to implement the invention and its embodiments; each step is exemplary, and those of ordinary skill in the art can determine the actual steps to be used according to the actual situation. Each step has a variety of implementations, all of which should fall within the scope of the invention; the scope of the invention should not be limited to this description. Those skilled in the art should appreciate that any modification or partial replacement that does not depart from the scope of the invention belongs to the scope defined by the claims of the invention.
Claims (10)
1. A voice conversion method between an arbitrary source and a target voice, comprising the following steps:
Step 1: establishing a tensor dictionary of at least one basic speaker's voice;
Step 2: constructing the voice dictionary of the target voice and the voice dictionary of the arbitrary source;
Step 3: reconstructing the voice content of the arbitrary source, realizing the conversion from the arbitrary source to the target voice.
2. The voice conversion method between an arbitrary source and a target voice according to claim 1, characterized in that said Step 2 uses a tensor algorithm to construct, within the tensor dictionary, the voice dictionary of the target voice and the voice dictionary of the arbitrary source.
3. The voice conversion method between an arbitrary source and a target voice according to claim 1, characterized in that said Step 1 is specifically as follows:
1) randomly selecting N basic speakers' voices from a corpus to form the basic speech of the tensor dictionary, and randomly selecting from these N basic speakers' voices the voice signals x1, x2, …, xN having identical semantic content, N ≥ 1, where x denotes a voice signal;
2) extracting the feature parameter vector sequence S1, S2, …, SN from each voice signal;
3) aligning the above feature parameter vector sequences S1, S2, …, SN by the multi-sequence dynamic time warping algorithm, the aligned speech feature parameter vector sequences being S′1, S′2, …, S′N;
4) randomly selecting, from the feature parameter vector sequences S′1, S′2, …, S′N, speech frame feature vectors S″1, S″2, …, S″N at the same position, as a part of the tensor dictionary of each target voice;
5) when the size of the tensor dictionary of each target voice is smaller than the preset tensor dictionary value of that target voice, repeating steps 2) to 4); when the size of the tensor dictionary of each target voice equals the preset value of that target voice, stopping steps 2) to 4), thereby forming the N basic speaker voice dictionaries A1, A2, …, AN;
6) stacking the N basic speaker voice dictionaries A1, A2, …, AN together to compose the tensor dictionary A.
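Steps 4) to 6) above can be sketched as follows. This is a hedged NumPy illustration, not the patent's code: the function name, array shapes and the frame-sampling policy are assumptions. The key point it shows is that the same frame positions are taken from every aligned sequence, so the per-speaker dictionaries stay parallel when stacked into the tensor A.

```python
import numpy as np

def build_tensor_dictionary(aligned_seqs, dict_size, rng=None):
    """Build a tensor dictionary A of shape (N speakers, feature dim, dict_size)
    by sampling identical frame positions from N frame-aligned sequences.
    aligned_seqs: list of N arrays, each (frames, feature_dim)."""
    rng = np.random.default_rng(rng)
    n_frames = aligned_seqs[0].shape[0]
    # Same sampled positions for every speaker: dictionary atoms stay parallel.
    pos = rng.choice(n_frames, size=dict_size, replace=dict_size > n_frames)
    dictionaries = [seq[pos].T for seq in aligned_seqs]  # each A_n: (feature_dim, dict_size)
    return np.stack(dictionaries, axis=0)                # tensor dictionary A
```

The `dict_size` argument plays the role of the preset tensor dictionary size of step 5).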
4. The voice conversion method between an arbitrary source and a target voice according to claim 3, characterized in that the multi-sequence dynamic time warping algorithm in said step 3) refers to an alternating multi-sequence dynamic time warping synchronization method, in which each feature parameter vector sequence in the target voice feature parameter vector sequences S1, S2, …, SN participates in multi-sequence dynamic time warping at most twice.
5. The voice conversion method between an arbitrary source and a target voice according to claim 1, characterized in that constructing the voice dictionary corresponding to the target voice in said Step 2 comprises the following process:
1) for the tensor dictionary A, calculating the weight coefficient α of each target voice in the tensor dictionary A, the weight coefficient α = [α1, α2, …, αN], 0 < αn < 1;
2) performing a weighted linear combination over the tensor dictionary of each target voice;
3) the weighted linear combination yields the voice dictionary of each target voice, which is then multiplied by the excitation matrix H to obtain the target voice S:
$$S \approx \left(\sum_{n=1}^{N} \alpha_n A_n\right) H \qquad (1)$$
where An denotes the voice dictionary of the n-th target voice in the tensor dictionary A, and αn (0 ≤ αn ≤ 1) denotes the weight coefficient of the tensor dictionary of the n-th target voice;
4) calculating the cost function of the weight coefficient α and the excitation matrix H:
$$d\!\left(S,\ \left(\sum_{n=1}^{N} \alpha_n A_n\right) H\right) + \lambda \left\lVert H \right\rVert_1 \qquad (2)$$
where d(·) is the KL divergence, ‖·‖1 denotes the 1-norm, and λ is the sparsity penalty factor, 0 ≤ λ ≤ 1, which ensures the sparsity of the excitation matrix, with H ≥ 0 and the weight coefficient α = [α1, α2, …, αN];
5) keeping the value of A from step 4) unchanged, iterating the multiplicative update rule of the non-negative matrix factorization algorithm to update the weight coefficient α and the excitation matrix H until the cost function value of the algorithm reaches a minimum, the iterative formulas being:
$$\alpha_n = \frac{\alpha_n}{\sum_{d,l} (A_n H)_{dl}} \sum_{d,l} \left( \frac{S \otimes (A_n H)}{\sum_{n} \alpha_n (A_n H)} \right)_{dl} \qquad (3)$$

$$H = H \otimes \left( \left(\sum_{n=1}^{N} \alpha_n A_n\right)^{T} \left( \frac{S}{\sum_{n=1}^{N} \alpha_n A_n H} \right) \right) \Bigg/ \left( \left(\sum_{n=1}^{N} \alpha_n A_n\right)^{T} \mathbf{1} + \lambda \otimes \mathbf{1} \right) \qquad (4)$$
where ⊗ denotes element-wise multiplication of matrices, / denotes element-wise division of matrices, T denotes matrix transposition, the division signs in the formulas all denote element-wise division, λ is the sparsity penalty factor, 1 denotes the all-ones matrix, and k denotes the k-th target voice;
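One round of the multiplicative updates (3) and (4) can be sketched in NumPy as follows. This is a minimal illustration under the assumption that A is stored as an (N, D, L) array; the epsilon guard against division by zero and the function name are additions of this sketch, not part of the patent.

```python
import numpy as np

def update_alpha_H(S, A, alpha, H, lam=0.1, eps=1e-12):
    """One multiplicative update of the dictionary weights alpha (eq. 3) and
    the excitation matrix H (eq. 4) for KL-divergence NMF with an L1 sparsity
    penalty lam on H.  A: (N, D, L) tensor dictionary; S: (D, T) features;
    alpha: (N,); H: (L, T); all entries non-negative."""
    AH = np.einsum('ndl,lt->ndt', A, H)          # per-speaker products A_n H
    recon = np.einsum('n,ndt->dt', alpha, AH)    # sum_n alpha_n (A_n H)
    # eq. (3): alpha_n <- alpha_n / sum(A_n H) * sum(S * (A_n H) / recon)
    ratio = S / (recon + eps)
    alpha = alpha / (AH.sum(axis=(1, 2)) + eps) * (ratio[None] * AH).sum(axis=(1, 2))
    # eq. (4): H <- H * (W^T (S / (W H))) / (W^T 1 + lam), W = sum_n alpha_n A_n
    W = np.einsum('n,ndl->dl', alpha, A)
    WH = W @ H
    H = H * (W.T @ (S / (WH + eps))) / (W.T @ np.ones_like(S) + lam + eps)
    return alpha, H
```

Because every factor is non-negative, the updates keep α and H non-negative, which is what allows the sparse representation to be read as additive dictionary activations.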
6) using formulas (1), (3) and (4), calculating the sparse representation of the target voice feature vector sequence Stgt under the tensor dictionary A:
$$S_{tgt} \approx \left(\sum_{n=1}^{N} \alpha_n^{tgt} A_n\right) H_2 \qquad (5)$$
where αtgt denotes the tensor dictionary weight coefficients of the target voice feature vector sequence Stgt, and H2 denotes the excitation matrix of the target voice feature vector sequence Stgt;
7) using formulas (1), (3) and (4), calculating the sparse representation of the feature vector sequence Ssrc of the arbitrary source under the tensor dictionary A:
$$S_{src} \approx \left(\sum_{n=1}^{N} \alpha_n^{src} A_n\right) H_1 \qquad (6)$$
where αsrc denotes the tensor dictionary weight coefficients of the arbitrary source, and H1 denotes the excitation matrix of the feature vector sequence Ssrc of the arbitrary source.
6. The voice conversion method between an arbitrary source and a target voice according to claim 5, characterized in that, in the step of calculating the sparse representations of the arbitrary-source feature vector sequence Ssrc and the arbitrary target voice feature vector sequence Stgt under the tensor dictionary A: since the feature vector sequences Ssrc and Stgt are features extracted from utterances of the same voice content spoken by the arbitrary source and the target speaker, it follows from the non-negative matrix factorization algorithm that the excitation matrices H1 and H2 of the sparse representations of the two sequences under the same tensor dictionary A are approximately equal; likewise, for two different speech segments of the same speaker, the tensor dictionary weight coefficients of the sparse representations under the same tensor dictionary A are also approximately equal.
7. The voice conversion method between an arbitrary source and a target voice according to claim 6, characterized in that said Step 3 selects the excitation matrix H1 of the arbitrary-source speech feature vector sequence Ssrc to be converted and multiplies it by the target voice dictionary, reconstructing the feature vector sequence of the converted target voice and thereby obtaining the converted target voice:
$$\hat{S}_{tgt} = \left(\sum_{n=1}^{N} \alpha_n^{tgt} A_n\right) H_1 \qquad (7)$$
where Ŝtgt denotes the feature vector sequence of the converted target voice.
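Equation (7) amounts to pairing the target speaker's weighted dictionary with the excitation estimated from the source utterance. A minimal sketch, with function name and array shapes assumed:

```python
import numpy as np

def convert_voice(A, alpha_tgt, H_src):
    """Eq. (7): reconstruct converted target features by combining the target
    speaker's dictionary weights alpha_tgt with the excitation matrix H_src
    estimated from the source utterance.
    A: (N, D, L) tensor dictionary; alpha_tgt: (N,); H_src: (L, T)."""
    W_tgt = np.einsum('n,ndl->dl', alpha_tgt, A)  # target voice dictionary
    return W_tgt @ H_src                          # converted feature sequence
```

This relies on the claim-6 observation that the excitation matrix carries the content (so H1 can stand in for H2) while the weight coefficients carry the speaker identity.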
8. The voice conversion method between an arbitrary source and a target voice according to claim 1, characterized in that the tensor dictionary in said Step 1 refers to aligning parallel speech segments of at least two target voices by the multi-sequence dynamic time warping algorithm, and establishing a tensor dictionary composed of at least two two-dimensional basic dictionaries.
9. The voice conversion method between an arbitrary source and a target voice according to claim 1, characterized in that said Step 1 includes a constraint on the size of the established tensor dictionary, the tensor dictionary size being: number of basic speakers × feature dimension × number of speech frames.
10. The voice conversion method between an arbitrary source and a target voice according to claim 1, characterized in that said Step 3 specifically consists of linearly weighting the voice dictionaries with the weight coefficients obtained from the feature vector sequence of the target voice, and then multiplying the result by the excitation matrix of the arbitrary source, realizing the conversion from the arbitrary source to the target voice.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710186569.5A CN107221321A (en) | 2017-03-27 | 2017-03-27 | A kind of phonetics transfer method being used between any source and target voice |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107221321A true CN107221321A (en) | 2017-09-29 |
Family
ID=59928387
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710186569.5A Pending CN107221321A (en) | 2017-03-27 | 2017-03-27 | A kind of phonetics transfer method being used between any source and target voice |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107221321A (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102306492A (en) * | 2011-09-09 | 2012-01-04 | 中国人民解放军理工大学 | Voice conversion method based on convolutive nonnegative matrix factorization |
CN103345923A (en) * | 2013-07-26 | 2013-10-09 | 电子科技大学 | Sparse representation based short-voice speaker recognition method |
Non-Patent Citations (2)
Title |
---|
RYO AIHARA等: ""MANY-TO-ONE VOICE CONVERSION USING EXEMPLAR-BASED SPARSE REPRESENTATION"", 《2015 IEEE WORKSHOP ON APPLICATIONS OF SIGNAL PROCESSING TO AUDIO AND ACOUSTICS》 * |
RYO AIHARA等: ""Multiple Non-Negative Matrix Factorization for Many-to-Many Voi", 《IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING》 * |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108510995A (en) * | 2018-02-06 | 2018-09-07 | 杭州电子科技大学 | Identity information hidden method towards voice communication |
CN108510995B (en) * | 2018-02-06 | 2021-06-08 | 杭州电子科技大学 | Identity information hiding method facing voice communication |
CN108766450A (en) * | 2018-04-16 | 2018-11-06 | 杭州电子科技大学 | A kind of phonetics transfer method decomposed based on harmonic wave impulse |
CN108766450B (en) * | 2018-04-16 | 2023-02-17 | 杭州电子科技大学 | Voice conversion method based on harmonic impulse decomposition |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| CB03 | Change of inventor or designer information | Inventor after: Jian Zhihua; Gu Dong. Inventor before: Jian Zhihua |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20170929 |