CN107785030A - Voice conversion method - Google Patents

Voice conversion method

Info

Publication number
CN107785030A
Authority
CN
China
Prior art keywords
sound
source
matrix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710971228.9A
Other languages
Chinese (zh)
Other versions
CN107785030B (en)
Inventor
沈博
刘春华
蒋克文
童利航
余帅东
简志华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University
Priority to CN201710971228.9A
Publication of CN107785030A
Application granted
Publication of CN107785030B
Legal status: Active


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/02: Methods for producing synthetic speech; Speech synthesisers

Abstract

The present invention provides a voice conversion method comprising the following steps. S1: extract the speech features of the source voice and the target voice from speech data; S2: apply dynamic time warping to the speech information of the source and target voices; S3: train a Gaussian mixture model with a clustering algorithm on the warped speech; S4: extract the speech information of the source voice, convert it with the data obtained from training, and synthesize the target voice. The invention provides an accurate and efficient method of converting the sound of the source voice into the sound of the target voice: according to the mathematical characteristics of the speech of the source and target speakers, both voices are modeled and the algorithmic computation is performed, so that the speech of the source speaker is accurately converted into the speech of the target speaker.

Description

Voice conversion method
Technical field
The present invention relates to the field of computational algorithms, and more particularly to a voice conversion method.
Background art
At present, after years of research and application at home and abroad, the recognized conversion model in the field of voice conversion is the Gaussian mixture model (GMM). Random initialization is conventionally chosen when initializing the cluster means, and the full covariance matrix is used in training and computation; this gives the clustering algorithm relatively high precision.
During cluster-mean initialization, the use of random initialization makes the computation too random, which invisibly lengthens the computing time and increases the probability of error under a limited number of iterations. On the other hand, because the covariance matrix after initialization is a full matrix, the most tedious and cumbersome part of computing the prior probability is the operation on the covariance matrix, which adds a large amount of computation.
Patent document CN107068165A, for example, discloses a voice conversion method. That system first trains an adaptive GMM together with bilinear frequency warping plus amplitude adjustment on a parallel corpus to obtain the conversion function required for voice conversion, and then uses that conversion function to perform high-quality voice conversion. Exploiting the relationship between the spatial distribution of speech feature parameters and the Gaussian mixture model, that invention replaces the traditional GMM with an adaptive GMM to solve the inaccuracy of the GMM when classifying speech feature parameters, and combines the adaptive GMM with bilinear frequency warping plus amplitude adjustment to construct a voice conversion system. However, that patent document neither selects a more effective initial value in the algorithm's search space beforehand, nor improves the computation speed by processing the matrix when calculating the prior probability.
Summary of the invention
The purpose of the present invention is to overcome the deficiencies of the prior art and to provide an accurate and efficient voice conversion method.
This purpose is achieved by the following technical solution. The voice conversion method of the present invention comprises the following steps:
S1: extract the speech features of the source voice and the target voice;
S2: apply dynamic time warping to the speech information of the source voice and the target voice;
S3: train a Gaussian mixture model with a clustering algorithm on the warped speech;
S4: extract the speech information of the source voice, convert it with the data obtained from training, and synthesize the target voice.
Preferably, in step S1, the speech features of the source voice and the target voice are extracted from speech data.
Step S1 is specifically carried out as follows:
S1.1: use the STRAIGHT model to extract the fundamental frequency f0, the aperiodic component ap, and the smoothed power spectrum sp from the pre-stored speech data;
S1.2: use the SPTK toolkit for dimensionality reduction, converting the smoothed power spectrum sp into the mel-generalized cepstrum mgc, to obtain the source speech matrix X and the target speech matrix Y.
In step S1.1, speech data of the source voice and the target voice with the same number of sentences and identical content are pre-stored.
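As an illustration only, the following sketch extracts f0, ap, and sp and converts sp into a mel-generalized cepstrum. It assumes the third-party Python packages pyworld (a WORLD vocoder implementation, used here as a STRAIGHT-like stand-in rather than the STRAIGHT model itself), pysptk (Python bindings for SPTK), and soundfile; the function name, file names, and parameter values are illustrative assumptions, not the patent's implementation.

```python
# Illustrative sketch only: pyworld stands in for STRAIGHT, pysptk for SPTK.
import numpy as np
import pyworld
import pysptk
import soundfile as sf  # assumed available for reading WAV files

def extract_features(wav_path, order=24, alpha=0.42):
    """Extract f0, aperiodic component ap, smoothed spectrum sp, and mgc."""
    x, fs = sf.read(wav_path)
    x = x.astype(np.float64)
    f0, t = pyworld.harvest(x, fs)         # fundamental frequency f0
    sp = pyworld.cheaptrick(x, f0, t, fs)  # smoothed power spectrum sp
    ap = pyworld.d4c(x, f0, t, fs)         # aperiodic component ap
    mgc = pysptk.sp2mc(sp, order=order, alpha=alpha)  # SPTK-style reduction
    return f0, ap, sp, mgc

# Speech matrices X (source) and Y (target), one mgc vector per frame
# (file names are hypothetical):
f0_x, ap_x, sp_x, X = extract_features("source_001.wav")
f0_y, ap_y, sp_y, Y = extract_features("target_001.wav")
```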
Step S2 is specifically realized as follows:
convert the two matrices of unequal length into equal-length matrices x and y with the dynamic time warping algorithm, and join the two matrices into a single matrix z.
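For illustration, a minimal sketch of this warping-and-joining step, assuming the third-party fastdtw package (any DTW implementation would do); the frame-wise Euclidean distance and the stacking convention are assumptions:

```python
# Illustrative sketch: align source X and target Y frame sequences with DTW,
# then join the aligned frames into one matrix z of joint vectors.
import numpy as np
from scipy.spatial.distance import euclidean
from fastdtw import fastdtw  # assumed third-party DTW implementation

def warp_and_join(X, Y):
    """X: (Tx, d) source frames, Y: (Ty, d) target frames -> z: (T, 2d)."""
    _, path = fastdtw(X, Y, dist=euclidean)  # optimal frame alignment path
    idx_x = [i for i, _ in path]
    idx_y = [j for _, j in path]
    x = X[idx_x]                  # equal-length warped source
    y = Y[idx_y]                  # equal-length warped target
    z = np.hstack([x, y])         # joint vectors z_j = [x_j ; y_j]
    return x, y, z
```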
Step S3 is realized by the following steps:
S3.1: initialize the Gaussian mixture model from the matrix z;
S3.2: run the expectation-maximization algorithm of the Gaussian mixture model on the initialized Gaussian model.
Step S3.1 is realized by the following steps:
S3.1.1: set the matrix dimensionality M and the number of cluster components k, and compute the cluster means with the K-means algorithm;
S3.1.2: compute the mixture coefficients from the number of points in each cluster;
S3.1.3: extract the data matrix C of each cluster mean from z, and compute the covariance matrix of that cluster from C.
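A minimal sketch of this initialization, assuming scikit-learn's KMeans as an acceptable stand-in for the K-means step (names and defaults are illustrative):

```python
# Illustrative sketch: initialize GMM weights, means, covariances from K-means.
import numpy as np
from sklearn.cluster import KMeans  # assumed stand-in for the K-means step

def init_gmm(z, k):
    """z: (N, M) joint feature matrix -> initial (alpha, u, Sigma)."""
    km = KMeans(n_clusters=k, n_init=10).fit(z)
    labels = km.labels_
    u = km.cluster_centers_                            # cluster means (S3.1.1)
    alpha = np.bincount(labels, minlength=k) / len(z)  # mixture coefficients (S3.1.2)
    Sigma = np.stack([np.cov(z[labels == i].T)         # per-cluster covariance (S3.1.3)
                      for i in range(k)])
    return alpha, u, Sigma
```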
Step S3.2 is realized by the following steps:
S3.2.1: first, according to the formula

P(z_j | u_i, \Sigma_i) = \frac{1}{(2\pi)^{M/2} |\Sigma_i|^{1/2}} \exp\left( -\frac{1}{2} (z_j - u_i)^T \Sigma_i^{-1} (z_j - u_i) \right)

where P(z_j | u_i, Σ_i) denotes the prior probability, z_j the joint vector, u_i the mean vector, Σ_i the covariance matrix, and T the matrix transpose;
in this and the following formulas, since each variable is a matrix of size M*M*K, it is specially provided that the subscript j denotes a row vector, i denotes a column vector, and r denotes the r-th block matrix;
compute the prior probability P(z_j | u_i, Σ_i);
S3.2.2: using Bayes' theorem, by the formula

\lambda(e_{ji}) = \frac{\alpha_i P(z_j | u_i, \Sigma_i)}{\sum_{r=1}^{k} \alpha_r P(z_j | u_r, \Sigma_r)}

where α_i denotes the weight coefficient of each component of the Gaussian mixture model, compute the posterior probability λ(e_ji);
S3.2.3: from the posterior probabilities computed by the above formula, compute the following variables according to the formulas

n_i = \sum_{j=1}^{N} \lambda(e_{ji})

u_i(new) = \frac{1}{n_i} \sum_{j=1}^{N} \lambda(e_{ji}) z_j

\alpha_i(new) = n_i / k

\Sigma_i(new) = \frac{1}{n_i} \sum_{j=1}^{N} \lambda(e_{ji}) (z_j - u_i)(z_j - u_i)^T

where N denotes the number of feature parameters of the training speech; n_i denotes the sum of the posterior probabilities of all feature vectors in the i-th component, u_i(new) the updated mean vector of the i-th component, α_i(new) the updated weight coefficient of the i-th component, and Σ_i(new) the updated covariance matrix of the i-th component.
S3.2.4: iterate the above three steps several times to obtain the final weight coefficients α, the covariance matrix Σ_z, and the cluster mean matrix u_z.
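A compact sketch of one EM pass over these updates, in plain numpy under the patent's notation (the density call and array shapes are assumptions; note that standard EM uses n_i/N for the weight update so that the weights sum to one, whereas the patent text writes n_i/k):

```python
# Illustrative numpy sketch of one EM iteration (S3.2.1 - S3.2.3).
import numpy as np
from scipy.stats import multivariate_normal

def em_step(z, alpha, u, Sigma):
    """z: (N, M); alpha: (k,); u: (k, M); Sigma: (k, M, M)."""
    N, _ = z.shape
    k = len(alpha)
    # E-step: prior densities P(z_j | u_i, Sigma_i) and posteriors lambda(e_ji).
    dens = np.stack([multivariate_normal.pdf(z, mean=u[i], cov=Sigma[i])
                     for i in range(k)], axis=1)   # shape (N, k)
    lam = alpha * dens
    lam /= lam.sum(axis=1, keepdims=True)          # Bayes' theorem
    # M-step: update n_i, u_i, alpha_i, Sigma_i.
    n = lam.sum(axis=0)                            # n_i, shape (k,)
    u_new = (lam.T @ z) / n[:, None]
    alpha_new = n / N   # standard EM; the patent text writes n_i / k
    Sigma_new = np.stack([
        ((z - u_new[i]).T * lam[:, i]) @ (z - u_new[i]) / n[i]
        for i in range(k)])
    return alpha_new, u_new, Sigma_new

# Iterate several times (the embodiment below uses 20 iterations):
# for _ in range(20):
#     alpha, u, Sigma = em_step(z, alpha, u, Sigma)
```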
Step S4 is realized by the following steps:
S4.1: with the u_z and Σ_z obtained after training, according to the formula

u_z = \begin{bmatrix} u_x \\ u_y \end{bmatrix}, \qquad \Sigma_z = \begin{bmatrix} \Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_{yy} \end{bmatrix}

where u_x denotes the feature-parameter mean vector of the source voice, u_y the feature-parameter mean vector of the target voice, Σ_xx the auto-covariance matrix of the source feature parameters, Σ_yy the auto-covariance matrix of the target feature parameters, and Σ_xy and Σ_yx the cross-covariance matrices; obtain the cluster mean vectors u_x and u_y of the source and target voices, the auto-covariance Σ_xx of the source speaker, and the cross-covariance Σ_xy of the two;
S4.2: select the speech information of any source utterance, extract the smoothed power spectrum sp', the fundamental frequency f0', and the aperiodic component ap' under its STRAIGHT model, obtain the data matrix x_t, and compute its prior probability P(c_i | x_t) by step S3.2;
S4.3: through the conversion function

F(x_t) = \sum_i P(c_i | x_t) \left[ u_y^{(i)} + \Sigma_{yx}^{(i)} \left( \Sigma_{xx}^{(i)} \right)^{-1} (x_t - u_x^{(i)}) \right]

where x_t denotes the speech feature to be converted, c_i denotes the i-th component of the Gaussian mixture model, and the superscript -1 denotes matrix inversion, obtain the mel-generalized cepstrum of the synthesized speech;
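For illustration, a sketch of this mapping under the reconstructed form above (the per-component statistics, argument names, and the helper convert_frame are assumptions):

```python
# Illustrative sketch of the GMM conversion function F(x_t).
import numpy as np

def convert_frame(x_t, post, u_x, u_y, Sigma_xx, Sigma_yx):
    """x_t: (d,) source frame; post: (k,) posteriors P(c_i | x_t);
    u_x, u_y: (k, d); Sigma_xx, Sigma_yx: (k, d, d)."""
    y_t = np.zeros_like(x_t)
    for i in range(len(post)):
        # u_y^i + Sigma_yx^i (Sigma_xx^i)^{-1} (x_t - u_x^i), posterior-weighted
        shift = Sigma_yx[i] @ np.linalg.solve(Sigma_xx[i], x_t - u_x[i])
        y_t += post[i] * (u_y[i] + shift)
    return y_t
```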
S4.4: convert the parameters computed by the above formula into a smoothed power spectrum and, combining the aperiodic component ap' and the fundamental frequency f0', synthesize the target voice with the STRAIGHT model.
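Again as a sketch only, the inverse conversion and synthesis could look as follows, assuming the same pyworld/pysptk stand-ins as above; the alpha value and FFT length are assumptions that must match the analysis settings:

```python
# Illustrative sketch: converted mgc -> smoothed power spectrum -> waveform.
import pyworld
import pysptk

def synthesize(mgc_converted, f0_src, ap_src, fs, alpha=0.42, fftlen=1024):
    # Back to a smoothed power spectrum (fftlen must match the analysis).
    sp = pysptk.mc2sp(mgc_converted, alpha=alpha, fftlen=fftlen)
    # STRAIGHT-style synthesis from f0', sp, ap' (pyworld stands in for STRAIGHT).
    return pyworld.synthesize(f0_src, sp, ap_src, fs)
```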
Beneficial effects
Compared with the prior art, the present invention has the following advantages:
1. In the first step, initializing the Gaussian mixture model, the K-means algorithm is easy to implement, converges quickly, and runs fast on large data sets; the K-means algorithm is therefore adopted to choose an initial value more effective than random initialization, which shrinks the search space of the expectation-maximization algorithm and improves its speed and precision.
2. Since speech data follow a Gaussian distribution, when computing the prior probability the covariance matrix can first be decomposed by the Cholesky decomposition into triangular factors before the calculation, which greatly increases the computation speed.
Brief description of the drawings
The invention will be further described below in conjunction with the accompanying drawings.
Fig. 1 is the basic flow chart of the voice conversion method of embodiment one of the present invention;
Fig. 2 is the training flow chart of the voice conversion method of embodiment one of the present invention;
Fig. 3 is the conversion and synthesis diagram of the voice conversion method of embodiment one of the present invention.
Detailed description of the embodiments
The technical solution of the present invention is further described below through specific embodiments and with reference to the accompanying drawings, but the present invention is not limited to these embodiments.
The present invention provides a voice conversion method; its technical principle is as follows:
Under the condition that sufficient speech data are available, the voice conversion method of the present invention provides an accurate and efficient way of converting the sound of the source voice into the sound of the target voice. According to the mathematical characteristics of the sounds of the source and target voices, both sounds are modeled and the algorithmic computation is performed, so that the sound of the source voice is accurately converted into the sound of the target voice.
Embodiment one
This embodiment provides a computational algorithm, specifically for use in a voice conversion method. In this embodiment, under the condition that sufficient speech data are available, the algorithm models both voices according to the mathematical characteristics of the sounds of the source and target voices and performs the algorithmic computation, accurately converting the sound of the source voice into the sound of the target voice.
As shown in Fig. 1, the voice conversion method comprises the following steps:
S1: extract the speech features of the source voice and the target voice. The speech data comprise speech data of the source and target voices with the same sentences and identical content, more than 100 sentences each (subject matter unrestricted); the speech data also comprise the mathematical characteristics of both voices;
S2: apply dynamic time warping to the speech information of the source and target voices. Step S2 comprises: converting the two matrices of unequal length into equal-length X and Y with the dynamic time warping (DTW) algorithm, so that the corresponding linear spectral frequencies of source and target have minimum distortion distance under the set distortion criterion, associating the feature sequences of the source and target voices at the parameter level, and joining the two matrices into one matrix z;
S3: train a Gaussian mixture model with a clustering algorithm on the warped speech. Step S3 comprises: initializing the Gaussian mixture model from the matrix z obtained in the previous step, and running the expectation-maximization algorithm of the Gaussian mixture model on the initialized Gaussian model, so that the models of the source and target voices are fitted to each other;
S4: extract the speech information of the source voice, convert it with the data obtained from training, and synthesize the target voice.
Specific steps:
1. Extract the speech features of the source voice and the target voice
(1) Prepare speech data of both voices with the same sentences and identical content, more than 100 sentences each (subject matter unrestricted), and use the STRAIGHT model to extract from them the fundamental frequency f0, the aperiodic component ap, and the smoothed power spectrum sp.
(2) Use the SPTK toolkit for dimensionality reduction, converting the smoothed power spectrum sp into the mel-generalized cepstrum mgc; this yields the source speech matrix X and the target speech matrix Y.
2. Apply dynamic time warping to both speech matrices
Convert the two matrices of unequal length into equal-length x and y with the dynamic time warping (DTW) algorithm, so that the corresponding linear spectral frequencies of source and target have minimum distortion distance under the set distortion criterion; associate the feature sequences of source and target at the parameter level, and join the two matrices into one matrix z.
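For reference, the alignment with minimum distortion distance is realized by the standard DTW recurrence (a textbook formulation added here for clarity, not spelled out in the patent; d denotes the frame distance, e.g. Euclidean):

D(1, 1) = d(x_1, y_1)

D(i, j) = d(x_i, y_j) + \min \{ D(i-1, j),\; D(i, j-1),\; D(i-1, j-1) \}

and the warping path is recovered by backtracking from D(T_x, T_y).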
3. Train the Gaussian mixture model with the clustering algorithm on the warped speech; as shown in Fig. 2, this step is divided into two parts:
(1) Initialize the Gaussian mixture model from the matrix z obtained in the previous step. The steps to realize this are:
Step 1: set the matrix dimensionality M and the number of cluster components k, and compute the cluster means u_z with the K-means algorithm.
Step 2: compute the mixture coefficients from the number of points in each cluster.
Step 3: extract the data matrix C of each cluster mean from the matrix z, and compute the covariance matrix Σ_z of that cluster from C.
(2) Run the expectation-maximization algorithm of the Gaussian mixture model on the initialized Gaussian model, so that the models of the source speaker and the target speaker are fitted to each other:
Step 1: first, according to the formula

P(z_j | u_i, \Sigma_i) = \frac{1}{(2\pi)^{M/2} |\Sigma_i|^{1/2}} \exp\left( -\frac{1}{2} (z_j - u_i)^T \Sigma_i^{-1} (z_j - u_i) \right)

where P(z_j | u_i, Σ_i) denotes the prior probability, z_j the joint vector, u_i the mean vector, Σ_i the covariance matrix, and T the matrix transpose; it is provided that the subscript j denotes a row vector, i a column vector, and r the r-th block matrix;
compute the prior probability P(z_j | u_i, Σ_i). Note that in this process the Cholesky decomposition is used to factor the covariance matrix into triangular form, which improves the computation speed while guaranteeing precision.
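As an illustration of this Cholesky trick, a sketch of the Gaussian log-density evaluated through triangular factors rather than through an explicit inverse and determinant (plain numpy/scipy; the function name is an assumption):

```python
# Illustrative sketch: prior probability via Cholesky factors of Sigma.
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def log_gaussian_chol(z, u, Sigma):
    """log P(z_j | u, Sigma) for all rows z_j of z, without forming Sigma^{-1}."""
    M = Sigma.shape[0]
    L = cholesky(Sigma, lower=True)             # Sigma = L L^T, L triangular
    # Solve L w = (z_j - u); the quadratic form becomes ||w||^2.
    w = solve_triangular(L, (z - u).T, lower=True)
    log_det = 2.0 * np.sum(np.log(np.diag(L)))  # log|Sigma| from diag(L)
    quad = np.sum(w * w, axis=0)
    return -0.5 * (M * np.log(2 * np.pi) + log_det + quad)
```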
Step 2: using Bayes' theorem, by the formula

\lambda(e_{ji}) = \frac{\alpha_i P(z_j | u_i, \Sigma_i)}{\sum_{r=1}^{k} \alpha_r P(z_j | u_r, \Sigma_r)}

where α_i denotes the weight coefficient of each component of the Gaussian mixture model, compute the posterior probability λ(e_ji).
Step 3: from the posterior probabilities computed by the above formula, compute the following variables according to the formulas

n_i = \sum_{j=1}^{N} \lambda(e_{ji})

u_i(new) = \frac{1}{n_i} \sum_{j=1}^{N} \lambda(e_{ji}) z_j

\alpha_i(new) = n_i / k

\Sigma_i(new) = \frac{1}{n_i} \sum_{j=1}^{N} \lambda(e_{ji}) (z_j - u_i)(z_j - u_i)^T

where N denotes the number of feature parameters of the training speech; n_i denotes the sum of the posterior probabilities of all feature vectors in the i-th component, u_i(new) the updated mean vector of the i-th component, α_i(new) the updated weight coefficient of the i-th component, and Σ_i(new) the updated covariance matrix of the i-th component.
Step 4: iterate the above three steps 20 times to obtain the final weight coefficients α, the covariance matrix Σ_z, and the cluster mean matrix u_z.
4. Conversion and synthesis stage
Step 1: with the u_z and Σ_z obtained after training, according to the formula

u_z = \begin{bmatrix} u_x \\ u_y \end{bmatrix}, \qquad \Sigma_z = \begin{bmatrix} \Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_{yy} \end{bmatrix}

where u_x denotes the feature-parameter mean vector of the source voice, u_y the feature-parameter mean vector of the target voice, Σ_xx the auto-covariance matrix of the source feature parameters, Σ_yy the auto-covariance matrix of the target feature parameters, and Σ_xy and Σ_yx the cross-covariance matrices; obtain the mean vectors u_x and u_y of the source and target voices, the auto-covariance Σ_xx of the source speaker, and the cross-covariance Σ_xy of the two.
Step 2: select the speech information of any utterance of the source speaker, extract the smoothed power spectrum sp', the fundamental frequency f0', and the aperiodic component ap' under its STRAIGHT model, then obtain the data matrix x_t by the method described above and compute its prior probability P(c_i | x_t).
Step 3: through the conversion function

F(x_t) = \sum_i P(c_i | x_t) \left[ u_y^{(i)} + \Sigma_{yx}^{(i)} \left( \Sigma_{xx}^{(i)} \right)^{-1} (x_t - u_x^{(i)}) \right]

where x_t denotes the speech feature to be converted, c_i denotes the i-th component of the Gaussian mixture model, and the superscript -1 denotes matrix inversion, obtain the mel-generalized cepstrum of the synthesized speech.
Step 4: convert the parameters computed by the above formula into a smoothed power spectrum and, combining the aperiodic component ap' and the fundamental frequency f0' above, synthesize the target speech with the STRAIGHT model.
In the first step, initializing the Gaussian mixture model, the K-means algorithm is easy to implement, converges quickly, and runs fast on large data sets; the K-means algorithm is therefore adopted to choose an initial value more effective than random initialization, which shrinks the search space of the expectation-maximization algorithm and improves its speed and precision.
On the other hand, since speech data follow a Gaussian distribution, when computing the prior probability the covariance matrix can first be decomposed by the Cholesky decomposition into triangular factors before the calculation, which greatly increases the computation speed.
The present invention can, according to the mathematical characteristics of the speech of the source and target speakers, model both voices and perform the algorithmic computation, so that the speech of the source speaker is accurately converted into the speech of the target speaker. While reducing the amount of computation, the algorithm provided by the present invention improves the accuracy of the voice conversion.
The general principles, principal features, and advantages of the present invention have been shown and described above. Those skilled in the art should understand that the present invention is not limited to the above embodiments; the above embodiments and the description merely illustrate the principles of the invention. Without departing from the spirit and scope of the present invention, various changes and improvements of the invention are possible, and these changes and improvements all fall within the scope of the claimed invention. The claimed scope of the invention is defined by the appended claims and their equivalents. The specific embodiments described herein are merely illustrative of the spirit of the invention; those skilled in the art may make various modifications or supplements to the described embodiments or substitute similar means, without departing from the spirit of the invention or exceeding the scope defined by the appended claims.

Claims (9)

1. A voice conversion method, characterized by comprising the steps of:
S1: extracting the speech features of the source voice and the target voice;
S2: applying dynamic time warping to the speech information of the source voice and the target voice;
S3: training a Gaussian mixture model with a clustering algorithm on the warped speech;
S4: extracting the speech information of the source voice, converting it with the data obtained from training, and synthesizing the target voice.
2. The voice conversion method of claim 1, characterized in that:
in step S1, the speech features of the source voice and the target voice are extracted from speech data.
3. The voice conversion method of claim 1, characterized in that step S1 is specifically carried out as follows:
S1.1: using the STRAIGHT model to extract the fundamental frequency f0, the aperiodic component ap, and the smoothed power spectrum sp from the pre-stored speech data;
S1.2: using the SPTK toolkit for dimensionality reduction, converting the smoothed power spectrum sp into the mel-generalized cepstrum mgc, to obtain the source speech matrix X and the target speech matrix Y.
4. The voice conversion method of claim 3, characterized in that in step S1.1, speech data of the source voice and the target voice with the same number of sentences and identical content are pre-stored.
5. The voice conversion method of claim 3 or 4, characterized in that step S2 is specifically realized as follows:
converting the two matrices of unequal length into equal-length x and y with the dynamic time warping algorithm, and joining the two matrices into a single matrix z.
6. The voice conversion method of claim 5, characterized in that step S3 is realized by the following steps:
S3.1: initializing the Gaussian mixture model from the matrix z;
S3.2: running the expectation-maximization algorithm of the Gaussian mixture model on the initialized Gaussian model.
7. The voice conversion method of claim 6, characterized in that step S3.1 is realized by the following steps:
S3.1.1: setting the matrix dimensionality M and the number of cluster components k, and computing the cluster means u_z with the K-means algorithm;
S3.1.2: computing the mixture coefficients from the number of points in each cluster;
S3.1.3: extracting the data matrix C of each cluster mean from z, and computing the covariance matrix Σ_z of that cluster from C.
8. The voice conversion method of claim 7, characterized in that step S3.2 is realized by the following steps:
S3.2.1: first, according to the formula

P(z_j | u_i, \Sigma_i) = \frac{1}{(2\pi)^{M/2} |\Sigma_i|^{1/2}} \exp\left( -\frac{1}{2} (z_j - u_i)^T \Sigma_i^{-1} (z_j - u_i) \right)

where P(z_j | u_i, Σ_i) denotes the prior probability, z_j the joint vector, u_i the mean vector, Σ_i the covariance matrix, and T the matrix transpose; it is provided that the subscript j denotes a row vector, i a column vector, and r the r-th block matrix;
computing the prior probability P(z_j | u_i, Σ_i);
S3.2.2: using Bayes' theorem, by the formula

\lambda(e_{ji}) = \frac{\alpha_i P(z_j | u_i, \Sigma_i)}{\sum_{r=1}^{k} \alpha_r P(z_j | u_r, \Sigma_r)}

where α_i denotes the weight coefficient of each component of the Gaussian mixture model, computing the posterior probability λ(e_ji);
S3.2.3: from the posterior probabilities computed by the above formula, computing the following variables according to the formulas

n_i = \sum_{j=1}^{N} \lambda(e_{ji})

u_i(new) = \frac{1}{n_i} \sum_{j=1}^{N} \lambda(e_{ji}) z_j

\alpha_i(new) = n_i / k

\Sigma_i(new) = \frac{1}{n_i} \sum_{j=1}^{N} \lambda(e_{ji}) (z_j - u_i)(z_j - u_i)^T

where N denotes the number of feature parameters of the training speech; n_i denotes the sum of the posterior probabilities of all feature vectors in the i-th component, u_i(new) the updated mean vector of the i-th component, α_i(new) the updated weight coefficient of the i-th component, and Σ_i(new) the updated covariance matrix of the i-th component;
S3.2.4: iterating the above three steps several times to obtain the final weight coefficients α, the covariance matrix Σ_z, and the cluster mean matrix u_z.
9. The voice conversion method of claim 8, characterized in that step S4 is realized by the following steps:
S4.1: with the u_z and Σ_z obtained after training, according to the formula

u_z = \begin{bmatrix} u_x \\ u_y \end{bmatrix}, \qquad \Sigma_z = \begin{bmatrix} \Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_{yy} \end{bmatrix}

where u_x denotes the feature-parameter mean vector of the source voice, u_y the feature-parameter mean vector of the target voice, Σ_xx the auto-covariance matrix of the source feature parameters, Σ_yy the auto-covariance matrix of the target feature parameters, and Σ_xy and Σ_yx the cross-covariance matrices; obtaining the mean vectors u_x and u_y of the source and target voices, the auto-covariance Σ_xx of the source speaker, and the cross-covariance Σ_xy of the two;
S4.2: selecting the speech information of any source utterance, extracting the smoothed power spectrum sp', the fundamental frequency f0', and the aperiodic component ap' under its STRAIGHT model, obtaining the data matrix x_t, and computing its prior probability P(c_i | x_t) by step S3.2;
S4.3: through the conversion function

F(x_t) = \sum_i P(c_i | x_t) \left[ u_y^{(i)} + \Sigma_{yx}^{(i)} \left( \Sigma_{xx}^{(i)} \right)^{-1} (x_t - u_x^{(i)}) \right]

where x_t denotes the speech feature to be converted, c_i denotes the i-th component of the Gaussian mixture model, and the superscript -1 denotes matrix inversion;
obtaining the mel-generalized cepstrum of the synthesized speech;
S4.4: converting the parameters computed by the above formula into a smoothed power spectrum and, combining the aperiodic component ap' and the fundamental frequency f0', synthesizing the target voice with the STRAIGHT model.
CN201710971228.9A 2017-10-18 2017-10-18 Voice conversion method Active CN107785030B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710971228.9A CN107785030B (en) 2017-10-18 2017-10-18 Voice conversion method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710971228.9A CN107785030B (en) 2017-10-18 2017-10-18 Voice conversion method

Publications (2)

Publication Number Publication Date
CN107785030A (en) 2018-03-09
CN107785030B CN107785030B (en) 2021-04-30

Family

ID=61434640

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710971228.9A Active CN107785030B (en) 2017-10-18 2017-10-18 Voice conversion method

Country Status (1)

Country Link
CN (1) CN107785030B (en)



Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102306492A (en) * 2011-09-09 2012-01-04 中国人民解放军理工大学 Voice conversion method based on convolutive nonnegative matrix factorization
CN103063899A (en) * 2012-12-20 2013-04-24 中国科学院西安光学精密机械研究所 Sensing optical fiber ring and reflective all-optical fiber current transformer
CN104091592A (en) * 2014-07-02 2014-10-08 常州工学院 Voice conversion system based on hidden Gaussian random field
CN105206259A (en) * 2015-11-03 2015-12-30 常州工学院 Voice conversion method
CN106205623A * 2016-06-17 2016-12-07 福建星网视易信息系统有限公司 Sound conversion method and device
CN107103914A * 2017-03-20 2017-08-29 南京邮电大学 High-quality voice conversion method

Non-Patent Citations (10)

* Cited by examiner, † Cited by third party
Title
李健: "基于GMM的汉语语音转换系统研究" (Research on a GMM-based Chinese voice conversion system), China Masters' Theses Full-text Database, Information Science and Technology *
李波: "语音转换的关键技术研究" (Research on key technologies of voice conversion), China Doctoral Dissertations and Masters' Theses Full-text Database (Doctoral), Information Science and Technology *
李清华: "语音转换技术研究及实现" (Research and implementation of voice conversion technology), China Masters' Theses Full-text Database, Information Science and Technology *
杨骋等: "基于简化STRAIGHT模型的语音信号重构" (Speech signal reconstruction based on a simplified STRAIGHT model), Command Information System and Technology *
简志华等: "语声转换技术发展及展望" (Development and prospects of voice conversion technology), Journal of Nanjing University of Posts and Telecommunications (Natural Science Edition) *
袁志明: "基于高斯混合模型和K-均值聚类算法的RBF神经网络实现男女声转换" (Male-female voice conversion with an RBF neural network based on a Gaussian mixture model and the K-means clustering algorithm), Heilongjiang Science and Technology Information *
解伟超: "语音转换中声道谱参数和基频变换算法的研究" (Research on vocal-tract spectrum parameter and fundamental-frequency conversion algorithms in voice conversion), China Masters' Theses Full-text Database, Information Science and Technology *
陈先同: "语音转换中特征参数及其转换方法的研究" (Research on feature parameters and their conversion methods in voice conversion), China Masters' Theses Full-text Database, Information Science and Technology *
马欢: "基于STRAIGHT模型的语音转换的研究" (Research on voice conversion based on the STRAIGHT model), Computer & Telecommunication *
鲁博: "语音转换技术研究" (Research on voice conversion technology), China Masters' Theses Full-text Database, Information Science and Technology *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110097890A * 2019-04-16 2019-08-06 北京搜狗科技发展有限公司 Speech processing method and apparatus, and apparatus for speech processing
CN110097890B (en) * 2019-04-16 2021-11-02 北京搜狗科技发展有限公司 Voice processing method and device for voice processing
CN111108558A (en) * 2019-12-20 2020-05-05 深圳市优必选科技股份有限公司 Voice conversion method and device, computer equipment and computer readable storage medium
CN111108558B (en) * 2019-12-20 2023-08-04 深圳市优必选科技股份有限公司 Voice conversion method, device, computer equipment and computer readable storage medium
CN111564158A (en) * 2020-04-29 2020-08-21 上海紫荆桃李科技有限公司 Configurable sound changing device

Also Published As

Publication number Publication date
CN107785030B (en) 2021-04-30

Similar Documents

Publication Publication Date Title
US11482207B2 (en) Waveform generation using end-to-end text-to-waveform system
Saito et al. One-to-many voice conversion based on tensor representation of speaker space
Toda et al. One-to-many and many-to-one voice conversion based on eigenvoices
CN108461079A Song synthesis method for timbre conversion
CN101178896B Unit selection speech synthesis method based on an acoustic statistical model
CN101833951B Multi-background modeling method for speaker recognition
CN104392718B Robust speech recognition method based on an acoustic model array
JP3412496B2 Speaker adaptation device and speech recognition device
CN102306492B Voice conversion method based on convolutive nonnegative matrix factorization
JP2013205697A Speech synthesizer, speech synthesis method, speech synthesis program and learning device
CN110060701A Many-to-many voice conversion method based on VAWGAN-AC
CN107785030A Voice conversion method
CN104217721B Voice conversion method under asymmetric-corpus conditions based on speaker model alignment
CN107301859A Voice conversion method under non-parallel text conditions based on adaptive Gaussian clustering
CN107333238A Indoor fingerprint rapid positioning method based on support vector regression
CN103280224A Voice conversion method under asymmetric corpus conditions based on an adaptive algorithm
CN110265051A Intelligent scoring and modeling method for sight-singing audio applied to basic music education
CN110047501A Many-to-many voice conversion method based on beta-VAE
CN110085254A Many-to-many voice conversion method based on beta-VAE and i-vector
CN106847248A Chord recognition method based on robust scale contour features and a support vector machine
CN103456302A Emotional speaker recognition method based on emotion GMM model weight synthesis
CN109584893A Many-to-many voice conversion system based on VAE and i-vector under non-parallel text conditions
Chien et al. Evaluation of glottal inverse filtering algorithms using a physiologically based articulatory speech synthesizer
CN103413548A Voice conversion method with joint spectrum modeling based on restricted Boltzmann machines
CN103886859B Voice conversion method based on one-to-many codebook mapping

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant