CN107785030A - Voice conversion method - Google Patents

Voice conversion method

Info

Publication number
CN107785030A
Authority
CN
China
Prior art keywords
sound
source
matrix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710971228.9A
Other languages
Chinese (zh)
Other versions
CN107785030B (en)
Inventor
沈博
刘春华
蒋克文
童利航
余帅东
简志华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University
Priority to CN201710971228.9A
Publication of CN107785030A
Application granted
Publication of CN107785030B
Legal status: Active


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/02: Methods for producing synthetic speech; Speech synthesisers

Abstract

The present invention provides a voice conversion method comprising the following steps. S1: extract the speech features of the source voice and the target voice from speech data; S2: apply dynamic time warping to the speech information of the source and target voices; S3: train a Gaussian mixture model with a clustering algorithm on the warped speech; S4: extract the speech information of the source voice, convert it with the data obtained from training, and synthesize the target voice. The invention provides an accurate and efficient method of converting the sound of the source voice into the sound of the target voice: according to the mathematical characteristics of the speech of the source and target speakers, both voices are modeled and the algorithmic computation is performed, so that the speech of the source speaker is accurately converted into the speech of the target speaker.

Description

Voice conversion method
Technical field
The present invention relates to the field of computational algorithms, and more particularly to a voice conversion method.
Background art
At present, after years of research and application at home and abroad, the recognized conversion model in the field of voice conversion is the Gaussian mixture model (GMM). Random initialization is conventionally chosen when initializing the cluster means, and the full covariance matrix is used in training and computation; this gives the clustering algorithm relatively high precision.
During cluster-mean initialization, the use of random initialization makes the computation too random, which invisibly lengthens the computing time and increases the probability of error under a limited number of iterations. On the other hand, because the covariance matrix after initialization is a full matrix, the most tedious and cumbersome part of computing the prior probability is the operation on the covariance matrix, which adds a large amount of computation.
Patent document CN107068165A, for example, discloses a voice conversion method. That system first trains an adaptive GMM together with bilinear frequency warping plus amplitude adjustment on a parallel corpus to obtain the conversion function required for voice conversion, and then uses that conversion function to perform high-quality voice conversion. Exploiting the relationship between the spatial distribution of speech feature parameters and the Gaussian mixture model, that invention replaces the traditional GMM with an adaptive GMM to solve the inaccuracy of the GMM when classifying speech feature parameters, and combines the adaptive GMM with bilinear frequency warping plus amplitude adjustment to construct a voice conversion system. However, that patent document neither selects a more effective initial value in the algorithm's search space beforehand, nor improves the computation speed by processing the matrix when calculating the prior probability.
Summary of the invention
The purpose of the present invention is to overcome the deficiencies of the prior art and to provide an accurate and efficient voice conversion method.
This purpose is achieved by the following technical solution. The voice conversion method of the present invention comprises the following steps:
S1: extract the speech features of the source voice and the target voice;
S2: apply dynamic time warping to the speech information of the source voice and the target voice;
S3: train a Gaussian mixture model with a clustering algorithm on the warped speech;
S4: extract the speech information of the source voice, convert it with the data obtained from training, and synthesize the target voice.
Preferably, in step S1, the speech features of the source voice and the target voice are extracted from speech data.
Step S1 is specifically carried out as follows:
S1.1: use the STRAIGHT model to extract the fundamental frequency f0, the aperiodic component ap, and the smoothed power spectrum sp from the pre-stored speech data;
S1.2: use the SPTK toolkit for dimensionality reduction, converting the smoothed power spectrum sp into the mel-generalized cepstrum mgc, to obtain the source speech matrix X and the target speech matrix Y.
In step S1.1, speech data of the source voice and the target voice with the same number of sentences and identical content are pre-stored.
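As an illustration only, the following sketch extracts f0, ap, and sp and converts sp into a mel-generalized cepstrum. It assumes the third-party Python packages pyworld (a WORLD vocoder implementation, used here as a STRAIGHT-like stand-in rather than the STRAIGHT model itself), pysptk (Python bindings for SPTK), and soundfile; the function name, file names, and parameter values are illustrative assumptions, not the patent's implementation.

```python
# Illustrative sketch only: pyworld stands in for STRAIGHT, pysptk for SPTK.
import numpy as np
import pyworld
import pysptk
import soundfile as sf  # assumed available for reading WAV files

def extract_features(wav_path, order=24, alpha=0.42):
    """Extract f0, aperiodic component ap, smoothed spectrum sp, and mgc."""
    x, fs = sf.read(wav_path)
    x = x.astype(np.float64)
    f0, t = pyworld.harvest(x, fs)         # fundamental frequency f0
    sp = pyworld.cheaptrick(x, f0, t, fs)  # smoothed power spectrum sp
    ap = pyworld.d4c(x, f0, t, fs)         # aperiodic component ap
    mgc = pysptk.sp2mc(sp, order=order, alpha=alpha)  # SPTK-style reduction
    return f0, ap, sp, mgc

# Speech matrices X (source) and Y (target), one mgc vector per frame
# (file names are hypothetical):
f0_x, ap_x, sp_x, X = extract_features("source_001.wav")
f0_y, ap_y, sp_y, Y = extract_features("target_001.wav")
```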
Step S2 is specifically realized as follows:
convert the two matrices of unequal length into equal-length matrices x and y with the dynamic time warping algorithm, and join the two matrices into a single matrix z.
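For illustration, a minimal sketch of this warping-and-joining step, assuming the third-party fastdtw package (any DTW implementation would do); the frame-wise Euclidean distance and the stacking convention are assumptions:

```python
# Illustrative sketch: align source X and target Y frame sequences with DTW,
# then join the aligned frames into one matrix z of joint vectors.
import numpy as np
from scipy.spatial.distance import euclidean
from fastdtw import fastdtw  # assumed third-party DTW implementation

def warp_and_join(X, Y):
    """X: (Tx, d) source frames, Y: (Ty, d) target frames -> z: (T, 2d)."""
    _, path = fastdtw(X, Y, dist=euclidean)  # optimal frame alignment path
    idx_x = [i for i, _ in path]
    idx_y = [j for _, j in path]
    x = X[idx_x]                  # equal-length warped source
    y = Y[idx_y]                  # equal-length warped target
    z = np.hstack([x, y])         # joint vectors z_j = [x_j ; y_j]
    return x, y, z
```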
Step S3 is realized by the following steps:
S3.1: initialize the Gaussian mixture model from the matrix z;
S3.2: run the expectation-maximization algorithm of the Gaussian mixture model on the initialized Gaussian model.
Step S3.1 is realized by the following steps:
S3.1.1: set the matrix dimensionality M and the number of cluster components k, and compute the cluster means with the K-means algorithm;
S3.1.2: compute the mixture coefficients from the number of points in each cluster;
S3.1.3: extract the data matrix C of each cluster mean from z, and compute the covariance matrix of that cluster from C.
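A minimal sketch of this initialization, assuming scikit-learn's KMeans as an acceptable stand-in for the K-means step (names and defaults are illustrative):

```python
# Illustrative sketch: initialize GMM weights, means, covariances from K-means.
import numpy as np
from sklearn.cluster import KMeans  # assumed stand-in for the K-means step

def init_gmm(z, k):
    """z: (N, M) joint feature matrix -> initial (alpha, u, Sigma)."""
    km = KMeans(n_clusters=k, n_init=10).fit(z)
    labels = km.labels_
    u = km.cluster_centers_                            # cluster means (S3.1.1)
    alpha = np.bincount(labels, minlength=k) / len(z)  # mixture coefficients (S3.1.2)
    Sigma = np.stack([np.cov(z[labels == i].T)         # per-cluster covariance (S3.1.3)
                      for i in range(k)])
    return alpha, u, Sigma
```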
Step S3.2 is realized by the following steps:
S3.2.1: first, according to the formula

P(z_j | u_i, \Sigma_i) = \frac{1}{(2\pi)^{M/2} |\Sigma_i|^{1/2}} \exp\left( -\frac{1}{2} (z_j - u_i)^T \Sigma_i^{-1} (z_j - u_i) \right)

where P(z_j | u_i, Σ_i) denotes the prior probability, z_j the joint vector, u_i the mean vector, Σ_i the covariance matrix, and T the matrix transpose;
in this and the following formulas, since each variable is a matrix of size M*M*K, it is specially provided that the subscript j denotes a row vector, i denotes a column vector, and r denotes the r-th block matrix;
compute the prior probability P(z_j | u_i, Σ_i);
S3.2.2: using Bayes' theorem, by the formula

\lambda(e_{ji}) = \frac{\alpha_i P(z_j | u_i, \Sigma_i)}{\sum_{r=1}^{k} \alpha_r P(z_j | u_r, \Sigma_r)}

where α_i denotes the weight coefficient of each component of the Gaussian mixture model, compute the posterior probability λ(e_ji);
S3.2.3: from the posterior probabilities computed by the above formula, compute the following variables according to the formulas

n_i = \sum_{j=1}^{N} \lambda(e_{ji})

u_i(new) = \frac{1}{n_i} \sum_{j=1}^{N} \lambda(e_{ji}) z_j

\alpha_i(new) = n_i / k

\Sigma_i(new) = \frac{1}{n_i} \sum_{j=1}^{N} \lambda(e_{ji}) (z_j - u_i)(z_j - u_i)^T

where N denotes the number of feature parameters of the training speech; n_i denotes the sum of the posterior probabilities of all feature vectors in the i-th component, u_i(new) the updated mean vector of the i-th component, α_i(new) the updated weight coefficient of the i-th component, and Σ_i(new) the updated covariance matrix of the i-th component.
S3.2.4: iterate the above three steps several times to obtain the final weight coefficients α, the covariance matrix Σ_z, and the cluster mean matrix u_z.
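A compact sketch of one EM pass over these updates, in plain numpy under the patent's notation (the density call and array shapes are assumptions; note that standard EM uses n_i/N for the weight update so that the weights sum to one, whereas the patent text writes n_i/k):

```python
# Illustrative numpy sketch of one EM iteration (S3.2.1 - S3.2.3).
import numpy as np
from scipy.stats import multivariate_normal

def em_step(z, alpha, u, Sigma):
    """z: (N, M); alpha: (k,); u: (k, M); Sigma: (k, M, M)."""
    N, _ = z.shape
    k = len(alpha)
    # E-step: prior densities P(z_j | u_i, Sigma_i) and posteriors lambda(e_ji).
    dens = np.stack([multivariate_normal.pdf(z, mean=u[i], cov=Sigma[i])
                     for i in range(k)], axis=1)   # shape (N, k)
    lam = alpha * dens
    lam /= lam.sum(axis=1, keepdims=True)          # Bayes' theorem
    # M-step: update n_i, u_i, alpha_i, Sigma_i.
    n = lam.sum(axis=0)                            # n_i, shape (k,)
    u_new = (lam.T @ z) / n[:, None]
    alpha_new = n / N   # standard EM; the patent text writes n_i / k
    Sigma_new = np.stack([
        ((z - u_new[i]).T * lam[:, i]) @ (z - u_new[i]) / n[i]
        for i in range(k)])
    return alpha_new, u_new, Sigma_new

# Iterate several times (the embodiment below uses 20 iterations):
# for _ in range(20):
#     alpha, u, Sigma = em_step(z, alpha, u, Sigma)
```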
Step S4 is realized by the following steps:
S4.1: with the u_z and Σ_z obtained after training, according to the formula

u_z = \begin{bmatrix} u_x \\ u_y \end{bmatrix}, \qquad \Sigma_z = \begin{bmatrix} \Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_{yy} \end{bmatrix}

where u_x denotes the feature-parameter mean vector of the source voice, u_y the feature-parameter mean vector of the target voice, Σ_xx the auto-covariance matrix of the source feature parameters, Σ_yy the auto-covariance matrix of the target feature parameters, and Σ_xy and Σ_yx the cross-covariance matrices; obtain the cluster mean vectors u_x and u_y of the source and target voices, the auto-covariance Σ_xx of the source speaker, and the cross-covariance Σ_xy of the two;
S4.2: select the speech information of any source utterance, extract the smoothed power spectrum sp', the fundamental frequency f0', and the aperiodic component ap' under its STRAIGHT model, obtain the data matrix x_t, and compute its prior probability P(c_i | x_t) by step S3.2;
S4.3: through the conversion function

F(x_t) = \sum_i P(c_i | x_t) \left[ u_y^{(i)} + \Sigma_{yx}^{(i)} \left( \Sigma_{xx}^{(i)} \right)^{-1} (x_t - u_x^{(i)}) \right]

where x_t denotes the speech feature to be converted, c_i denotes the i-th component of the Gaussian mixture model, and the superscript -1 denotes matrix inversion, obtain the mel-generalized cepstrum of the synthesized speech;
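For illustration, a sketch of this mapping under the reconstructed form above (the per-component statistics, argument names, and the helper convert_frame are assumptions):

```python
# Illustrative sketch of the GMM conversion function F(x_t).
import numpy as np

def convert_frame(x_t, post, u_x, u_y, Sigma_xx, Sigma_yx):
    """x_t: (d,) source frame; post: (k,) posteriors P(c_i | x_t);
    u_x, u_y: (k, d); Sigma_xx, Sigma_yx: (k, d, d)."""
    y_t = np.zeros_like(x_t)
    for i in range(len(post)):
        # u_y^i + Sigma_yx^i (Sigma_xx^i)^{-1} (x_t - u_x^i), posterior-weighted
        shift = Sigma_yx[i] @ np.linalg.solve(Sigma_xx[i], x_t - u_x[i])
        y_t += post[i] * (u_y[i] + shift)
    return y_t
```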
S4.4: convert the parameters computed by the above formula into a smoothed power spectrum and, combining the aperiodic component ap' and the fundamental frequency f0', synthesize the target voice with the STRAIGHT model.
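Again as a sketch only, the inverse conversion and synthesis could look as follows, assuming the same pyworld/pysptk stand-ins as above; the alpha value and FFT length are assumptions that must match the analysis settings:

```python
# Illustrative sketch: converted mgc -> smoothed power spectrum -> waveform.
import pyworld
import pysptk

def synthesize(mgc_converted, f0_src, ap_src, fs, alpha=0.42, fftlen=1024):
    # Back to a smoothed power spectrum (fftlen must match the analysis).
    sp = pysptk.mc2sp(mgc_converted, alpha=alpha, fftlen=fftlen)
    # STRAIGHT-style synthesis from f0', sp, ap' (pyworld stands in for STRAIGHT).
    return pyworld.synthesize(f0_src, sp, ap_src, fs)
```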
Beneficial effects
Compared with the prior art, the present invention has the following advantages:
1. In the first step, initializing the Gaussian mixture model, the K-means algorithm is easy to implement, converges quickly, and runs fast on large data sets; the K-means algorithm is therefore adopted to choose an initial value more effective than random initialization, which shrinks the search space of the expectation-maximization algorithm and improves its speed and precision.
2. Since speech data follow a Gaussian distribution, when computing the prior probability the covariance matrix can first be decomposed by the Cholesky decomposition into triangular factors before the calculation, which greatly increases the computation speed.
Brief description of the drawings
The invention will be further described below in conjunction with the accompanying drawings.
Fig. 1 is the basic flow chart of the voice conversion method of embodiment one of the present invention;
Fig. 2 is the training flow chart of the voice conversion method of embodiment one of the present invention;
Fig. 3 is the conversion and synthesis diagram of the voice conversion method of embodiment one of the present invention.
Detailed description of the embodiments
The technical solution of the present invention is further described below through specific embodiments and with reference to the accompanying drawings, but the present invention is not limited to these embodiments.
The present invention provides a voice conversion method; its technical principle is as follows:
Under the condition that sufficient speech data are available, the voice conversion method of the present invention provides an accurate and efficient way of converting the sound of the source voice into the sound of the target voice. According to the mathematical characteristics of the sounds of the source and target voices, both sounds are modeled and the algorithmic computation is performed, so that the sound of the source voice is accurately converted into the sound of the target voice.
Embodiment one
This embodiment provides a computational algorithm, specifically for use in a voice conversion method. In this embodiment, under the condition that sufficient speech data are available, the algorithm models both voices according to the mathematical characteristics of the sounds of the source and target voices and performs the algorithmic computation, accurately converting the sound of the source voice into the sound of the target voice.
As shown in Fig. 1, the voice conversion method comprises the following steps:
S1: extract the speech features of the source voice and the target voice. The speech data comprise speech data of the source and target voices with the same sentences and identical content, more than 100 sentences each (subject matter unrestricted); the speech data also comprise the mathematical characteristics of both voices;
S2: apply dynamic time warping to the speech information of the source and target voices. Step S2 comprises: converting the two matrices of unequal length into equal-length X and Y with the dynamic time warping (DTW) algorithm, so that the corresponding linear spectral frequencies of source and target have minimum distortion distance under the set distortion criterion, associating the feature sequences of the source and target voices at the parameter level, and joining the two matrices into one matrix z;
S3: train a Gaussian mixture model with a clustering algorithm on the warped speech. Step S3 comprises: initializing the Gaussian mixture model from the matrix z obtained in the previous step, and running the expectation-maximization algorithm of the Gaussian mixture model on the initialized Gaussian model, so that the models of the source and target voices are fitted to each other;
S4: extract the speech information of the source voice, convert it with the data obtained from training, and synthesize the target voice.
Specific steps:
1. Extract the speech features of the source voice and the target voice
(1) Prepare speech data of both voices with the same sentences and identical content, more than 100 sentences each (subject matter unrestricted), and use the STRAIGHT model to extract from them the fundamental frequency f0, the aperiodic component ap, and the smoothed power spectrum sp.
(2) Use the SPTK toolkit for dimensionality reduction, converting the smoothed power spectrum sp into the mel-generalized cepstrum mgc; this yields the source speech matrix X and the target speech matrix Y.
2. Apply dynamic time warping to both speech matrices
Convert the two matrices of unequal length into equal-length x and y with the dynamic time warping (DTW) algorithm, so that the corresponding linear spectral frequencies of source and target have minimum distortion distance under the set distortion criterion; associate the feature sequences of source and target at the parameter level, and join the two matrices into one matrix z.
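For reference, the alignment with minimum distortion distance is realized by the standard DTW recurrence (a textbook formulation added here for clarity, not spelled out in the patent; d denotes the frame distance, e.g. Euclidean):

D(1, 1) = d(x_1, y_1)

D(i, j) = d(x_i, y_j) + \min \{ D(i-1, j),\; D(i, j-1),\; D(i-1, j-1) \}

and the warping path is recovered by backtracking from D(T_x, T_y).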
3. Train the Gaussian mixture model with the clustering algorithm on the warped speech; as shown in Fig. 2, this step is divided into two parts:
(1) Initialize the Gaussian mixture model from the matrix z obtained in the previous step. The steps to realize this are:
Step 1: set the matrix dimensionality M and the number of cluster components k, and compute the cluster means u_z with the K-means algorithm.
Step 2: compute the mixture coefficients from the number of points in each cluster.
Step 3: extract the data matrix C of each cluster mean from the matrix z, and compute the covariance matrix Σ_z of that cluster from C.
(2) Run the expectation-maximization algorithm of the Gaussian mixture model on the initialized Gaussian model, so that the models of the source speaker and the target speaker are fitted to each other:
Step 1: first, according to the formula

P(z_j | u_i, \Sigma_i) = \frac{1}{(2\pi)^{M/2} |\Sigma_i|^{1/2}} \exp\left( -\frac{1}{2} (z_j - u_i)^T \Sigma_i^{-1} (z_j - u_i) \right)

where P(z_j | u_i, Σ_i) denotes the prior probability, z_j the joint vector, u_i the mean vector, Σ_i the covariance matrix, and T the matrix transpose; it is provided that the subscript j denotes a row vector, i a column vector, and r the r-th block matrix;
compute the prior probability P(z_j | u_i, Σ_i). Note that in this process the Cholesky decomposition is used to factor the covariance matrix into triangular form, which improves the computation speed while guaranteeing precision.
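As an illustration of this Cholesky trick, a sketch of the Gaussian log-density evaluated through triangular factors rather than through an explicit inverse and determinant (plain numpy/scipy; the function name is an assumption):

```python
# Illustrative sketch: prior probability via Cholesky factors of Sigma.
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def log_gaussian_chol(z, u, Sigma):
    """log P(z_j | u, Sigma) for all rows z_j of z, without forming Sigma^{-1}."""
    M = Sigma.shape[0]
    L = cholesky(Sigma, lower=True)             # Sigma = L L^T, L triangular
    # Solve L w = (z_j - u); the quadratic form becomes ||w||^2.
    w = solve_triangular(L, (z - u).T, lower=True)
    log_det = 2.0 * np.sum(np.log(np.diag(L)))  # log|Sigma| from diag(L)
    quad = np.sum(w * w, axis=0)
    return -0.5 * (M * np.log(2 * np.pi) + log_det + quad)
```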
Step 2: using Bayes' theorem, by the formula

\lambda(e_{ji}) = \frac{\alpha_i P(z_j | u_i, \Sigma_i)}{\sum_{r=1}^{k} \alpha_r P(z_j | u_r, \Sigma_r)}

where α_i denotes the weight coefficient of each component of the Gaussian mixture model, compute the posterior probability λ(e_ji).
Step 3: from the posterior probabilities computed by the above formula, compute the following variables according to the formulas

n_i = \sum_{j=1}^{N} \lambda(e_{ji})

u_i(new) = \frac{1}{n_i} \sum_{j=1}^{N} \lambda(e_{ji}) z_j

\alpha_i(new) = n_i / k

\Sigma_i(new) = \frac{1}{n_i} \sum_{j=1}^{N} \lambda(e_{ji}) (z_j - u_i)(z_j - u_i)^T

where N denotes the number of feature parameters of the training speech; n_i denotes the sum of the posterior probabilities of all feature vectors in the i-th component, u_i(new) the updated mean vector of the i-th component, α_i(new) the updated weight coefficient of the i-th component, and Σ_i(new) the updated covariance matrix of the i-th component.
Step 4: iterate the above three steps 20 times to obtain the final weight coefficients α, the covariance matrix Σ_z, and the cluster mean matrix u_z.
4. Conversion and synthesis stage
Step 1: with the u_z and Σ_z obtained after training, according to the formula

u_z = \begin{bmatrix} u_x \\ u_y \end{bmatrix}, \qquad \Sigma_z = \begin{bmatrix} \Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_{yy} \end{bmatrix}

where u_x denotes the feature-parameter mean vector of the source voice, u_y the feature-parameter mean vector of the target voice, Σ_xx the auto-covariance matrix of the source feature parameters, Σ_yy the auto-covariance matrix of the target feature parameters, and Σ_xy and Σ_yx the cross-covariance matrices; obtain the mean vectors u_x and u_y of the source and target voices, the auto-covariance Σ_xx of the source speaker, and the cross-covariance Σ_xy of the two.
Step 2: select the speech information of any utterance of the source speaker, extract the smoothed power spectrum sp', the fundamental frequency f0', and the aperiodic component ap' under its STRAIGHT model, then obtain the data matrix x_t by the method described above and compute its prior probability P(c_i | x_t).
Step 3: through the conversion function

F(x_t) = \sum_i P(c_i | x_t) \left[ u_y^{(i)} + \Sigma_{yx}^{(i)} \left( \Sigma_{xx}^{(i)} \right)^{-1} (x_t - u_x^{(i)}) \right]

where x_t denotes the speech feature to be converted, c_i denotes the i-th component of the Gaussian mixture model, and the superscript -1 denotes matrix inversion, obtain the mel-generalized cepstrum of the synthesized speech.
Step 4: convert the parameters computed by the above formula into a smoothed power spectrum and, combining the aperiodic component ap' and the fundamental frequency f0' above, synthesize the target speech with the STRAIGHT model.
In the first step, initializing the Gaussian mixture model, the K-means algorithm is easy to implement, converges quickly, and runs fast on large data sets; the K-means algorithm is therefore adopted to choose an initial value more effective than random initialization, which shrinks the search space of the expectation-maximization algorithm and improves its speed and precision.
On the other hand, since speech data follow a Gaussian distribution, when computing the prior probability the covariance matrix can first be decomposed by the Cholesky decomposition into triangular factors before the calculation, which greatly increases the computation speed.
The present invention can, according to the mathematical characteristics of the speech of the source and target speakers, model both voices and perform the algorithmic computation, so that the speech of the source speaker is accurately converted into the speech of the target speaker. While reducing the amount of computation, the algorithm provided by the present invention improves the accuracy of the voice conversion.
The general principles, principal features, and advantages of the present invention have been shown and described above. Those skilled in the art should understand that the present invention is not limited to the above embodiments; the above embodiments and the description merely illustrate the principles of the invention. Without departing from the spirit and scope of the present invention, various changes and improvements of the invention are possible, and these changes and improvements all fall within the scope of the claimed invention. The claimed scope of the invention is defined by the appended claims and their equivalents. The specific embodiments described herein are merely illustrative of the spirit of the invention; those skilled in the art may make various modifications or supplements to the described embodiments or substitute similar means, without departing from the spirit of the invention or exceeding the scope defined by the appended claims.

Claims (9)

1. A voice conversion method, characterized by comprising the steps of:
S1: extracting the speech features of the source voice and the target voice;
S2: applying dynamic time warping to the speech information of the source voice and the target voice;
S3: training a Gaussian mixture model with a clustering algorithm on the warped speech;
S4: extracting the speech information of the source voice, converting it with the data obtained from training, and synthesizing the target voice.
2. The voice conversion method of claim 1, characterized in that:
in step S1, the speech features of the source voice and the target voice are extracted from speech data.
3. The voice conversion method of claim 1, characterized in that step S1 is specifically carried out as follows:
S1.1: using the STRAIGHT model to extract the fundamental frequency f0, the aperiodic component ap, and the smoothed power spectrum sp from the pre-stored speech data;
S1.2: using the SPTK toolkit for dimensionality reduction, converting the smoothed power spectrum sp into the mel-generalized cepstrum mgc, to obtain the source speech matrix X and the target speech matrix Y.
4. The voice conversion method of claim 3, characterized in that in step S1.1, speech data of the source voice and the target voice with the same number of sentences and identical content are pre-stored.
5. The voice conversion method of claim 3 or 4, characterized in that step S2 is specifically realized as follows:
converting the two matrices of unequal length into equal-length x and y with the dynamic time warping algorithm, and joining the two matrices into a single matrix z.
6. The voice conversion method of claim 5, characterized in that step S3 is realized by the following steps:
S3.1: initializing the Gaussian mixture model from the matrix z;
S3.2: running the expectation-maximization algorithm of the Gaussian mixture model on the initialized Gaussian model.
7. The voice conversion method of claim 6, characterized in that step S3.1 is realized by the following steps:
S3.1.1: setting the matrix dimensionality M and the number of cluster components k, and computing the cluster means u_z with the K-means algorithm;
S3.1.2: computing the mixture coefficients from the number of points in each cluster;
S3.1.3: extracting the data matrix C of each cluster mean from z, and computing the covariance matrix Σ_z of that cluster from C.
8. The voice conversion method of claim 7, characterized in that step S3.2 is realized by the following steps:
S3.2.1: first, according to the formula

P(z_j | u_i, \Sigma_i) = \frac{1}{(2\pi)^{M/2} |\Sigma_i|^{1/2}} \exp\left( -\frac{1}{2} (z_j - u_i)^T \Sigma_i^{-1} (z_j - u_i) \right)

where P(z_j | u_i, Σ_i) denotes the prior probability, z_j the joint vector, u_i the mean vector, Σ_i the covariance matrix, and T the matrix transpose; it is provided that the subscript j denotes a row vector, i a column vector, and r the r-th block matrix;
computing the prior probability P(z_j | u_i, Σ_i);
S3.2.2: using Bayes' theorem, by the formula

\lambda(e_{ji}) = \frac{\alpha_i P(z_j | u_i, \Sigma_i)}{\sum_{r=1}^{k} \alpha_r P(z_j | u_r, \Sigma_r)}

where α_i denotes the weight coefficient of each component of the Gaussian mixture model, computing the posterior probability λ(e_ji);
S3.2.3: from the posterior probabilities computed by the above formula, computing the following variables according to the formulas

n_i = \sum_{j=1}^{N} \lambda(e_{ji})

u_i(new) = \frac{1}{n_i} \sum_{j=1}^{N} \lambda(e_{ji}) z_j

\alpha_i(new) = n_i / k

\Sigma_i(new) = \frac{1}{n_i} \sum_{j=1}^{N} \lambda(e_{ji}) (z_j - u_i)(z_j - u_i)^T

where N denotes the number of feature parameters of the training speech; n_i denotes the sum of the posterior probabilities of all feature vectors in the i-th component, u_i(new) the updated mean vector of the i-th component, α_i(new) the updated weight coefficient of the i-th component, and Σ_i(new) the updated covariance matrix of the i-th component;
S3.2.4: iterating the above three steps several times to obtain the final weight coefficients α, the covariance matrix Σ_z, and the cluster mean matrix u_z.
9. The voice conversion method of claim 8, characterized in that step S4 is realized by the following steps:
S4.1: with the u_z and Σ_z obtained after training, according to the formula

u_z = \begin{bmatrix} u_x \\ u_y \end{bmatrix}, \qquad \Sigma_z = \begin{bmatrix} \Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_{yy} \end{bmatrix}

where u_x denotes the feature-parameter mean vector of the source voice, u_y the feature-parameter mean vector of the target voice, Σ_xx the auto-covariance matrix of the source feature parameters, Σ_yy the auto-covariance matrix of the target feature parameters, and Σ_xy and Σ_yx the cross-covariance matrices; obtaining the mean vectors u_x and u_y of the source and target voices, the auto-covariance Σ_xx of the source speaker, and the cross-covariance Σ_xy of the two;
S4.2: selecting the speech information of any source utterance, extracting the smoothed power spectrum sp', the fundamental frequency f0', and the aperiodic component ap' under its STRAIGHT model, obtaining the data matrix x_t, and computing its prior probability P(c_i | x_t) by step S3.2;
S4.3: through the conversion function

F(x_t) = \sum_i P(c_i | x_t) \left[ u_y^{(i)} + \Sigma_{yx}^{(i)} \left( \Sigma_{xx}^{(i)} \right)^{-1} (x_t - u_x^{(i)}) \right]

where x_t denotes the speech feature to be converted, c_i denotes the i-th component of the Gaussian mixture model, and the superscript -1 denotes matrix inversion;
obtaining the mel-generalized cepstrum of the synthesized speech;
S4.4: converting the parameters computed by the above formula into a smoothed power spectrum and, combining the aperiodic component ap' and the fundamental frequency f0', synthesizing the target voice with the STRAIGHT model.
CN201710971228.9A 2017-10-18 2017-10-18 Voice conversion method Active CN107785030B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710971228.9A CN107785030B (en) 2017-10-18 2017-10-18 Voice conversion method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710971228.9A CN107785030B (en) 2017-10-18 2017-10-18 Voice conversion method

Publications (2)

Publication Number Publication Date
CN107785030A (en) 2018-03-09
CN107785030B CN107785030B (en) 2021-04-30

Family

ID=61434640

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710971228.9A Active CN107785030B (en) 2017-10-18 2017-10-18 Voice conversion method

Country Status (1)

Country Link
CN (1) CN107785030B (en)



Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102306492A (en) * 2011-09-09 2012-01-04 中国人民解放军理工大学 Voice conversion method based on convolutive nonnegative matrix factorization
CN103063899A (en) * 2012-12-20 2013-04-24 中国科学院西安光学精密机械研究所 Sensing optical fiber ring and reflective all-optical fiber current transformer
CN104091592A (en) * 2014-07-02 2014-10-08 常州工学院 Voice conversion system based on hidden Gaussian random field
CN105206259A (en) * 2015-11-03 2015-12-30 常州工学院 Voice conversion method
CN106205623A * 2016-06-17 2016-12-07 福建星网视易信息系统有限公司 Sound conversion method and device
CN107103914A * 2017-03-20 2017-08-29 南京邮电大学 High-quality voice conversion method

Non-Patent Citations (10)

* Cited by examiner, † Cited by third party
Title
李健: "基于GMM的汉语语音转换系统研究" (Research on a GMM-based Chinese voice conversion system), China Masters' Theses Full-text Database, Information Science and Technology *
李波: "语音转换的关键技术研究" (Research on key technologies of voice conversion), China Doctoral Dissertations and Masters' Theses Full-text Database (Doctoral), Information Science and Technology *
李清华: "语音转换技术研究及实现" (Research and implementation of voice conversion technology), China Masters' Theses Full-text Database, Information Science and Technology *
杨骋等: "基于简化STRAIGHT模型的语音信号重构" (Speech signal reconstruction based on a simplified STRAIGHT model), Command Information System and Technology *
简志华等: "语声转换技术发展及展望" (Development and prospects of voice conversion technology), Journal of Nanjing University of Posts and Telecommunications (Natural Science Edition) *
袁志明: "基于高斯混合模型和K-均值聚类算法的RBF神经网络实现男女声转换" (Male-female voice conversion with an RBF neural network based on a Gaussian mixture model and the K-means clustering algorithm), Heilongjiang Science and Technology Information *
解伟超: "语音转换中声道谱参数和基频变换算法的研究" (Research on vocal-tract spectrum parameter and fundamental-frequency conversion algorithms in voice conversion), China Masters' Theses Full-text Database, Information Science and Technology *
陈先同: "语音转换中特征参数及其转换方法的研究" (Research on feature parameters and their conversion methods in voice conversion), China Masters' Theses Full-text Database, Information Science and Technology *
马欢: "基于STRAIGHT模型的语音转换的研究" (Research on voice conversion based on the STRAIGHT model), Computer & Telecommunication *
鲁博: "语音转换技术研究" (Research on voice conversion technology), China Masters' Theses Full-text Database, Information Science and Technology *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110097890A * 2019-04-16 2019-08-06 北京搜狗科技发展有限公司 Speech processing method and apparatus, and apparatus for speech processing
CN110097890B (en) * 2019-04-16 2021-11-02 北京搜狗科技发展有限公司 Voice processing method and device for voice processing
CN111108558A (en) * 2019-12-20 2020-05-05 深圳市优必选科技股份有限公司 Voice conversion method and device, computer equipment and computer readable storage medium
CN111108558B (en) * 2019-12-20 2023-08-04 深圳市优必选科技股份有限公司 Voice conversion method, device, computer equipment and computer readable storage medium
CN111564158A (en) * 2020-04-29 2020-08-21 上海紫荆桃李科技有限公司 Configurable sound changing device

Also Published As

Publication number Publication date
CN107785030B (en) 2021-04-30

Similar Documents

Publication Publication Date Title
US11482207B2 (en) Waveform generation using end-to-end text-to-waveform system
Saito et al. One-to-many voice conversion based on tensor representation of speaker space
Toda et al. One-to-many and many-to-one voice conversion based on eigenvoices
CN108461079A Song synthesis method for timbre conversion
CN101178896B Unit selection speech synthesis method based on an acoustic statistical model
CN101833951B Multi-background modeling method for speaker recognition
CN104392718B Robust speech recognition method based on an acoustic model array
JP3412496B2 Speaker adaptation device and speech recognition device
CN102306492B Voice conversion method based on convolutive nonnegative matrix factorization
JP2013205697A Speech synthesizer, speech synthesis method, speech synthesis program and learning device
CN110060701A Many-to-many voice conversion method based on VAWGAN-AC
CN107785030A Voice conversion method
CN104217721B Voice conversion method under asymmetric-corpus conditions based on speaker model alignment
CN107301859A Voice conversion method under non-parallel text conditions based on adaptive Gaussian clustering
CN107333238A Indoor fingerprint rapid positioning method based on support vector regression
CN103280224A Voice conversion method under asymmetric corpus conditions based on an adaptive algorithm
CN110265051A Intelligent scoring and modeling method for sight-singing audio applied to basic music education
CN110047501A Many-to-many voice conversion method based on beta-VAE
CN110085254A Many-to-many voice conversion method based on beta-VAE and i-vector
CN106847248A Chord recognition method based on robust scale contour features and a support vector machine
CN103456302A Emotional speaker recognition method based on emotion GMM model weight synthesis
CN109584893A Many-to-many voice conversion system based on VAE and i-vector under non-parallel text conditions
Chien et al. Evaluation of glottal inverse filtering algorithms using a physiologically based articulatory speech synthesizer
CN103413548A Voice conversion method with joint spectrum modeling based on restricted Boltzmann machines
CN103886859B Voice conversion method based on one-to-many codebook mapping

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant