CN107785030A - Voice conversion method - Google Patents
Voice conversion method
- Publication number
- CN107785030A CN107785030A CN201710971228.9A CN201710971228A CN107785030A CN 107785030 A CN107785030 A CN 107785030A CN 201710971228 A CN201710971228 A CN 201710971228A CN 107785030 A CN107785030 A CN 107785030A
- Authority
- CN
- China
- Prior art keywords
- sound
- source
- matrix
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
Abstract
The present invention provides a voice conversion method comprising the following steps. S1: extract the speech features of the source voice and the target voice from speech data. S2: apply dynamic time warping to the speech information of the source and target voices. S3: train the warped speech with a Gaussian mixture model and a clustering algorithm. S4: extract the speech information of the source voice, convert it with the data obtained from training, and synthesize the target voice. The invention provides an accurate and efficient way to convert the source voice into the target voice: based on the mathematical characteristics of the source and target speakers' speech, the two voices are modeled and processed algorithmically, so that the source speaker's speech is accurately converted into the target speaker's speech.
Description
Technical field
The present invention relates to the field of computational algorithms, and in particular to a voice conversion method.
Background art
At present, after years of research and application at home and abroad, the generally accepted conversion model in the field of voice conversion is the Gaussian mixture model (GMM). Conventionally, the cluster means are randomly initialized, and full covariance matrices are used in training and computation; this clustering approach achieves relatively high precision.
However, random initialization of the cluster means makes the computation highly random, which in effect lengthens the computation time and increases the probability of error under a limited number of iterations. On the other hand, because the covariance matrix after initialization is a full matrix, the most cumbersome part of computing the prior probability is the set of operations on this covariance matrix, which adds a large computational load.
For example, patent document CN107068165A discloses a voice conversion method in which an adaptive GMM with bilinear frequency warping and amplitude scaling is first trained on a parallel corpus to obtain the transfer function required for voice conversion, and that transfer function is then used to perform high-quality conversion. That invention exploits the correlation between the spatial distribution of speech feature parameters and the Gaussian mixture model, replaces the traditional GMM with an adaptive GMM to address the inaccuracy of GMMs when classifying speech feature parameters, and combines the adaptive GMM with bilinear frequency warping and amplitude scaling to build a voice conversion system. However, that patent document neither selects a more effective initial value in the algorithm's search space beforehand, nor speeds up the computation of the prior probability by preprocessing the matrices involved.
Summary of the invention
The purpose of the present invention is to overcome the above deficiencies of the prior art and to provide an accurate and efficient voice conversion method.
This purpose is achieved by the following technical solution. A voice conversion method of the present invention comprises the following steps:
S1: extract the speech features of the source voice and the target voice;
S2: apply dynamic time warping to the speech information of the source and target voices;
S3: train the warped speech with a Gaussian mixture model and a clustering algorithm;
S4: extract the speech information of the source voice, convert it with the data obtained from training, and synthesize the target voice.
Preferably, in step S1, the speech features of the source and target voices are extracted from speech data.
Step S1 is specifically carried out as follows:
S1.1: using the STRAIGHT model, extract the fundamental frequency f0, the aperiodic component ap, and the smooth power spectrum sp from the pre-stored speech data;
S1.2: using the SPTK toolkit for dimensionality reduction, convert the smooth power spectrum sp into mel-generalized cepstra mgc, obtaining the source speech matrix X and the target speech matrix Y.
In step S1.1, the pre-stored speech data consists of recordings of the source and target voices with identical sentence counts and identical content.
Step S2 is specifically realized as follows:
the two matrices of unequal length are converted by the dynamic time warping algorithm into equal-length matrices x and y, and the two matrices are joined into a single matrix z.
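The alignment of step S2 can be sketched in code. The following is a minimal numpy sketch of dynamic time warping that aligns two feature matrices frame by frame and joins them into the matrix z. The Euclidean frame distance and the function name `dtw_align` are illustrative assumptions of this sketch; the patent itself measures distortion between line spectral frequencies.

```python
import numpy as np

def dtw_align(X, Y):
    """Align two feature matrices (frames x dims) with dynamic time warping.

    Returns equal-length x, y and their frame-wise join z = [x | y].
    A minimal sketch; a real system would use a distortion criterion
    on line spectral frequencies rather than raw Euclidean distance.
    """
    n, m = len(X), len(Y)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(X[i - 1] - Y[j - 1])  # local frame distance
            cost[i, j] = d + min(cost[i - 1, j],
                                 cost[i, j - 1],
                                 cost[i - 1, j - 1])
    # backtrack the minimum-distortion path from (n, m) to (1, 1)
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        candidates = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
        i, j = min((c for c in candidates if c[0] >= 1 and c[1] >= 1),
                   key=lambda s: cost[s])
        path.append((i - 1, j - 1))
    path.reverse()
    x = np.array([X[i] for i, _ in path])
    y = np.array([Y[j] for _, j in path])
    z = np.hstack([x, y])  # joint matrix z used for GMM training
    return x, y, z
```

The warped sequences have equal length by construction, so each row of z pairs one source frame with one target frame.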
Step S3 is realized by the following steps:
S3.1: initialize the Gaussian mixture model from the matrix z;
S3.2: run the expectation-maximization algorithm of the Gaussian mixture model on the initialized Gaussian model.
Step S3.1 is realized by the following steps:
S3.1.1: set the matrix dimensionality M and the number of cluster components k, and compute the cluster means with the K-means algorithm;
S3.1.2: compute the mixing coefficients from the number of points in each cluster;
S3.1.3: take the data matrix C of each cluster out of z, and compute that cluster's covariance matrix from C.
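Steps S3.1.1 to S3.1.3 can be sketched as follows. This is a hedged, numpy-only sketch: the helper name `kmeans_init_gmm`, the fixed iteration count, and the small diagonal regularization term are assumptions of this sketch, not the patent's code; a production system would typically use k-means++ seeding from a library.

```python
import numpy as np

def kmeans_init_gmm(z, k, n_iter=20, seed=0):
    """Initialize GMM parameters from the joint matrix z (N frames x M dims).

    Returns cluster means u (S3.1.1), mixing coefficients alpha (S3.1.2),
    and per-cluster covariance matrices sigma (S3.1.3).
    """
    rng = np.random.default_rng(seed)
    u = z[rng.choice(len(z), size=k, replace=False)]  # initial centroids
    for _ in range(n_iter):
        # assign each frame to its nearest centroid
        labels = np.argmin(((z[:, None, :] - u[None]) ** 2).sum(-1), axis=1)
        for i in range(k):
            if np.any(labels == i):
                u[i] = z[labels == i].mean(axis=0)
    # mixing coefficient = share of points in each cluster
    alpha = np.bincount(labels, minlength=k) / len(z)
    # covariance of each cluster's data matrix C, lightly regularized
    sigma = np.stack([np.cov(z[labels == i].T) + 1e-6 * np.eye(z.shape[1])
                      for i in range(k)])
    return u, alpha, sigma
```

With well-separated data the means land near the true cluster centers, giving the EM algorithm a much better starting point than random initialization.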
Step S3.2 is realized by the following steps:
S3.2.1: first, according to the formula
$$P(z_j \mid u_i, \Sigma_i) = \frac{1}{(2\pi)^{M/2}\,|\Sigma_i|^{1/2}} \exp\!\left(-\tfrac{1}{2}(z_j - u_i)^T \Sigma_i^{-1} (z_j - u_i)\right)$$
where P(z_j|u_i, Σ_i) denotes the prior probability, z_j the joint vector, u_i the mean vector, Σ_i the covariance matrix, and T the matrix transpose; in this and the following formulas each variable is a matrix of size M*M*k, so by convention the subscript j indexes row vectors, i indexes column vectors, and r indexes the r-th block matrix;
compute the prior probability P(z_j|u_i, Σ_i);
S3.2.2: using Bayes' theorem, by the formula
$$\lambda(e_{ji}) = \frac{\alpha_i\, P(z_j \mid u_i, \Sigma_i)}{\sum_{r=1}^{k} \alpha_r\, P(z_j \mid u_r, \Sigma_r)}$$
where α_i denotes the weight coefficient of each component of the Gaussian mixture model, compute the posterior probability λ(e_{ji});
S3.2.3: from the posterior probability computed above, calculate the following variables according to the formulas
$$n_i = \sum_{j=1}^{N} \lambda(e_{ji}), \qquad u_i(new) = \frac{1}{n_i}\sum_{j=1}^{N} \lambda(e_{ji})\, z_j$$
$$\alpha_i(new) = n_i/k, \qquad \Sigma_i(new) = \frac{1}{n_i}\sum_{j=1}^{N} \lambda(e_{ji})\,(z_j - u_i)(z_j - u_i)^T$$
where N denotes the number of feature parameters of the training speech, n_i the sum of the posterior probabilities of all feature vectors under the i-th component, u_i(new) the updated mean vector of the i-th component, α_i(new) the updated weight coefficient of the i-th component, and Σ_i(new) the updated covariance matrix of the i-th component.
S3.2.4: iterate the above three steps several times to obtain the final weight coefficients α, covariance matrix Σ_z, and cluster mean matrix u_z.
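One EM pass over steps S3.2.1 to S3.2.3 can be sketched in numpy, using a Cholesky factor of each covariance for the Gaussian density as the description suggests. The function names are assumptions of this sketch, as is the normalization of the mixing weight by the number of frames N (standard EM; the text writes n_i/k).

```python
import numpy as np

def log_gaussian(z, u, sigma):
    """log N(z | u, sigma) via Cholesky: no explicit inverse, and
    log|sigma| comes from the diagonal of the triangular factor."""
    L = np.linalg.cholesky(sigma)
    diff = z - u                              # (N, M)
    w = np.linalg.solve(L, diff.T)            # L w = diff^T
    maha = (w ** 2).sum(axis=0)               # diff^T sigma^{-1} diff
    logdet = 2.0 * np.log(np.diag(L)).sum()
    M = z.shape[1]
    return -0.5 * (maha + logdet + M * np.log(2 * np.pi))

def em_step(z, u, alpha, sigma):
    """One expectation-maximization step over the joint frames z."""
    k = len(alpha)
    logp = np.stack([np.log(alpha[i]) + log_gaussian(z, u[i], sigma[i])
                     for i in range(k)], axis=1)      # (N, k)
    logp -= logp.max(axis=1, keepdims=True)           # numerical stability
    lam = np.exp(logp)
    lam /= lam.sum(axis=1, keepdims=True)             # posterior lambda(e_ji)
    n = lam.sum(axis=0)                               # n_i
    u_new = (lam.T @ z) / n[:, None]                  # mean update
    alpha_new = n / len(z)   # sketch uses n_i/N; the text writes n_i/k
    sigma_new = np.stack([
        ((lam[:, i, None] * (z - u_new[i])).T @ (z - u_new[i])) / n[i]
        + 1e-6 * np.eye(z.shape[1])                   # keep SPD
        for i in range(k)])
    return u_new, alpha_new, sigma_new
```

Iterating `em_step` a fixed number of times (the embodiment uses 20) yields the final α, Σ_z, and u_z.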
Step S4 is realized by the following steps:
S4.1: with the u_z and Σ_z obtained from training, according to the formula
$$u_z = \begin{bmatrix} u_x \\ u_y \end{bmatrix}, \qquad \Sigma_z = \begin{bmatrix} \Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_{yy} \end{bmatrix}$$
where u_x denotes the feature-parameter mean vector of the source voice, u_y that of the target voice, Σ_xx the auto-covariance matrix of the source parameters, Σ_yy the auto-covariance matrix of the target parameters, and Σ_xy and Σ_yx the cross-covariance matrices; obtain the cluster mean vectors u_x and u_y of the source and target voices, the auto-covariance Σ_xx of the source speaker, and the cross-covariance Σ_xy of the two;
S4.2: select the speech information of any source-voice utterance, extract its smooth power spectrum sp', fundamental frequency f0', and aperiodic component ap' under the STRAIGHT model, obtain the data matrix x_t, and compute its prior probability P(c_i|x_t) by step S3.2;
S4.3: through the transfer function
$$F(x_t) = \sum_{i=1}^{k} P(c_i \mid x_t)\left[\, u_y + \Sigma_{yx}\, \Sigma_{xx}^{-1} (x_t - u_x) \right]$$
where x_t denotes the speech feature to be converted, c_i the i-th component of the Gaussian mixture model, and "-1" the matrix inverse, obtain the mel-generalized cepstrum parameters of the synthesized speech;
S4.4: convert the parameters computed by the above formula into a smooth power spectrum and, together with the aperiodic component ap' and fundamental frequency f0', synthesize the target voice with the STRAIGHT model.
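Per mixture component, the transfer function of step S4.3 is a Gaussian-regression term built from the blocks of the joint mean and covariance. A hedged numpy sketch follows; the names `gaussian_pdf` and `convert_frame` and the block-splitting convention (first dx dimensions are the source part) are assumptions of this sketch, not the patent's code.

```python
import numpy as np

def gaussian_pdf(x, u, s):
    """Density of N(u, s) at x, for the component posterior P(c_i | x_t)."""
    d = x - u
    return np.exp(-0.5 * d @ np.linalg.solve(s, d)) / np.sqrt(
        np.linalg.det(2 * np.pi * s))

def convert_frame(x_t, alpha, u_z, sigma_z, dx):
    """Map one source frame x_t to the target space with the trained joint GMM.

    u_z, sigma_z hold the joint means/covariances per component; dx is the
    source dimension, so u_x = u_z[:, :dx], u_y = u_z[:, dx:], and sigma_z
    splits into the Sigma_xx / Sigma_yx blocks used in step S4.3.
    """
    k = len(alpha)
    u_x, u_y = u_z[:, :dx], u_z[:, dx:]
    # posterior P(c_i | x_t) from the source marginals
    w = np.array([alpha[i] * gaussian_pdf(x_t, u_x[i], sigma_z[i, :dx, :dx])
                  for i in range(k)])
    w /= w.sum()
    y = np.zeros(u_y.shape[1])
    for i in range(k):
        sxx = sigma_z[i, :dx, :dx]            # Sigma_xx
        syx = sigma_z[i, dx:, :dx]            # Sigma_yx
        # regression term u_y + Sigma_yx Sigma_xx^{-1} (x_t - u_x)
        y += w[i] * (u_y[i] + syx @ np.linalg.solve(sxx, x_t - u_x[i]))
    return y
```

With a single component whose joint covariance encodes y = 2x, the mapping reproduces that linear relation exactly, which is a convenient sanity check.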
Beneficial effects
Compared with the prior art, the present invention has the following advantages:
1. In the first step of initializing the Gaussian mixture model, the K-means algorithm is adopted because it is easy to implement, converges quickly, and runs fast on large data sets. It selects an initial value more effective than random initialization, thereby shrinking the search space of the expectation-maximization algorithm and improving its speed and precision.
2. Because speech data follow a Gaussian distribution, when computing the prior probability the covariance matrix can first be factored by a Cholesky decomposition and the computation carried out on the triangular factor, which greatly increases the computation speed.
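The second claimed speed-up can be checked numerically: for Σ = L Lᵀ (Cholesky), log|Σ| is twice the sum of the logs of diag(L), and the Mahalanobis quadratic form needs only a solve against the factor rather than an explicit inverse. A small numpy verification follows; the random SPD matrix is an illustrative stand-in for a real covariance.

```python
import numpy as np

# Build a random symmetric positive-definite "covariance" matrix.
rng = np.random.default_rng(0)
A = rng.normal(size=(6, 6))
sigma = A @ A.T + 6 * np.eye(6)
L = np.linalg.cholesky(sigma)

# Determinant term of the Gaussian density from the factor's diagonal.
log_det_direct = np.log(np.linalg.det(sigma))
log_det_chol = 2.0 * np.log(np.diag(L)).sum()
assert np.isclose(log_det_direct, log_det_chol)

# Quadratic form d^T sigma^{-1} d via one solve against the factor.
d = rng.normal(size=6)
quad_direct = d @ np.linalg.inv(sigma) @ d
w = np.linalg.solve(L, d)
assert np.isclose(quad_direct, w @ w)
```

Avoiding `inv` and `det` in the inner EM loop is what makes the per-frame density evaluation cheap.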
Brief description of the drawings
The invention will be further described below with reference to the accompanying drawings.
Fig. 1 is the basic flow chart of the voice conversion method of embodiment one of the present invention;
Fig. 2 is the training flow chart of the voice conversion method of embodiment one of the present invention;
Fig. 3 is the conversion and synthesis diagram of the voice conversion method of embodiment one of the present invention.
Embodiments
The technical solution of the present invention is further described below through specific embodiments and with reference to the accompanying drawings, but the present invention is not limited to these embodiments.
The present invention provides a voice conversion method whose technical principle is as follows:
Given sufficient speech data, the method provides an accurate and efficient way to convert the sound of the source voice into the sound of the target voice. Based on the mathematical characteristics of the source and target voices, the two voices are modeled and processed algorithmically, so that the sound of the source voice is accurately converted into the sound of the target voice.
Embodiment one
This embodiment provides a computational algorithm specifically for use in a voice conversion method. In this embodiment, given sufficient speech data, the algorithm models the source and target voices according to their mathematical characteristics, performs the algorithmic computation, and accurately converts the sound of the source voice into the sound of the target voice.
As shown in Fig. 1, the voice conversion method comprises the following steps:
S1: extract the speech features of the source and target voices. The speech data comprises recordings of the source and target voices with identical content and identical sentence counts, each of more than 100 sentences (subject matter unrestricted), as well as the mathematical characteristics of the two voices.
S2: apply dynamic time warping to the speech information of the source and target voices. Step S2 comprises: converting the two matrices of unequal length into equal-length matrices X and Y with the dynamic time warping (DTW) algorithm, so that the corresponding line spectral frequencies of source and target have minimum distortion distance under the chosen distortion criterion; associating the feature sequences of the source and target voices at the parameter level; and joining the two matrices into one matrix z.
S3: train the warped speech with a Gaussian mixture model and a clustering algorithm. Step S3 comprises: initializing the Gaussian mixture model from the matrix z obtained in the previous step, and running the expectation-maximization algorithm of the Gaussian mixture model on the initialized model, so that the models of the source and target voices fit each other.
S4: extract the speech information of the source voice, convert it with the data obtained from training, and synthesize the target voice.
Specific steps:
1. Extract the speech features of the source and target voices
(1) Prepare speech data of the two voices with identical content and identical sentence counts, each of more than 100 sentences (subject matter unrestricted), and use the STRAIGHT model to extract the fundamental frequency f0, the aperiodic component ap, and the smooth power spectrum sp from each.
(2) Convert the smooth power spectrum sp into mel-generalized cepstra mgc using the SPTK toolkit for dimensionality reduction, obtaining the source speech matrix X and the target speech matrix Y.
2. Apply dynamic time warping to the two speech matrices
Convert the two matrices of unequal length into equal-length matrices x and y with the dynamic time warping (DTW) algorithm, so that the corresponding line spectral frequencies of source and target have minimum distortion distance under the chosen distortion criterion; associate the feature sequences of source and target at the parameter level; and join the two matrices into one matrix z.
3. Train the warped speech with the Gaussian mixture model and the clustering algorithm. As shown in Fig. 2, this step has two parts:
(1) Initialize the Gaussian mixture model from the matrix z obtained in the previous step:
Step 1: set the matrix dimensionality M and the number of cluster components k, and compute the cluster means u_z with the K-means algorithm.
Step 2: compute the mixing coefficients from the number of points in each cluster.
Step 3: take the data matrix C of each cluster out of the matrix z, and compute that cluster's covariance matrix Σ_z from C.
(2) Run the expectation-maximization algorithm of the Gaussian mixture model on the initialized model, so that the models of the source and target speakers fit each other:
Step 1: first, according to the formula
$$P(z_j \mid u_i, \Sigma_i) = \frac{1}{(2\pi)^{M/2}\,|\Sigma_i|^{1/2}} \exp\!\left(-\tfrac{1}{2}(z_j - u_i)^T \Sigma_i^{-1} (z_j - u_i)\right)$$
where P(z_j|u_i, Σ_i) denotes the prior probability, z_j the joint vector, u_i the mean vector, Σ_i the covariance matrix, and T the matrix transpose; by convention the subscript j indexes row vectors, i indexes column vectors, and r indexes the r-th block matrix;
compute the prior probability P(z_j|u_i, Σ_i). Note that in this computation the covariance matrix is factored by a Cholesky decomposition, which improves the computation speed while preserving precision.
Step 2: using Bayes' theorem, by the formula
$$\lambda(e_{ji}) = \frac{\alpha_i\, P(z_j \mid u_i, \Sigma_i)}{\sum_{r=1}^{k} \alpha_r\, P(z_j \mid u_r, \Sigma_r)}$$
where α_i denotes the weight coefficient of each component of the Gaussian mixture model, compute the posterior probability λ(e_{ji}).
Step 3: from the posterior probability computed above, calculate the following variables according to the formulas
$$n_i = \sum_{j=1}^{N} \lambda(e_{ji}), \qquad u_i(new) = \frac{1}{n_i}\sum_{j=1}^{N} \lambda(e_{ji})\, z_j$$
$$\alpha_i(new) = n_i/k, \qquad \Sigma_i(new) = \frac{1}{n_i}\sum_{j=1}^{N} \lambda(e_{ji})\,(z_j - u_i)(z_j - u_i)^T$$
where N denotes the number of feature parameters of the training speech, n_i the sum of the posterior probabilities of all feature vectors under the i-th component, u_i(new) the updated mean vector of the i-th component, α_i(new) the updated weight coefficient of the i-th component, and Σ_i(new) the updated covariance matrix of the i-th component.
Step 4: iterate the above three steps 20 times to obtain the final weight coefficients α, covariance matrix Σ_z, and cluster mean matrix u_z.
4. Conversion and synthesis stage
Step 1: with the u_z and Σ_z obtained from training, according to the formula
$$u_z = \begin{bmatrix} u_x \\ u_y \end{bmatrix}, \qquad \Sigma_z = \begin{bmatrix} \Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_{yy} \end{bmatrix}$$
where u_x denotes the feature-parameter mean vector of the source voice, u_y that of the target voice, Σ_xx the auto-covariance matrix of the source parameters, Σ_yy the auto-covariance matrix of the target parameters, and Σ_xy and Σ_yx the cross-covariance matrices; obtain the mean vectors u_x and u_y of the source and target voices, the auto-covariance Σ_xx of the source speaker, and the cross-covariance Σ_xy of the two;
Step 2: select the speech information of any source-speaker utterance, extract its smooth power spectrum sp', fundamental frequency f0', and aperiodic component ap' under the STRAIGHT model, obtain the data matrix x_t by the method described above, and compute its prior probability P(c_i|x_t).
Step 3: through the transfer function
$$F(x_t) = \sum_{i=1}^{k} P(c_i \mid x_t)\left[\, u_y + \Sigma_{yx}\, \Sigma_{xx}^{-1} (x_t - u_x) \right]$$
where x_t denotes the speech feature to be converted, c_i the i-th component of the Gaussian mixture model, and "-1" the matrix inverse, obtain the mel-generalized cepstrum parameters of the synthesized speech.
Step 4: convert the parameters computed by the above formula into a smooth power spectrum and, together with the aperiodic component ap' and fundamental frequency f0' above, synthesize the target speech with the STRAIGHT model.
In the first step of initializing the Gaussian mixture model, the K-means algorithm is adopted because it is easy to implement, converges quickly, and runs fast on large data sets; it selects an initial value more effective than random initialization, thereby shrinking the search space of the expectation-maximization algorithm and improving its speed and precision.
On the other hand, because speech data follow a Gaussian distribution, when computing the prior probability the covariance matrix can first be factored by a Cholesky decomposition and the computation carried out on the triangular factor, which greatly increases the computation speed.
The present invention can model the speech of the source and target speakers according to its mathematical characteristics and perform the algorithmic computation, so that the source speaker's speech is accurately converted into the target speaker's speech. The algorithm provided by the present invention improves the accuracy of voice conversion while reducing the amount of computation.
The general principles, principal features, and advantages of the present invention have been shown and described above. Those skilled in the art should understand that the present invention is not limited to the above embodiments; the above embodiments and description merely illustrate the principles of the invention. Various changes and improvements may be made without departing from the spirit and scope of the invention, and all such changes and improvements fall within the scope of the claimed invention, which is defined by the appended claims and their equivalents. The specific embodiments described herein are merely illustrative of the spirit of the invention; those skilled in the art may make various modifications or supplements to the described embodiments, or substitute similar means, without departing from the spirit of the invention or exceeding the scope defined by the appended claims.
Claims (9)
1. A voice conversion method, characterized by comprising the steps of:
S1: extracting the speech features of the source voice and the target voice;
S2: applying dynamic time warping to the speech information of the source and target voices;
S3: training the warped speech with a Gaussian mixture model and a clustering algorithm;
S4: extracting the speech information of the source voice, converting it with the data obtained from training, and synthesizing the target voice.
2. The voice conversion method of claim 1, characterized in that:
in step S1, the speech features of the source and target voices are extracted from speech data.
3. The voice conversion method of claim 1, characterized in that step S1 is specifically carried out as follows:
S1.1: using the STRAIGHT model, extracting the fundamental frequency f0, the aperiodic component ap, and the smooth power spectrum sp from the pre-stored speech data;
S1.2: using the SPTK toolkit for dimensionality reduction, converting the smooth power spectrum sp into mel-generalized cepstra mgc, obtaining the source speech matrix X and the target speech matrix Y.
4. The voice conversion method of claim 3, characterized in that in step S1.1, speech data of the source and target voices with identical sentence counts and identical content are pre-stored.
5. The voice conversion method of claim 3 or 4, characterized in that step S2 is specifically realized as follows:
the two matrices of unequal length are converted by the dynamic time warping algorithm into equal-length matrices x and y, and the two matrices are joined into one matrix z.
6. The voice conversion method of claim 5, characterized in that step S3 is realized by the following steps:
S3.1: initializing the Gaussian mixture model from the matrix z;
S3.2: running the expectation-maximization algorithm of the Gaussian mixture model on the initialized Gaussian model.
7. The voice conversion method of claim 6, characterized in that step S3.1 is realized by the following steps:
S3.1.1: setting the matrix dimensionality M and the number of cluster components k, and computing the cluster means u_z with the K-means algorithm;
S3.1.2: computing the mixing coefficients from the number of points in each cluster;
S3.1.3: taking the data matrix C of each cluster out of z, and computing that cluster's covariance matrix Σ_z from C.
8. The voice conversion method of claim 7, characterized in that step S3.2 is realized by the following steps:
S3.2.1: first, according to the formula
$$P(z_j \mid u_i, \Sigma_i) = \frac{1}{(2\pi)^{M/2}\,|\Sigma_i|^{1/2}} \exp\!\left(-\tfrac{1}{2}(z_j - u_i)^T \Sigma_i^{-1} (z_j - u_i)\right)$$
where P(z_j|u_i, Σ_i) denotes the prior probability, z_j the joint vector, u_i the mean vector, Σ_i the covariance matrix, and T the matrix transpose, and where by convention the subscript j indexes row vectors, i indexes column vectors, and r indexes the r-th block matrix;
computing the prior probability P(z_j|u_i, Σ_i);
S3.2.2: using Bayes' theorem, by the formula
$$\lambda(e_{ji}) = \frac{\alpha_i\, P(z_j \mid u_i, \Sigma_i)}{\sum_{r=1}^{k} \alpha_r\, P(z_j \mid u_r, \Sigma_r)}$$
where α_i denotes the weight coefficient of each component of the Gaussian mixture model, computing the posterior probability λ(e_{ji});
S3.2.3: from the posterior probability computed above, calculating the following variables according to the formulas
$$n_i = \sum_{j=1}^{N} \lambda(e_{ji})$$
$$u_i(new) = \frac{1}{n_i}\sum_{j=1}^{N} \lambda(e_{ji})\, z_j$$
$$\alpha_i(new) = n_i/k$$
$$\Sigma_i(new) = \frac{1}{n_i}\sum_{j=1}^{N} \lambda(e_{ji})\,(z_j - u_i)(z_j - u_i)^T$$
where N denotes the number of feature parameters of the training speech, n_i the sum of the posterior probabilities of all feature vectors under the i-th component, u_i(new) the updated mean vector of the i-th component, α_i(new) the updated weight coefficient of the i-th component, and Σ_i(new) the updated covariance matrix of the i-th component;
S3.2.4: iterating the above three steps several times to obtain the final weight coefficients α, covariance matrix Σ_z, and cluster mean matrix u_z.
9. The voice conversion method of claim 8, characterized in that step S4 is realized by the following steps:
S4.1: with the u_z and Σ_z obtained from training, according to the formula
$$u_z = \begin{bmatrix} u_x \\ u_y \end{bmatrix}, \qquad \Sigma_z = \begin{bmatrix} \Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_{yy} \end{bmatrix}$$
where u_x denotes the feature-parameter mean vector of the source voice, u_y that of the target voice, Σ_xx the auto-covariance matrix of the source parameters, Σ_yy the auto-covariance matrix of the target parameters, and Σ_xy and Σ_yx the cross-covariance matrices; obtaining the mean vectors u_x and u_y of the source and target voices, the auto-covariance Σ_xx of the source speaker, and the cross-covariance Σ_xy of the two;
S4.2: selecting the speech information of any source voice, extracting its smooth power spectrum sp', fundamental frequency f0', and aperiodic component ap' under the STRAIGHT model, obtaining the data matrix x_t, and computing its prior probability P(c_i|x_t) by step S3.2;
S4.3: through the transfer function
$$F(x_t) = \sum_{i=1}^{k} P(c_i \mid x_t)\left[\, u_y + \Sigma_{yx}\, \Sigma_{xx}^{-1} (x_t - u_x) \right]$$
where x_t denotes the speech feature to be converted, c_i the i-th component of the Gaussian mixture model, and "-1" the matrix inverse,
obtaining the mel-generalized cepstrum parameters of the synthesized speech;
S4.4: converting the parameters computed by the above formula into a smooth power spectrum and, together with the aperiodic component ap' and fundamental frequency f0', synthesizing the target voice with the STRAIGHT model.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710971228.9A CN107785030B (en) | 2017-10-18 | 2017-10-18 | Voice conversion method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107785030A true CN107785030A (en) | 2018-03-09 |
CN107785030B CN107785030B (en) | 2021-04-30 |
Family
ID=61434640
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710971228.9A Active CN107785030B (en) | 2017-10-18 | 2017-10-18 | Voice conversion method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107785030B (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110097890A (en) * | 2019-04-16 | 2019-08-06 | 北京搜狗科技发展有限公司 | A kind of method of speech processing, device and the device for speech processes |
CN111108558A (en) * | 2019-12-20 | 2020-05-05 | 深圳市优必选科技股份有限公司 | Voice conversion method and device, computer equipment and computer readable storage medium |
CN111564158A (en) * | 2020-04-29 | 2020-08-21 | 上海紫荆桃李科技有限公司 | Configurable sound changing device |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102306492A (en) * | 2011-09-09 | 2012-01-04 | 中国人民解放军理工大学 | Voice conversion method based on convolutive nonnegative matrix factorization |
CN103063899A (en) * | 2012-12-20 | 2013-04-24 | 中国科学院西安光学精密机械研究所 | Sensing optical fiber ring and reflective all-optical fiber current transformer |
CN104091592A (en) * | 2014-07-02 | 2014-10-08 | 常州工学院 | Voice conversion system based on hidden Gaussian random field |
CN105206259A (en) * | 2015-11-03 | 2015-12-30 | 常州工学院 | Voice conversion method |
CN106205623A (en) * | 2016-06-17 | 2016-12-07 | 福建星网视易信息系统有限公司 | A kind of sound converting method and device |
CN107103914A (en) * | 2017-03-20 | 2017-08-29 | 南京邮电大学 | A kind of high-quality phonetics transfer method |
- 2017-10-18: application CN201710971228.9A filed in China; granted as patent CN107785030B (status: active)
Non-Patent Citations (10)
Title |
---|
Li Jian: "Research on a GMM-based Chinese Voice Conversion System", China Master's Theses Full-text Database, Information Science and Technology Series *
Li Bo: "Research on Key Technologies of Voice Conversion", China Doctoral Dissertations Full-text Database, Information Science and Technology Series *
Li Qinghua: "Research and Implementation of Voice Conversion Technology", China Master's Theses Full-text Database, Information Science and Technology Series *
Yang Cheng et al.: "Speech Signal Reconstruction Based on a Simplified STRAIGHT Model", Command Information System and Technology *
Jian Zhihua et al.: "Development and Prospects of Voice Conversion Technology", Journal of Nanjing University of Posts and Telecommunications (Natural Science Edition) *
Yuan Zhiming: "Implementing Male-to-Female Voice Conversion with an RBF Neural Network Based on a Gaussian Mixture Model and K-Means Clustering", Heilongjiang Science and Technology Information *
Xie Weichao: "Research on Vocal-Tract Spectrum Parameter and Fundamental Frequency Conversion Algorithms in Voice Conversion", China Master's Theses Full-text Database, Information Science and Technology Series *
Chen Xiantong: "Research on Feature Parameters and Their Conversion Methods in Voice Conversion", China Master's Theses Full-text Database, Information Science and Technology Series *
Ma Huan: "Research on Voice Conversion Based on the STRAIGHT Model", Computer & Telecommunication *
Lu Bo: "Research on Voice Conversion Technology", China Master's Theses Full-text Database, Information Science and Technology Series *
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110097890A (en) * | 2019-04-16 | 2019-08-06 | 北京搜狗科技发展有限公司 | A speech processing method and apparatus, and a device for speech processing |
CN110097890B (en) * | 2019-04-16 | 2021-11-02 | 北京搜狗科技发展有限公司 | Voice processing method and device for voice processing |
CN111108558A (en) * | 2019-12-20 | 2020-05-05 | 深圳市优必选科技股份有限公司 | Voice conversion method and device, computer equipment and computer readable storage medium |
CN111108558B (en) * | 2019-12-20 | 2023-08-04 | 深圳市优必选科技股份有限公司 | Voice conversion method, device, computer equipment and computer readable storage medium |
CN111564158A (en) * | 2020-04-29 | 2020-08-21 | 上海紫荆桃李科技有限公司 | Configurable sound changing device |
Also Published As
Publication number | Publication date |
---|---|
CN107785030B (en) | 2021-04-30 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11482207B2 (en) | Waveform generation using end-to-end text-to-waveform system | |
Saito et al. | One-to-many voice conversion based on tensor representation of speaker space | |
Toda et al. | One-to-many and many-to-one voice conversion based on eigenvoices | |
CN108461079A (en) | A song synthesis method oriented to timbre conversion | |
CN101178896B (en) | Unit-selection speech synthesis method based on acoustic statistical models | |
CN101833951B (en) | Multi-background modeling method for speaker recognition | |
CN104392718B (en) | A robust speech recognition method based on an acoustic model array | |
JP3412496B2 (en) | Speaker adaptation device and speech recognition device | |
CN102306492B (en) | Voice conversion method based on convolutive nonnegative matrix factorization | |
JP2013205697A (en) | Speech synthesizer, speech synthesis method, speech synthesis program and learning device | |
CN110060701A (en) | Many-to-many voice conversion method based on VAWGAN-AC | |
CN107785030A (en) | A kind of phonetics transfer method | |
CN104217721B (en) | Voice conversion method under asymmetric corpus conditions based on speaker model alignment | |
CN107301859A (en) | Voice conversion method under non-parallel text conditions based on adaptive Gaussian clustering | |
CN107333238A (en) | A rapid indoor fingerprint positioning method based on support vector regression | |
CN103280224A (en) | Voice conversion method under asymmetric corpus conditions based on an adaptive algorithm | |
CN110265051A (en) | An intelligent scoring modeling method for sight-singing audio applied to sight-singing education | |
CN110047501A (en) | Many-to-many voice conversion method based on beta-VAE | |
CN110085254A (en) | Many-to-many voice conversion method based on beta-VAE and i-vector | |
CN106847248A (en) | Chord recognition method based on robust scale contour features and support vector machines | |
CN103456302A (en) | Emotional speaker recognition method based on emotion GMM model weight synthesis | |
CN109584893A (en) | Many-to-many voice conversion system based on VAE and i-vector under non-parallel text conditions | |
Chien et al. | Evaluation of glottal inverse filtering algorithms using a physiologically based articulatory speech synthesizer | |
CN103413548A (en) | Voice conversion method with joint spectral modeling based on restricted Boltzmann machines | |
CN103886859B (en) | Voice conversion method based on one-to-many codebook mapping |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||