CN107301859B - Voice conversion method under non-parallel text condition based on self-adaptive Gaussian clustering - Google Patents
Voice conversion method under non-parallel text condition based on self-adaptive Gaussian clustering
- Publication number
- CN107301859B (application CN201710474281.8A)
- Authority
- CN
- China
- Prior art keywords
- speech
- training
- gaussian
- voice
- conversion
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/065—Adaptation
- G10L15/07—Adaptation to the speaker
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/14—Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/02—Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/02—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders
- G10L19/032—Quantisation or dequantisation of spectral components
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/003—Changing voice quality, e.g. pitch or formants
- G10L21/007—Changing voice quality, e.g. pitch or formants characterised by the process used
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
Abstract
The invention discloses a voice conversion method under the non-parallel text condition based on self-adaptive Gaussian clustering, belonging to the technical field of voice signal processing. First, a method combining unit selection with vocal tract length normalization is used to align the voice characteristic parameters of the non-parallel corpora; then a self-adaptive Gaussian mixture model and bilinear frequency warping plus amplitude adjustment are trained to obtain the conversion function required for voice conversion; finally, the conversion function is used to realize high-quality voice conversion. The invention overcomes the limitation of requiring parallel corpora in the training stage and realizes voice conversion under the non-parallel text condition, giving it stronger applicability and universality. Moreover, by replacing the traditional Gaussian mixture model with the self-adaptive Gaussian mixture model, it solves the inaccuracy of the Gaussian mixture model when classifying voice characteristic parameters; combined with bilinear frequency warping and amplitude adjustment, the converted speech has better individual similarity and voice quality.
Description
Technical Field
The invention relates to a voice conversion technology, in particular to a voice conversion method under a non-parallel text condition, and belongs to the technical field of voice signal processing.
Background
Speech conversion is an emerging branch of research in the field of speech signal processing in recent years, and is carried out and developed on the basis of research on speech analysis, recognition and synthesis.
The goal of speech conversion is to change the speech personality characteristics of the source speaker into those of the target speaker, i.e., to make speech uttered by one person sound, after conversion, as if uttered by another person, while preserving the semantic content.
Most speech conversion methods, especially the GMM-based ones, require that the corpus used for training be parallel text, i.e. the source and target speakers must utter sentences with the same speech content and duration, with pronunciation rhythm and emotion kept as consistent as possible. However, in practical applications of voice conversion it is very difficult, even impossible, to obtain large amounts of parallel corpora; in addition, the accuracy of the alignment of the speech characteristic parameter vectors during training also constrains the performance of the voice conversion system. Considering the universality and practicability of voice conversion systems, research on voice conversion methods under the non-parallel text condition has great practical significance and application value.
At present, there are mainly two methods for voice conversion under the non-parallel text condition: methods based on speech clustering and methods based on parameter adaptation. Speech-clustering methods select corresponding speech units for conversion by measuring the distance between speech frames or under the guidance of phoneme information; in essence, they convert the non-parallel text into parallel text under certain conditions. Such methods are simple in principle, but the content of the speech text must be pre-extracted, and the pre-extraction result directly affects the conversion quality. Parameter-adaptation methods process the parameters of the conversion model using speaker normalization or adaptation techniques from speech recognition; in essence, they convert a pre-established model into a model based on the target speaker. Such methods make reasonable use of pre-stored speaker information, but the adaptation process generally causes over-smoothing of the frequency spectrum, weakening the speaker's personality information in the converted voice.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: the voice conversion method can adaptively determine the GMM mixing degree according to the difference of target speakers under the condition of non-parallel texts, thereby enhancing the individual characteristics of the speakers in the converted voice and improving the quality of the converted voice.
The invention adopts the following technical scheme for solving the technical problems:
the invention provides a voice conversion method under a non-parallel text condition based on self-adaptive Gaussian clustering, which comprises a training stage and a conversion stage, wherein the training stage comprises the following steps:
step 1, inputting non-parallel training corpora of a source speaker and a target speaker;
step 2, respectively extracting, by using the AHOcoder speech analysis model, the MFCC characteristic parameters X of the non-parallel training corpus of the source speaker, the MFCC characteristic parameters Y of the non-parallel training corpus of the target speaker, the source speech fundamental frequency log f0X, and the target speech fundamental frequency log f0Y;
Step 3, performing speech characteristic parameter alignment and dynamic time warping combining unit selection and vocal tract length normalization on the MFCC characteristic parameters X, Y in the step 2, thereby converting the non-parallel linguistic data into parallel linguistic data;
step 4, performing adaptive Gaussian mixture model AGMM training by using an expectation maximization EM algorithm, finishing the AGMM training to obtain a posterior conditional probability matrix P (X | lambda), and storing an AGMM parameter lambda;
step 5, using the source speech characteristic parameters X and the target speech characteristic parameters Y obtained in step 3, and the posterior conditional probability matrix P(X|λ) from step 4, carrying out bilinear frequency warping (BLFW) plus amplitude adjustment (AS) training to obtain the frequency warping factor α(x,λ) and the amplitude adjustment factor s(x,λ), so as to construct the BLFW+AS conversion function; and using the mean and variance of the logarithmic fundamental frequency to establish the fundamental frequency conversion function between the source speech fundamental frequency log f0X and the target speech fundamental frequency log f0Y;
the conversion phase comprises the following steps:
step 6, inputting the voice of a source speaker to be converted;
step 7, extracting the MFCC characteristic parameters X′ and the logarithmic fundamental frequency log f0X′ of the source speaker voice by using the AHOcoder speech analysis model;
step 8, using the parameter λ obtained from the AGMM training in step 4 to obtain the posterior conditional probability matrix P′(X|λ);
step 9, obtaining the converted MFCC characteristic parameters Y′ by using the BLFW+AS conversion function obtained in step 5;
step 10, using the fundamental frequency conversion function obtained in step 5 to convert the logarithmic fundamental frequency log f0X′ into the converted logarithmic fundamental frequency log f0Y′;
step 11, synthesizing the converted speech from the converted MFCC characteristic parameters Y′ and the converted logarithmic fundamental frequency log f0Y′ by using the AHOdecoder speech synthesis model.
Further, the voice conversion method provided by the present invention comprises the following specific steps in step 3:
3-1) carrying out sound channel length normalization processing on the source speech MFCC characteristic parameters by adopting a bilinear frequency warping method;
3-2) for given N source speech MFCC characteristic parameter vectors {Xk}, dynamically searching N target speech characteristic parameter vectors {Yk} through formula (1) such that the distance cost function value C({Yk}) is minimized;
C({Yk})=C1({Yk})+C2({Yk}) (1)
wherein C1({Yk}) and C2({Yk}) are respectively represented by the following formulae:

C1({Yk}) = γ Σk=1..N D(Xk, Yk)  (2)

C2({Yk}) = (1 − γ) Σk=2..N D(Yk−1, Yk)  (3)

wherein the function D(Xk, Yk) represents the spectral distance between the source speech and target speech characteristic parameter vectors; the parameter γ represents the balance coefficient between the accuracy of characteristic parameter frame alignment and the continuity between frames, with 0 ≤ γ ≤ 1; C1({Yk}) represents the spectral distance cost function between the source speech characteristic parameter vectors and the target speech characteristic parameter vectors, and C2({Yk}) represents the spectral distance cost function between the unit-selected target speech characteristic parameter vectors;
3-3) obtaining the target speech characteristic parameter sequence set {Ŷk} aligned with the source speech characteristic parameter vectors by performing multiple linear regression analysis on formula (1), namely:

{Ŷk} = argmin{Yk} C({Yk})  (4)

through the above steps, the non-parallel MFCC characteristic parameters X, Y are transformed into parallel corpora.
Further, in the speech conversion method proposed by the present invention, the solution of formula (4) adopts the Viterbi search method to optimize the execution efficiency of the algorithm.
Further, the training process of step 4 of the speech conversion method provided by the present invention is as follows:
4-1) setting the AGMM initial mixture number M, the Gaussian component weight coefficient thresholds t1 and t2, the Euclidean distance threshold D between characteristic parameter vectors, and the covariance threshold σ;
4-2) obtaining the initial values for EM training by using the K-means iterative algorithm;
4-3) carrying out iterative training by using the EM algorithm; the Gaussian mixture model GMM is expressed as follows:

P(X|λ) = Σi=1..M P(wi) N(X; μi, Σi)  (5)

wherein X is a P-dimensional speech characteristic parameter vector, with P = 39; P(wi) represents the weight coefficient of each Gaussian component, satisfying Σi=1..M P(wi) = 1; M is the number of Gaussian components; N(X; μi, Σi) is the P-dimensional joint Gaussian probability distribution of a Gaussian component, expressed as follows:

N(X; μi, Σi) = (2π)^(−P/2) |Σi|^(−1/2) exp(−(1/2)(X − μi)T Σi^(−1)(X − μi))  (6)

wherein μi is the mean vector and Σi is the covariance matrix; λ = {P(wi), μi, Σi} is the model parameter set of the GMM model, and the estimation of λ is realized by the maximum likelihood estimation method; for the speech characteristic parameter vector set X = {Xn, n = 1,2,...,N}:

P(X|λ) = Πn=1..N P(Xn|λ)  (7)

at this time:

λ = argmaxλ P(X|λ)  (8)

formula (8) is solved using the EM algorithm, the iteration condition satisfying P(X|λk) ≥ P(X|λk−1) during the EM calculation process, where k is the iteration number, until the model parameters λ converge; the iterative formulas for the weight coefficient P(wi), mean vector μi and covariance matrix Σi of the Gaussian components are as follows:

P(wi) = (1/N) Σn=1..N p(wi|Xn, λ)  (9)

μi = Σn=1..N p(wi|Xn, λ) Xn / Σn=1..N p(wi|Xn, λ)  (10)

Σi = Σn=1..N p(wi|Xn, λ)(Xn − μi)(Xn − μi)T / Σn=1..N p(wi|Xn, λ)  (11)
4-4) if some Gaussian component N(P(wi), μi, Σi) in the trained model has a weight coefficient less than t1, and the Euclidean distance between it and its nearest neighbour component N(P(wj), μj, Σj) is less than the threshold D, the two components are merged;
at this moment the number of Gaussian components becomes M−1 and the process returns to step 4-3) for the next round of training; if several Gaussian components satisfy the merging condition, the pair with the smallest distance is selected for merging;
4-5) if some Gaussian component N(P(wi), μi, Σi) in the trained model has a weight coefficient greater than t2, and the variance of at least one dimension in its covariance matrix is greater than σ, the Gaussian component is considered to contain too much information and is split;
wherein E is an all-ones column vector and n is used to adjust the Gaussian distribution; after splitting, the number of Gaussian components becomes M+1; if several Gaussian components satisfy the splitting condition, the component with the largest weight coefficient is selected for splitting, and the process returns to step 4-3) for the next round of training;
4-6) finishing the AGMM training to obtain a posterior conditional probability matrix P (X | lambda), and storing the lambda.
Further, the BLFW + AS conversion function constructed in step 5 of the voice conversion method proposed by the present invention is expressed AS follows:
F(x)=Wα(x,λ)x+s(x,λ) (15)
wherein M is the number of Gaussian components of the Gaussian mixture model in step 4, α(x,λ) represents the frequency warping factor, and s(x,λ) represents the amplitude adjustment factor.
Further, the speech conversion method provided by the present invention establishes in step 5 the conversion relationship between the source speech pitch frequency and the target speech pitch frequency:

log f0Y′ = μY + (σY/σX)(log f0X′ − μX)  (19)

wherein μ and σ² denote the mean and variance of the logarithmic pitch frequency log f0, the subscripts X and Y referring to the source and target speech respectively.
Compared with the prior art, the invention adopting the technical scheme has the following technical effects:
1. the invention realizes the voice conversion under the condition of non-parallel texts, solves the problem that parallel linguistic data are not easy to obtain, and improves the universality and the practicability of the voice conversion system.
2. The invention uses the combination of AGMM and BLFW + AS to realize a voice conversion system, and the system can adaptively adjust the classification number of GMM according to the voice characteristic parameter distribution of different speakers, improves the voice quality while enhancing the voice personality similarity, and realizes high-quality voice conversion.
Drawings
FIG. 1 is a schematic diagram of voice conversion under the non-parallel text condition according to the present invention.
FIG. 2 is a flow chart of adaptive Gaussian mixture model training.
FIG. 3 is a comparison of the speech spectra of the converted speech.
Detailed Description
The technical scheme of the invention is further explained in detail by combining the attached drawings:
it will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
The high-quality voice conversion method of the invention is divided into two parts: the training part is used for obtaining parameters and conversion functions required by voice conversion, and the conversion part is used for realizing the conversion from the voice of a source speaker to the voice of a target speaker.
As shown in fig. 1, the training part implements the steps of:
Step 1, inputting the non-parallel speech corpora of a source speaker and a target speaker, wherein the non-parallel corpora are taken from the CMU_US_ARCTIC corpus, established by the Language Technologies Institute of Carnegie Mellon University; the speech in the corpus was recorded by 5 men and 2 women, each speaker recording 1132 utterances of 1–6 s in length.
Step 2, the invention uses the AHOcoder speech analysis model to respectively extract the Mel-Frequency Cepstral Coefficients (MFCC) X, Y and the logarithmic fundamental frequency parameters log f0X and log f0Y of the source and target speakers, wherein AHOcoder is a high-performance speech analysis and synthesis tool developed by the Aholab Signal Processing Laboratory in Bilbao, Spain;
and 3, performing voice characteristic parameter alignment and Dynamic Time Warping (DTW) combining unit selection (UnitSelection) and Vocal Tract Length normalization (VTLN, Vocal Tract Length No. 6 normalization) on the MFCC parameters X, Y of the source and target voices in the step 2. The specific process of aligning the voice characteristic parameters is as follows:
3-1) performing vocal tract length normalization processing on the source speech characteristic parameters by adopting the bilinear frequency warping method, so that the formants of the source speech approach those of the target speech, thereby increasing the accuracy of unit selection of the target speech characteristic parameters.
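The vocal-tract-length normalization of step 3-1) can be sketched as follows. This is a minimal illustration, not the AHOcoder processing path: it applies the standard bilinear (all-pass) frequency warping function to a magnitude spectrum sampled on a uniform grid, where the warping factor α is an assumed free parameter (α = 0 leaves the spectrum unchanged, positive α shifts formants upward).

```python
import numpy as np

def bilinear_warp(omega, alpha):
    """Bilinear frequency warping: maps omega in [0, pi] to the warped
    frequency; alpha in (-1, 1) controls the warp strength, and
    alpha = 0 is the identity map."""
    return omega + 2.0 * np.arctan2(alpha * np.sin(omega),
                                    1.0 - alpha * np.cos(omega))

def warp_spectrum(mag, alpha):
    """Warp a magnitude spectrum sampled on K uniform frequencies in
    [0, pi] by resampling it at the inverse-warped frequencies
    (bilinear maps compose as a group, with alpha -> -alpha giving the
    inverse). Linear interpolation; a sketch only."""
    K = len(mag)
    omega = np.linspace(0.0, np.pi, K)
    source = bilinear_warp(omega, -alpha)
    return np.interp(source, omega, mag)
```

Because the map is monotonic and fixes the endpoints 0 and π, the warped spectrum keeps the same band edges while formant positions move.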
3-2) for given N source speech characteristic parameter vectors {Xk}, the N target speech characteristic parameter vectors {Yk} can be dynamically searched through formula (1) such that the distance cost function value C({Yk}) is minimized. Two factors are considered in the unit selection process: on one hand, the spectral distance between the aligned source speech characteristic parameter vectors and the characteristic parameter vectors of the target speech should be minimized, to enhance the matching degree of the phoneme information; on the other hand, the selected target speech characteristic parameter vectors should have frame continuity, so that the phoneme information is more complete.
C({Yk})=C1({Yk})+C2({Yk}) (1)
Wherein C1({Yk}) and C2({Yk}) may be respectively represented by the following formulae:

C1({Yk}) = γ Σk=1..N D(Xk, Yk)  (2)

C2({Yk}) = (1 − γ) Σk=2..N D(Yk−1, Yk)  (3)

wherein the function D(Xk, Yk) represents the spectral distance between the source and target characteristic parameter vectors; the Euclidean distance is used as the distance measure in the invention. The parameter γ represents the balance coefficient between the accuracy of characteristic parameter frame alignment and the continuity between frames, with 0 ≤ γ ≤ 1. C1({Yk}) is the spectral distance cost function between the source speech characteristic parameter vectors and the characteristic parameter vectors of the target speech, and C2({Yk}) represents the spectral distance cost function between the characteristic parameter vectors of the unit-selected target speech.
3-3) obtaining the characteristic parameter sequence set {Ŷk} aligned with the source speech characteristic parameter vectors by carrying out multiple linear regression analysis on formula (1), namely:

{Ŷk} = argmin{Yk} C({Yk})  (4)
for the solution of equation (4), a Viterbi (Viterbi) search method may be used to optimize the execution efficiency of the algorithm.
The non-parallel MFCC parameters X, Y are transformed to be parallel through the steps described above.
And 4, establishing an Adaptive Gaussian Mixture Model (AGMM), training by adopting the Expectation-Maximization (EM) algorithm, and obtaining the initial values for EM training by using the K-means iterative method. The AGMM parameters λ and P(X|λ) are obtained by training.
As shown in fig. 2, training the AGMM parameters with the adaptive clustering algorithm first requires a comprehensive analysis of the weight coefficients, mean vectors and covariance matrices of each Gaussian component and the Euclidean distances between characteristic parameter vectors, dynamically adjusting the Gaussian mixture number. The training process is as follows:
4-1) setting the AGMM initial mixture number M, the Gaussian component weight coefficient thresholds t1 and t2, the Euclidean distance threshold D between characteristic parameter vectors, and the covariance threshold σ.
4-2) obtaining the initial values for EM training by using the K-means iterative algorithm.
4-3) iterative training using EM algorithm.
The conventional Gaussian mixture model is represented as follows:

P(X|λ) = Σi=1..M P(wi) N(X; μi, Σi)  (5)

wherein X is a P-dimensional speech characteristic parameter vector; the invention adopts P = 39. P(wi) represents the weight coefficient of each Gaussian component, satisfying Σi=1..M P(wi) = 1; M is the number of Gaussian components; N(X; μi, Σi) is the P-dimensional joint Gaussian probability distribution of a Gaussian component, expressed as follows:

N(X; μi, Σi) = (2π)^(−P/2) |Σi|^(−1/2) exp(−(1/2)(X − μi)T Σi^(−1)(X − μi))  (6)

wherein μi is the mean vector and Σi is the covariance matrix; λ = {P(wi), μi, Σi} is the model parameter set of the GMM model. The estimation of λ can be realized by the Maximum Likelihood (ML) estimation method, maximizing the conditional probability P(X|λ); for the speech characteristic parameter vector set X = {Xn, n = 1,2,...,N}:

P(X|λ) = Πn=1..N P(Xn|λ)  (7)

at this time:

λ = argmaxλ P(X|λ)  (8)

Solving equation (8) may use the EM algorithm, the iteration condition satisfying P(X|λk) ≥ P(X|λk−1) during the EM calculation process, where k is the iteration number, until the model parameters λ converge. The iterative formulas for the weight coefficient P(wi), mean vector μi and covariance matrix Σi of the Gaussian components are as follows:

P(wi) = (1/N) Σn=1..N p(wi|Xn, λ)  (9)

μi = Σn=1..N p(wi|Xn, λ) Xn / Σn=1..N p(wi|Xn, λ)  (10)

Σi = Σn=1..N p(wi|Xn, λ)(Xn − μi)(Xn − μi)T / Σn=1..N p(wi|Xn, λ)  (11)
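The EM re-estimation of step 4-3) can be sketched as follows. This is a simplified stand-in, not the patent's implementation: diagonal covariances are assumed for brevity (the text does not restrict the covariance form), and the initialization is a crude deterministic substitute for the K-means initialization of step 4-2).

```python
import numpy as np

def em_gmm(X, M, iters=50):
    """Standard EM re-estimation for a GMM with diagonal covariances.
    Returns the weights P(w_i), means mu_i and per-dimension variances,
    i.e. the parameter set lambda of formulas (9)-(11)."""
    N, P = X.shape
    w = np.full(M, 1.0 / M)
    # crude deterministic init (stand-in for K-means): M frames spread
    # along the first coordinate
    idx = np.argsort(X[:, 0])[np.linspace(0, N - 1, M).astype(int)]
    mu = X[idx].astype(float).copy()
    var = np.tile(X.var(axis=0) + 1e-6, (M, 1))
    for _ in range(iters):
        # E-step: responsibilities p(w_i | X_n, lambda)
        logp = (-0.5 * (((X[:, None, :] - mu) ** 2) / var
                        + np.log(2 * np.pi * var)).sum(axis=2)
                + np.log(w))
        logp -= logp.max(axis=1, keepdims=True)   # numerical stability
        r = np.exp(logp)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimation formulas for weights, means, variances
        nk = r.sum(axis=0) + 1e-12
        w = nk / N
        mu = (r.T @ X) / nk[:, None]
        var = (r.T @ (X ** 2)) / nk[:, None] - mu ** 2 + 1e-6
    return w, mu, var
```

Each iteration provably does not decrease the likelihood P(X|λ), matching the stated iteration condition P(X|λk) ≥ P(X|λk−1).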
4-4) if some Gaussian component N(P(wi), μi, Σi) in the trained model has a weight coefficient less than t1, and the Euclidean distance between it and its nearest neighbour component N(P(wj), μj, Σj) is less than the threshold D, the two components are considered to contain little and similar information, and can be merged;
at this moment the number of Gaussian components becomes M−1 and the process returns to step 4-3) for the next round of training; if several Gaussian components satisfy the merging condition, the pair with the smallest distance is selected for merging.
4-5) if some Gaussian component N(P(wi), μi, Σi) in the trained model has a weight coefficient greater than t2, and the variance of at least one dimension in its covariance matrix (the diagonal elements of the covariance matrix are the variances) is greater than σ, the Gaussian component is considered to contain too much information and should be split;
wherein E is an all-ones column vector and n is used to adjust the Gaussian distribution; after splitting, the number of Gaussian components becomes M+1; if several Gaussian components satisfy the splitting condition, the component with the largest weight coefficient is selected for splitting, and the process returns to step 4-3) for the next round of training.
4-6) finishing the AGMM training to obtain a posterior conditional probability matrix P (X | lambda), and storing the lambda.
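The adaptive merge/split checks of steps 4-4) and 4-5) can be sketched as one adjustment pass over diagonal-covariance components. The merge and split formulas below are plausible stand-ins only, since the patent gives formulas (12)–(14) as figures that are not reproduced here: the merge combines moments in proportion to the weights, and the split perturbs the mean along the all-ones vector E with adjustment factor n, as the text describes.

```python
import numpy as np

def adapt_components(w, mu, var, t1, t2, D, sigma, n=0.5):
    """One adaptive pass: merge a component with weight < t1 into its
    nearest neighbour if their Euclidean distance < D (4-4); otherwise
    split the largest component if its weight > t2 and some dimension's
    variance > sigma (4-5). Returns possibly updated (w, mu, var)."""
    M = len(w)
    # --- merge check (4-4), lowest-weight candidates first ---
    for i in np.argsort(w):
        if w[i] >= t1:
            break
        dist = np.linalg.norm(mu - mu[i], axis=1)
        dist[i] = np.inf
        j = int(np.argmin(dist))
        if dist[j] < D:
            wm = w[i] + w[j]
            mum = (w[i] * mu[i] + w[j] * mu[j]) / wm     # weighted mean
            varm = (w[i] * var[i] + w[j] * var[j]) / wm  # weighted variance
            keep = [k for k in range(M) if k not in (i, j)]
            return (np.append(w[keep], wm),
                    np.vstack([mu[keep], mum]),
                    np.vstack([var[keep], varm]))
    # --- split check (4-5) ---
    i = int(np.argmax(w))
    if w[i] > t2 and np.any(var[i] > sigma):
        E = np.ones_like(mu[i])              # all-ones column vector E
        shift = n * np.sqrt(var[i]) * E      # perturbation scaled by n
        w2 = np.concatenate([np.delete(w, i), [w[i] / 2, w[i] / 2]])
        mu2 = np.vstack([np.delete(mu, i, axis=0), mu[i] + shift, mu[i] - shift])
        var2 = np.vstack([np.delete(var, i, axis=0), var[i], var[i]])
        return w2, mu2, var2
    return w, mu, var
```

In the AGMM loop this pass would alternate with EM re-training (return to step 4-3)) until no component satisfies either condition.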
Step 5, training by using the source speech characteristic parameters X and target speech characteristic parameters Y obtained in step 3 and the posterior conditional probability matrix P(X|λ) obtained in step 4 to obtain the frequency warping factor and the amplitude adjustment factor, thereby constructing the Bilinear Frequency Warping (BLFW) plus Amplitude Adjustment (AS) speech conversion function, which is expressed as follows:
F(x)=Wα(x,λ)x+s(x,λ) (15)
establishing the conversion relationship between the source speech pitch frequency and the target speech pitch frequency:

log f0Y′ = μY + (σY/σX)(log f0X′ − μX)  (19)

wherein μ and σ² denote the mean and variance of the logarithmic pitch frequency log f0, with subscripts X and Y referring to the source and target speech respectively.
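The pitch conversion above amounts to matching the mean and variance of log f0 between the two speakers, and can be sketched in a few lines (function name and Hz interface are illustrative; the statistics μ, σ are computed over each speaker's training corpus):

```python
import math

def convert_f0(f0_src_hz, mu_x, sigma_x, mu_y, sigma_y):
    """Convert one source pitch value (Hz) by mean/variance matching in
    the log domain:
        log f0_Y' = mu_Y + (sigma_Y / sigma_X) * (log f0_X' - mu_X)
    mu_*, sigma_* are the mean and standard deviation of log f0 over the
    source (X) and target (Y) training corpora."""
    log_f0 = math.log(f0_src_hz)
    return math.exp(mu_y + (sigma_y / sigma_x) * (log_f0 - mu_x))
```

A frame at the source's average pitch is thus mapped exactly to the target's average pitch, while relative intonation excursions are rescaled by σY/σX.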
As shown in fig. 1, the conversion part includes the following steps:
step 6, inputting the voice of a source speaker to be converted;
step 7, extracting the 39-order MFCC characteristic parameters X′ of the source speaker voice and the source speech logarithmic pitch frequency parameter log f0X′ by using the AHOcoder speech analysis model;
step 8, substituting the characteristic parameters X′ extracted in step 7 into formula (5), using λ = {P(wi), μi, Σi} obtained during the AGMM training in step 4, to obtain the posterior conditional probability matrix P′(X|λ);
step 9, substituting the frequency warping factor α(x,λ) and the amplitude adjustment factor s(x,λ) obtained by the BLFW+AS training in step 5, together with the posterior conditional probability matrix P′(X|λ) obtained in step 8, into formulas (15), (16), (17) and (18) to obtain the MFCC characteristic parameters Y′ of the converted speech;
step 10, substituting the source speech logarithmic pitch frequency parameter log f0X′ obtained in step 7 into formula (19) to obtain the logarithmic pitch frequency parameter log f0Y′ of the converted speech;
step 11, using the AHOdecoder speech synthesis model with Y′ from step 9 and log f0Y′ from step 10 as input to obtain the converted speech.
Further, as shown in fig. 3, the spectrogram of the converted speech obtained by the method of the present invention is compared with that of the INCA method; the conversion direction is F1–M2 (female voice 1 to male voice 2), which further verifies that the method of the present invention achieves higher spectral similarity than the INCA method. The INCA method is proposed in the literature (Erro D, Moreno A, Bonafonte A. INCA algorithm for training voice conversion systems from nonparallel corpora [J]. IEEE Transactions on Audio, Speech, and Language Processing, 2010, 18(5): 944–953.).
The foregoing is only a partial embodiment of the present invention. It should be noted that those skilled in the art can make various modifications and improvements without departing from the principle of the present invention, and such modifications and improvements should also be regarded as falling within the protection scope of the present invention.
Claims (7)
1. A speech conversion method under the condition of non-parallel texts based on adaptive Gaussian clustering is characterized by comprising a training stage and a conversion stage, wherein the training stage comprises the following steps:
step 1, inputting non-parallel training corpora of a source speaker and a target speaker;
step 2, using the AHOcoder speech analysis model to respectively extract the MFCC feature parameters X of the source speaker's non-parallel training corpus, the MFCC feature parameters Y of the target speaker's non-parallel training corpus, the source speech fundamental frequency log f0X and the target speech fundamental frequency log f0Y;
Step 3, aligning the MFCC feature parameters X, Y of step 2 by dynamic time warping combined with unit selection and vocal tract length normalization, thereby converting the non-parallel corpora into parallel corpora;
step 4, performing adaptive Gaussian mixture model (AGMM) training using the expectation-maximization (EM) algorithm; upon completion of the AGMM training, obtaining the posterior conditional probability matrix P(X|λ) and storing the AGMM parameters λ;
step 5, using the source speech feature parameters X and the target speech feature parameters Y obtained in step 3 and the posterior conditional probability matrix P(X|λ) of step 4, performing bilinear frequency warping (BLFW) + amplitude adjustment (AS) training to obtain the frequency warping factor α(X, λ) and the amplitude adjustment factor s(X, λ), thereby constructing the BLFW + AS conversion function; and using the mean and variance of the logarithmic fundamental frequency, establishing a fundamental frequency conversion function between the source speech fundamental frequency log f0X and the target speech fundamental frequency log f0Y;
the conversion phase comprises the following steps:
step 6, inputting the voice of a source speaker to be converted;
step 7, extracting the MFCC feature parameters X′ and the logarithmic fundamental frequency log f0X′ of the source speaker's voice using the AHOcoder speech analysis model;
Step 8, obtaining the posterior conditional probability matrix P′(X|λ) using the parameters λ obtained in the AGMM training of step 4;
step 9, obtaining a converted MFCC characteristic parameter Y' by using the BLFW + AS conversion function obtained in the step 5;
step 10, applying the fundamental frequency conversion function obtained in step 5 to the logarithmic fundamental frequency log f0X′ to obtain the converted logarithmic fundamental frequency log f0Y′;
Step 11, using the AHOdecoder speech synthesis model to synthesize the converted speech from the converted MFCC feature parameters Y′ and the converted logarithmic fundamental frequency log f0Y′.
2. The speech conversion method according to claim 1, wherein the step 3 comprises the following steps:
3-1) carrying out vocal tract length normalization processing on the source speech MFCC feature parameters by means of bilinear frequency warping;
3-2) for the N given source-speech MFCC feature parameter vectors {Xk}, dynamically searching the N target speech feature parameter vectors {Yk} through formula (1) so that the distance cost function value C({Yk}) is minimized;
C({Yk})=C1({Yk})+C2({Yk}) (1)
wherein C1({Yk}) and C2({Yk}) are respectively expressed by the following formulae:

C1({Yk}) = Σk=1N D(Xk, Yk)   (2)

C2({Yk}) = γ Σk=2N D(Yk, Yk−1)   (3)

wherein the function D(Xk, Yk) represents the spectral distance between the source and target speech feature parameter vectors, the function D(Yk, Yk−1) represents the spectral distance between the unit-selected target speech feature parameter vectors, and the parameter γ (0 ≤ γ ≤ 1) is the balance coefficient between the accuracy of feature-parameter frame alignment and the inter-frame continuity; C1({Yk}) is the spectral distance cost function between the source and target speech feature parameter vectors, and C2({Yk}) is the spectral distance cost function between the unit-selected target speech feature parameter vectors;
3-3) obtaining the set of target speech feature parameter sequences {Ŷk} aligned with the source speech feature parameter vectors by performing multiple linear regression analysis on formula (1), namely:

{Ŷk} = arg min{Yk} C({Yk})   (4)
Through the above steps, the MFCC features X, Y of the non-parallel corpora are transformed into an aligned feature parameter set approximating a parallel corpus.
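By way of illustration (not part of the patent text), the alignment search of formulas (1)-(4) can be sketched as a Viterbi search over target frames; the Euclidean frame distance, the placement of the cost weights, and the value of gamma below are illustrative assumptions:

```python
import numpy as np

def align_unit_selection(X, Y, gamma=0.3):
    """Viterbi search for a target-frame index sequence (one per source
    frame X_k) minimizing an alignment cost plus a gamma-weighted
    inter-frame continuity cost, in the spirit of formula (1)."""
    N, T = len(X), len(Y)
    # D(X_k, Y_j): distance of every source frame to every target frame
    dist = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)
    # D(Y_i, Y_j): continuity cost between consecutively selected frames
    cont = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)
    cost = dist[0].copy()
    back = np.zeros((N, T), dtype=int)
    for k in range(1, N):
        total = cost[:, None] + gamma * cont          # [previous, current]
        back[k] = np.argmin(total, axis=0)
        cost = total[back[k], np.arange(T)] + dist[k]
    # backtrack the optimal target-frame index sequence
    path = [int(np.argmin(cost))]
    for k in range(N - 1, 0, -1):
        path.append(int(back[k][path[-1]]))
    path.reverse()
    return path
```

With gamma = 0 the search degenerates to nearest-frame matching; larger gamma trades alignment accuracy for smoother target-frame sequences.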
3. The speech conversion method according to claim 2, wherein formula (4) is solved using a Viterbi search so as to improve the execution efficiency of the algorithm.
4. The speech conversion method of claim 1, wherein the training process of step 4 is as follows:
4-1) setting the initial AGMM mixture number M, the Gaussian component weight coefficient thresholds t1 and t2, the Euclidean distance threshold D between feature parameter vectors, and the covariance threshold σ;
4-2) obtaining initial values for the EM training using the K-means iterative algorithm;
4-3) carrying out iterative training using the EM algorithm; the Gaussian mixture model (GMM) is expressed as follows:

P(X|λ) = Σi=1M P(wi) N(X; μi, Σi)   (5)

wherein X is a P-dimensional speech feature parameter vector, P(wi) is the weight coefficient of the i-th Gaussian component with Σi=1M P(wi) = 1, M is the number of Gaussian components, and N(X; μi, Σi) is the P-dimensional joint Gaussian probability distribution of a Gaussian component, expressed as follows:

N(X; μi, Σi) = [1 / ((2π)P/2 |Σi|1/2)] exp(−(1/2)(X − μi)T Σi−1 (X − μi))   (6)
wherein μi is the mean vector and Σi is the covariance matrix; λ = {P(wi), μi, Σi} is the set of model parameters of the GMM, and the estimation of λ is realized by maximum likelihood estimation; for a speech feature parameter vector set X = {Xn, n = 1, 2, ..., N}:

P(X|λ) = Πn=1N P(Xn|λ)   (7)

at this time:

λ = arg maxλ P(X|λ)   (8)
Formula (8) is solved with the EM algorithm, the iteration condition satisfying P(X|λk) ≥ P(X|λk−1) during the EM calculation, where k is the number of iterations, until the model parameters λ are obtained; in the iteration, the weight coefficient P(wi), mean vector μi and covariance matrix Σi of each Gaussian component are updated as follows:

P(wi) = (1/N) Σn=1N P(wi|Xn, λ)   (9)

μi = [Σn=1N P(wi|Xn, λ) Xn] / [Σn=1N P(wi|Xn, λ)]   (10)

Σi = [Σn=1N P(wi|Xn, λ)(Xn − μi)(Xn − μi)T] / [Σn=1N P(wi|Xn, λ)]   (11)
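By way of illustration (not part of the patent text), one EM iteration in the spirit of formulas (9)-(11) can be sketched as follows; the sketch assumes diagonal covariances for simplicity, whereas the patent works with full covariance matrices:

```python
import numpy as np

def em_step(X, weights, means, variances):
    """One EM iteration for a diagonal-covariance GMM: E-step computes
    responsibilities P(w_i | X_n, lambda); M-step re-estimates weights,
    means and variances as in formulas (9)-(11)."""
    N, P = X.shape
    M = len(weights)
    # E-step: per-component log densities, then normalized responsibilities
    log_prob = np.zeros((N, M))
    for i in range(M):
        diff = X - means[i]
        log_prob[:, i] = (np.log(weights[i])
                          - 0.5 * np.sum(np.log(2.0 * np.pi * variances[i]))
                          - 0.5 * np.sum(diff ** 2 / variances[i], axis=1))
    log_prob -= log_prob.max(axis=1, keepdims=True)   # numerical stability
    resp = np.exp(log_prob)
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: re-estimate weights (9), means (10) and variances (11)
    Nk = resp.sum(axis=0)
    new_weights = Nk / N
    new_means = (resp.T @ X) / Nk[:, None]
    new_vars = np.stack([(resp[:, i:i + 1] * (X - new_means[i]) ** 2).sum(axis=0) / Nk[i]
                         for i in range(M)])
    return new_weights, new_means, new_vars
```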
4-4) if some Gaussian component N(P(wi), μi, Σi) in the trained model has a weight coefficient smaller than t1, and the Euclidean distance between it and its nearest neighboring component N(P(wj), μj, Σj) is smaller than the threshold D, the two are merged:

P(w′) = P(wi) + P(wj)   (12)

μ′ = [P(wi)μi + P(wj)μj] / P(w′),  Σ′ = [P(wi)Σi + P(wj)Σj] / P(w′)   (13)

at which point the number of Gaussian components becomes M − 1, and the process returns to step 4-3) for the next round of training; if several Gaussian components satisfy the merging condition, the pair with the smallest distance is merged;
4-5) if some Gaussian component N(P(wi), μi, Σi) in the trained model has a weight coefficient greater than t2, and the variance of at least one dimension in its covariance matrix is greater than σ, the Gaussian component is considered to contain too much information and is split:

P(wi1) = P(wi2) = P(wi)/2,  μi1 = μi + n·e,  μi2 = μi − n·e,  Σi1 = Σi2 = Σi   (14)

wherein e is a column vector of all ones and n is used to adjust the Gaussian distribution; after splitting, the number of Gaussian components becomes M + 1; if several Gaussian components satisfy the splitting condition, the component with the largest weight coefficient is split, and the process returns to step 4-3) for the next round of training;
4-6) finishing the AGMM training to obtain a posterior conditional probability matrix P (X | lambda), and storing the lambda.
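By way of illustration (not part of the patent text), the merging and splitting operations of steps 4-4) and 4-5) can be sketched as follows; the weighted averaging of the covariances in the merge and the value of n are illustrative assumptions:

```python
import numpy as np

def merge_components(weights, means, covs, i, j):
    """Merge components i and j: sum the weights and take the
    weight-averaged mean and covariance (step 4-4)."""
    w = weights[i] + weights[j]
    mu = (weights[i] * means[i] + weights[j] * means[j]) / w
    cov = (weights[i] * covs[i] + weights[j] * covs[j]) / w
    return w, mu, cov

def split_component(weights, means, covs, i, n=0.2):
    """Split component i into two components with halved weights and
    means perturbed by +/- n along the all-ones vector e (step 4-5)."""
    e = np.ones_like(means[i])
    half = weights[i] / 2.0
    return (half, means[i] + n * e, covs[i].copy()), \
           (half, means[i] - n * e, covs[i].copy())
```

After either operation the mixture size changes by one and EM training resumes, matching the adaptive loop described above.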
5. The method of claim 4, wherein P = 39.
6. The speech conversion method of claim 1, wherein the BLFW + AS conversion function constructed in step 5 is expressed as follows:
F(x)=Wα(x,λ)x+s(x,λ) (15)
wherein Wα(x,λ) is the bilinear frequency warping matrix determined by the warping factor α(x, λ), and α(x, λ) and s(x, λ) are the weighted combinations of the frequency warping factors and amplitude adjustment factors of all Gaussian components of the AGMM model λ:

α(x, λ) = Σm=1M pm(x, λ) αm   (16)

s(x, λ) = Σm=1M pm(x, λ) sm   (17)

pm(x, λ) = P(wm) N(x; μm, Σm) / Σj=1M P(wj) N(x; μj, Σj)   (18)

wherein M is the number of Gaussian components of the Gaussian mixture model in step 4, pm(x, λ) represents the posterior probability of the m-th Gaussian component of the speech feature vector x in the AGMM model λ, and αm and sm are respectively the frequency warping factor and the amplitude adjustment factor of the m-th Gaussian component in the AGMM model λ.
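By way of illustration (not part of the patent text), the posterior-weighted combination of per-component factors can be sketched as follows; constructing the bilinear warping matrix Wα itself is omitted, and the function name is hypothetical:

```python
import numpy as np

def weighted_factors(posterior, alpha_m, s_m):
    """Combine per-component warping factors alpha_m (length M) and
    amplitude vectors s_m (shape M x P) with the posterior probabilities
    p_m(x, lambda), in the spirit of formulas (16)-(17)."""
    alpha = float(posterior @ alpha_m)   # scalar warping factor alpha(x, lambda)
    s = posterior @ s_m                  # amplitude adjustment vector s(x, lambda)
    return alpha, s
```

The converted frame is then F(x) = Wα(x,λ) x + s(x,λ), with Wα built from the combined scalar factor.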
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710474281.8A CN107301859B (en) | 2017-06-21 | 2017-06-21 | Voice conversion method under non-parallel text condition based on self-adaptive Gaussian clustering |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107301859A CN107301859A (en) | 2017-10-27 |
CN107301859B true CN107301859B (en) | 2020-02-21 |
Family
ID=60136451
Families Citing this family (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107945791B (en) * | 2017-12-05 | 2021-07-20 | 华南理工大学 | Voice recognition method based on deep learning target detection |
CN108198566B (en) * | 2018-01-24 | 2021-07-20 | 咪咕文化科技有限公司 | Information processing method and device, electronic device and storage medium |
CN108777140B (en) * | 2018-04-27 | 2020-07-28 | 南京邮电大学 | Voice conversion method based on VAE under non-parallel corpus training |
CN109671423B (en) * | 2018-05-03 | 2023-06-02 | 南京邮电大学 | Non-parallel text-to-speech conversion method under limited training data |
CN110556092A (en) * | 2018-05-15 | 2019-12-10 | 中兴通讯股份有限公司 | Speech synthesis method and device, storage medium and electronic device |
US11605371B2 (en) * | 2018-06-19 | 2023-03-14 | Georgetown University | Method and system for parametric speech synthesis |
CN109377978B (en) * | 2018-11-12 | 2021-01-26 | 南京邮电大学 | Many-to-many speaker conversion method based on i vector under non-parallel text condition |
CN109326283B (en) * | 2018-11-23 | 2021-01-26 | 南京邮电大学 | Many-to-many voice conversion method based on text encoder under non-parallel text condition |
CN109671442B (en) * | 2019-01-14 | 2023-02-28 | 南京邮电大学 | Many-to-many speaker conversion method based on STARGAN and x vectors |
CN110164463B (en) * | 2019-05-23 | 2021-09-10 | 北京达佳互联信息技术有限公司 | Voice conversion method and device, electronic equipment and storage medium |
CN110782908B (en) * | 2019-11-05 | 2020-06-16 | 广州欢聊网络科技有限公司 | Audio signal processing method and device |
CN111640453B (en) * | 2020-05-13 | 2023-06-16 | 广州国音智能科技有限公司 | Spectrogram matching method, device, equipment and computer readable storage medium |
US20210383790A1 (en) * | 2020-06-05 | 2021-12-09 | Google Llc | Training speech synthesis neural networks using energy scores |
CN113112999B (en) * | 2021-05-28 | 2022-07-12 | 宁夏理工学院 | Short word and sentence voice recognition method and system based on DTW and GMM |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2003022088A (en) * | 2001-07-10 | 2003-01-24 | Sharp Corp | Device and method for speaker's features extraction, voice recognition device, and program recording medium |
CN101751921A (en) * | 2009-12-16 | 2010-06-23 | 南京邮电大学 | Real-time voice conversion method under conditions of minimal amount of training data |
CN102063899A (en) * | 2010-10-27 | 2011-05-18 | 南京邮电大学 | Method for voice conversion under unparallel text condition |
CN103280224A (en) * | 2013-04-24 | 2013-09-04 | 东南大学 | Voice conversion method under asymmetric corpus condition on basis of adaptive algorithm |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||