CN107301859B - Voice conversion method under non-parallel text condition based on self-adaptive Gaussian clustering - Google Patents

Voice conversion method under non-parallel text condition based on self-adaptive Gaussian clustering

Info

Publication number
CN107301859B
Authority
CN
China
Prior art keywords
speech
training
gaussian
voice
conversion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710474281.8A
Other languages
Chinese (zh)
Other versions
CN107301859A (en)
Inventor
李燕萍
左宇涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN201710474281.8A priority Critical patent/CN107301859B/en
Publication of CN107301859A publication Critical patent/CN107301859A/en
Application granted granted Critical
Publication of CN107301859B publication Critical patent/CN107301859B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/065 Adaptation
    • G10L 15/07 Adaptation to the speaker
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/08 Speech classification or search
    • G10L 15/14 Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 Speaker identification or verification techniques
    • G10L 17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L 19/02 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders
    • G10L 19/032 Quantisation or dequantisation of spectral components
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/003 Changing voice quality, e.g. pitch or formants
    • G10L 21/007 Changing voice quality, e.g. pitch or formants characterised by the process used
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L 25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Other Investigation Or Analysis Of Materials By Electrical Means (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a voice conversion method under non-parallel text conditions based on adaptive Gaussian clustering, belonging to the technical field of speech signal processing. First, a method combining unit selection and vocal tract length normalization is used to align the speech feature parameters of the non-parallel corpora; then an adaptive Gaussian mixture model and bilinear frequency warping plus amplitude adjustment are trained to obtain the conversion function required for voice conversion; finally, the conversion function is applied to realize high-quality voice conversion. The invention overcomes the limitation that parallel corpora are required in the training stage and realizes voice conversion under non-parallel text conditions, giving it stronger applicability and universality. In addition, the adaptive Gaussian mixture model replaces the traditional Gaussian mixture model, which solves the problem of inaccurate classification of speech feature parameters by the Gaussian mixture model; its combination with bilinear frequency warping and amplitude adjustment yields better speaker similarity and speech quality in the converted speech.

Description

Voice conversion method under non-parallel text condition based on self-adaptive Gaussian clustering
Technical Field
The invention relates to a voice conversion technology, in particular to a voice conversion method under a non-parallel text condition, and belongs to the technical field of voice signal processing.
Background
Speech conversion is an emerging branch of research in the field of speech signal processing in recent years, and is carried out and developed on the basis of research on speech analysis, recognition and synthesis.
The goal of speech conversion is to change the speech personality characteristics of the source speaker so that they carry the personality characteristics of the target speaker, i.e., to make speech uttered by one person sound, after conversion, as if it were spoken by another person, while keeping the semantic content unchanged.
Most speech conversion methods, especially GMM-based methods, require the training corpus to be parallel text, i.e., the source and target speakers must utter sentences with the same content and duration, with pronunciation rhythm and emotion kept as consistent as possible. In practical applications of voice conversion, however, obtaining a large amount of parallel corpora is very difficult, or even impossible, and the accuracy of the alignment of the speech feature parameter vectors during training also constrains the performance of the voice conversion system. Considering the universality and practicality of voice conversion systems, research on voice conversion methods under non-parallel text conditions therefore has great practical significance and application value.
At present there are mainly two approaches to voice conversion under non-parallel text conditions: methods based on speech clustering and methods based on parameter adaptation. Speech-clustering methods select corresponding speech units for conversion by measuring the distance between speech frames or under the guidance of phoneme information; in essence, under certain conditions they turn non-parallel texts into parallel texts for processing. The principle is simple, but the content of the speech text must be pre-extracted, and the quality of this pre-extraction directly affects the conversion quality. Parameter-adaptation methods process the parameters of the conversion model with speaker normalization or adaptation techniques borrowed from speech recognition; in essence, a pre-established model is adapted toward a model of the target speaker. Such methods make reasonable use of pre-stored speaker information, but the adaptation process generally smooths the spectrum, which weakens the speaker's personality information in the converted speech.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a voice conversion method that, under non-parallel text conditions, adaptively determines the number of GMM mixture components according to the characteristics of different target speakers, thereby enhancing the speaker's individual characteristics in the converted speech and improving the quality of the converted speech.
The invention adopts the following technical scheme for solving the technical problems:
the invention provides a voice conversion method under a non-parallel text condition based on self-adaptive Gaussian clustering, which comprises a training stage and a conversion stage, wherein the training stage comprises the following steps:
step 1, inputting non-parallel training corpora of a source speaker and a target speaker;
step 2, using the AHOcoder speech analysis model, respectively extracting the MFCC feature parameters X of the source speaker's non-parallel training corpus, the MFCC feature parameters Y of the target speaker's non-parallel training corpus, the source speech log fundamental frequency log f0X, and the target speech log fundamental frequency log f0Y;
Step 3, performing speech feature parameter alignment and dynamic time warping, combining unit selection and vocal tract length normalization, on the MFCC feature parameters X and Y from step 2, thereby converting the non-parallel corpora into parallel corpora;
step 4, performing adaptive Gaussian mixture model (AGMM) training by using the expectation-maximization (EM) algorithm, finishing the AGMM training to obtain the posterior conditional probability matrix P(X|λ), and storing the AGMM parameters λ;
step 5, utilizing the source speech feature parameters X and target speech feature parameters Y obtained in step 3 and the posterior conditional probability matrix P(X|λ) from step 4, carrying out bilinear frequency warping (BLFW) + amplitude adjustment (AS) training to obtain the frequency warping factor α(x, λ) and the amplitude adjustment factor s(x, λ), so as to construct the BLFW+AS conversion function; and utilizing the mean and variance of the log fundamental frequency to establish the fundamental frequency conversion function between the source speech log f0X and the target speech log f0Y;
the conversion phase comprises the following steps:
step 6, inputting the voice of a source speaker to be converted;
step 7, extracting the MFCC feature parameters X′ and the log fundamental frequency log f0X′ of the source speaker's speech by using the AHOcoder speech analysis model;
Step 8, using the parameters λ obtained in the AGMM training of step 4 to obtain the posterior conditional probability matrix P′(X|λ);
step 9, obtaining a converted MFCC characteristic parameter Y' by using the BLFW + AS conversion function obtained in the step 5;
step 10, using the fundamental frequency conversion function obtained in step 5 to convert the log fundamental frequency log f0X′, obtaining the converted log fundamental frequency log f0Y′;
Step 11, using the AHOdecoder speech synthesis model to synthesize the converted speech from the converted MFCC feature parameters Y′ and the converted log fundamental frequency log f0Y′.
Further, the voice conversion method provided by the present invention comprises the following specific steps in step 3:
3-1) carrying out vocal tract length normalization on the source speech MFCC feature parameters by adopting a bilinear frequency warping method;
3-2) for a given set of N source speech MFCC feature parameter vectors {X_k}, dynamically searching N target speech feature parameter vectors {Y_k} through formula (1) such that the distance cost function value C({Y_k}) is minimized;
C({Y_k}) = C_1({Y_k}) + C_2({Y_k})    (1)
wherein C_1({Y_k}) and C_2({Y_k}) are respectively given by:

C_1({Y_k}) = (1 − γ) Σ_{k=1}^{N} D(X_k, Y_k)    (2)

C_2({Y_k}) = γ Σ_{k=2}^{N} D(Y_k, Y_{k−1})    (3)

wherein the function D(X_k, Y_k) represents the spectral distance between the source speech and target speech feature parameter vectors, and the parameter γ represents the balance coefficient between the accuracy of feature parameter frame alignment and the continuity between frames, with 0 ≤ γ ≤ 1; C_1({Y_k}) represents the spectral-distance cost function between the source speech feature parameter vectors and the target speech feature parameter vectors, and C_2({Y_k}) represents the spectral-distance cost function between the unit-selected target speech feature parameter vectors;
3-3) by performing multiple linear regression analysis on formula (1), obtaining the target speech feature parameter sequence set {Ŷ_k} aligned with the source speech feature parameter vectors, namely:

{Ŷ_k} = argmin_{Y_k} C({Y_k})    (4)
Through the above steps, the non-parallel MFCC features X and Y are transformed into parallel corpora.
Further, the speech conversion method proposed by the present invention optimizes the execution efficiency of the algorithm by using the Viterbi search method for the solution of formula (4).
Further, the training process of step 4 of the speech conversion method provided by the present invention is as follows:
4-1) setting the initial AGMM mixture number M, the Gaussian component weight coefficient thresholds t1 and t2, the Euclidean distance threshold D between feature parameter vectors, and the covariance threshold σ;
4-2) obtaining initial values for EM training by using a K-means iterative algorithm;
4-3) carrying out iterative training by using the EM algorithm; the Gaussian mixture model (GMM) is expressed as follows:

P(X|λ) = Σ_{i=1}^{M} P(w_i) N(X; μ_i, Σ_i)    (5)

wherein X is a P-dimensional speech feature parameter vector with P = 39; P(w_i) represents the weight coefficient of each Gaussian component, with

Σ_{i=1}^{M} P(w_i) = 1;

M is the number of Gaussian components, and N(X; μ_i, Σ_i) is the P-dimensional joint Gaussian probability distribution of a component, expressed as follows:

N(X; μ_i, Σ_i) = 1 / ((2π)^{P/2} |Σ_i|^{1/2}) · exp(−(1/2)(X − μ_i)^T Σ_i^{−1} (X − μ_i))    (6)

wherein μ_i is the mean vector and Σ_i is the covariance matrix; λ = {P(w_i), μ_i, Σ_i} is the set of model parameters of the GMM, and the estimation of λ is realized by the maximum likelihood estimation method; for the speech feature parameter vector set X = {x_n, n = 1, 2, ..., N} one has:

P(X|λ) = Π_{n=1}^{N} P(x_n|λ)    (7)
at this time:
λ = argmax_λ P(X|λ)    (8)
Equation (8) is solved using the EM algorithm; during the EM computation the iteration condition P(X|λ_k) ≥ P(X|λ_{k−1}) must hold, where k is the iteration index, until the model parameter λ converges. In each iteration the weight coefficients P(w_i), mean vectors μ_i, and covariance matrices Σ_i of the Gaussian components are updated as follows:

P(w_i|x_n, λ) = P(w_i) N(x_n; μ_i, Σ_i) / Σ_{j=1}^{M} P(w_j) N(x_n; μ_j, Σ_j)    (9)

P(w_i) = (1/N) Σ_{n=1}^{N} P(w_i|x_n, λ)    (10)

μ_i = Σ_{n=1}^{N} P(w_i|x_n, λ) x_n / Σ_{n=1}^{N} P(w_i|x_n, λ)    (11)

Σ_i = Σ_{n=1}^{N} P(w_i|x_n, λ)(x_n − μ_i)(x_n − μ_i)^T / Σ_{n=1}^{N} P(w_i|x_n, λ)    (12)
4-4) if some Gaussian component N(P(w_i), μ_i, Σ_i) in the trained model has a weight coefficient smaller than t1 and the Euclidean distance to its nearest neighbouring component N(P(w_j), μ_j, Σ_j) is smaller than the threshold D, the two components are merged:

(13) [merge rule giving the weight, mean vector, and covariance matrix of the merged component in terms of P(w_i), P(w_j), μ_i, μ_j, Σ_i, and Σ_j]

At this point the number of Gaussian components becomes M−1 and the procedure returns to step 4-3) for the next round of training; if several Gaussian components satisfy the merging condition, the pair with the smallest distance is selected for merging;
4-5) if some Gaussian component N(P(w_i), μ_i, Σ_i) in the trained model has a weight coefficient greater than t2 and the variance of at least one dimension of its covariance matrix is greater than σ, the component is considered to contain too much information and is split:

(14) [split rule defining the two new components from P(w_i), μ_i, Σ_i, the all-ones column vector E, and the adjustment factor n]

wherein E is a column vector of all ones and n is used to adjust the Gaussian distribution. After splitting, the number of Gaussian components becomes M+1; if several components satisfy the splitting condition, the one with the largest weight coefficient is selected for splitting, and the procedure returns to step 4-3) for the next round of training;
4-6) finishing the AGMM training to obtain the posterior conditional probability matrix P(X|λ), and storing λ.
Further, the BLFW+AS conversion function constructed in step 5 of the voice conversion method proposed by the present invention is expressed as follows:
F(x) = W_{α(x,λ)} x + s(x, λ)    (15)

α(x, λ) = Σ_{m=1}^{M} p_m(x) α_m    (16)

s(x, λ) = Σ_{m=1}^{M} p_m(x) s_m    (17)

wherein M is the number of Gaussian components of the mixture Gaussian model in step 4, W_{α(x,λ)} is the bilinear frequency warping matrix determined by α(x, λ), p_m(x) is the posterior probability of the m-th Gaussian component given the feature vector x, α_m and s_m are the frequency warping factor and amplitude adjustment factor of the m-th component, α(x, λ) represents the resulting frequency warping factor, and s(x, λ) represents the amplitude adjustment factor.
Further, in step 5 the speech conversion method provided by the present invention establishes the conversion relationship between the source speech pitch frequency and the target speech pitch frequency:

log f0Y = μ_Y + (σ_Y / σ_X)(log f0X − μ_X)

wherein μ_X, σ_X² and μ_Y, σ_Y² respectively represent the mean and variance of the logarithmic pitch frequency log f0 of the source speech and the target speech.
Compared with the prior art, the invention adopting the technical scheme has the following technical effects:
1. the invention realizes the voice conversion under the condition of non-parallel texts, solves the problem that parallel linguistic data are not easy to obtain, and improves the universality and the practicability of the voice conversion system.
2. The invention uses the combination of AGMM and BLFW + AS to realize a voice conversion system, and the system can adaptively adjust the classification number of GMM according to the voice characteristic parameter distribution of different speakers, improves the voice quality while enhancing the voice personality similarity, and realizes high-quality voice conversion.
Drawings
FIG. 1 is a schematic illustration of the non-parallel text to speech conversion of the present invention.
FIG. 2 is a flow chart of adaptive Gaussian mixture model training.
FIG. 3 is a comparison of the speech spectra of the converted speech.
Detailed Description
The technical scheme of the invention is further explained in detail by combining the attached drawings:
it will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
The high-quality voice conversion method of the invention is divided into two parts: the training part is used for obtaining parameters and conversion functions required by voice conversion, and the conversion part is used for realizing the conversion from the voice of a source speaker to the voice of a target speaker.
As shown in fig. 1, the training part implements the steps of:
step 1, inputting the non-parallel speech corpora of the source speaker and the target speaker. The non-parallel corpora are taken from the CMU_US_ARCTIC corpus, which was created by the Language Technologies Institute of Carnegie Mellon University; the speech in the corpus was recorded by 5 male and 2 female speakers, each of whom recorded 1132 utterances of unequal length between 1 s and 6 s.
Step 2, the invention uses AHOcoder speech analysis model to respectively extract Mel-Frequency Cepstral Coefficient (MFCC) X, Y and logarithmic base audio Frequency parameter logf of source speaker and target speaker0XAnd log f0Y. Wherein AHOcoder is a high-performance voice analysis and synthesis tool constructed by the group of Aholoab Signal processing laboratory scholars of Bilbao (Bilbao) Spanish;
and 3, performing voice characteristic parameter alignment and Dynamic Time Warping (DTW) combining unit selection (UnitSelection) and Vocal Tract Length normalization (VTLN, Vocal Tract Length No. 6 normalization) on the MFCC parameters X, Y of the source and target voices in the step 2. The specific process of aligning the voice characteristic parameters is as follows:
3-1) performing vocal tract length normalization on the source speech feature parameters by adopting a bilinear frequency warping method, so that the formants of the source speech become closer to those of the target speech, thereby increasing the accuracy of the unit selection of target speech feature parameters.
3-2) for a given set of N source speech feature parameter vectors {X_k}, N target speech feature parameter vectors {Y_k} can be dynamically searched through formula (1) so that the distance cost function value C({Y_k}) is minimized. Two factors are considered in the unit-selection process: on the one hand, the spectral distance between the aligned source speech feature parameter vectors and the target speech feature parameter vectors should be minimized, so as to improve the matching of phoneme information; on the other hand, the selected target speech feature parameter vectors should be continuous across frames, so that the phoneme information is more complete.
C({Y_k}) = C_1({Y_k}) + C_2({Y_k})    (1)
Wherein C_1({Y_k}) and C_2({Y_k}) may be respectively represented by:

C_1({Y_k}) = (1 − γ) Σ_{k=1}^{N} D(X_k, Y_k)    (2)

C_2({Y_k}) = γ Σ_{k=2}^{N} D(Y_k, Y_{k−1})    (3)

wherein the function D(X_k, Y_k) represents the spectral distance between the source and target feature parameter vectors, and the Euclidean distance is used as the distance measure in the invention. The parameter γ represents the balance coefficient between the accuracy of feature parameter frame alignment and the continuity between frames, with 0 ≤ γ ≤ 1. C_1({Y_k}) is the spectral-distance cost function between the source speech feature parameter vectors and the target speech feature parameter vectors, and C_2({Y_k}) represents the spectral-distance cost function between the unit-selected target speech feature parameter vectors.
3-3) obtaining, by carrying out multiple linear regression analysis on formula (1), the feature parameter sequence set {Ŷ_k} aligned with the source speech feature parameter vectors, namely:

{Ŷ_k} = argmin_{Y_k} C({Y_k})    (4)
for the solution of equation (4), a Viterbi (Viterbi) search method may be used to optimize the execution efficiency of the algorithm.
The non-parallel MFCC parameters X, Y are transformed to be parallel through the steps described above.
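A minimal sketch of this alignment step is given below. It assumes the (1 − γ)/γ split of the two cost terms used in formulas (2)-(3) above, takes the Euclidean distance as D, and assumes the VTLN pre-warping of step 3-1 has already been applied to X; the Viterbi-style dynamic-programming search runs in O(N·M²), which is acceptable for corpus sizes such as ARCTIC.

```python
# A sketch of unit-selection alignment (steps 3-2 / 3-3): for each source frame X[k],
# pick a target frame so that (1-gamma)*sum_k D(X_k, Y_sel[k]) +
# gamma*sum_k D(Y_sel[k], Y_sel[k-1]) is minimised by a Viterbi search over target indices.
import numpy as np
from scipy.spatial.distance import cdist

def align_unit_selection(X, Y, gamma=0.3):
    N, M = len(X), len(Y)
    target_cost = (1.0 - gamma) * cdist(X, Y)     # N x M spectral (Euclidean) distances
    trans_cost = gamma * cdist(Y, Y)              # M x M continuity penalties
    acc = target_cost[0].copy()                   # accumulated cost at k = 0
    back = np.zeros((N, M), dtype=int)            # backpointers
    for k in range(1, N):
        total = acc[:, None] + trans_cost         # total[j_prev, j]
        back[k] = np.argmin(total, axis=0)
        acc = total[back[k], np.arange(M)] + target_cost[k]
    idx = np.empty(N, dtype=int)                  # backtrack the optimal index sequence
    idx[-1] = int(np.argmin(acc))
    for k in range(N - 1, 0, -1):
        idx[k - 1] = back[k, idx[k]]
    return Y[idx]                                 # target frames aligned to the source

# Y_aligned = align_unit_selection(X_warped, Y)   # (X, Y_aligned) now act as parallel data
```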
Step 4, establishing an adaptive Gaussian mixture model (AGMM), training it with the expectation-maximization (EM) algorithm, and obtaining the initial values for EM training with a K-means iteration method. The AGMM parameters λ and the matrix P(X|λ) are obtained by training.
As shown in fig. 2, training the AGMM parameters with the adaptive clustering algorithm requires comprehensively analyzing the weight coefficient, mean vector, and covariance matrix of each Gaussian component as well as the Euclidean distances between feature parameter vectors, and dynamically adjusting the number of Gaussian mixtures. The training process is as follows:
4-1) setting the initial AGMM mixture number M, the Gaussian component weight coefficient thresholds t1 and t2, the Euclidean distance threshold D between feature parameter vectors, and the covariance threshold σ.
4-2) obtaining the initial values for EM training by using a K-means iterative algorithm.
4-3) iterative training using EM algorithm.
The conventional Gaussian mixture model is represented as follows:

P(X|λ) = Σ_{i=1}^{M} P(w_i) N(X; μ_i, Σ_i)    (5)

wherein X is a P-dimensional speech feature parameter vector (the invention adopts P = 39), P(w_i) represents the weight coefficient of each Gaussian component, with

Σ_{i=1}^{M} P(w_i) = 1,

M is the number of Gaussian components, and N(X; μ_i, Σ_i) is the P-dimensional joint Gaussian probability distribution of a component, expressed as follows:

N(X; μ_i, Σ_i) = 1 / ((2π)^{P/2} |Σ_i|^{1/2}) · exp(−(1/2)(X − μ_i)^T Σ_i^{−1} (X − μ_i))    (6)

wherein μ_i is the mean vector and Σ_i is the covariance matrix; λ = {P(w_i), μ_i, Σ_i} is the set of model parameters of the GMM. The estimation of λ can be realized with the maximum likelihood (ML) estimation method, which maximizes the conditional probability P(X|λ); for the speech feature parameter vector set X = {x_n, n = 1, 2, ..., N} one has:

P(X|λ) = Π_{n=1}^{N} P(x_n|λ)    (7)
at this time:
λ = argmax_λ P(X|λ)    (8)
Equation (8) may be solved with the EM algorithm; during the EM computation the iteration condition P(X|λ_k) ≥ P(X|λ_{k−1}) must hold, where k is the iteration index, until the model parameter λ converges. In each iteration the weight coefficients P(w_i), mean vectors μ_i, and covariance matrices Σ_i of the Gaussian components are updated as follows:

P(w_i|x_n, λ) = P(w_i) N(x_n; μ_i, Σ_i) / Σ_{j=1}^{M} P(w_j) N(x_n; μ_j, Σ_j)    (9)

P(w_i) = (1/N) Σ_{n=1}^{N} P(w_i|x_n, λ)    (10)

μ_i = Σ_{n=1}^{N} P(w_i|x_n, λ) x_n / Σ_{n=1}^{N} P(w_i|x_n, λ)    (11)

Σ_i = Σ_{n=1}^{N} P(w_i|x_n, λ)(x_n − μ_i)(x_n − μ_i)^T / Σ_{n=1}^{N} P(w_i|x_n, λ)    (12)
4-4) if some Gaussian component N(P(w_i), μ_i, Σ_i) in the trained model has a weight coefficient smaller than t1 and the Euclidean distance to its nearest neighbouring component N(P(w_j), μ_j, Σ_j) is smaller than the threshold D, the two components are considered to carry little information and to be similar, and they can be merged:

(13) [merge rule giving the weight, mean vector, and covariance matrix of the merged component in terms of P(w_i), P(w_j), μ_i, μ_j, Σ_i, and Σ_j]

At this point the number of Gaussian components becomes M−1 and the procedure returns to step 4-3) for the next round of training; if several Gaussian components satisfy the merging condition, the pair with the smallest distance is selected for merging.
4-5) if some Gaussian component N(P(w_i), μ_i, Σ_i) in the trained model has a weight coefficient greater than t2 and the variance of at least one dimension of its covariance matrix (the diagonal elements of the covariance matrix are the variances) is greater than σ, the component is considered to contain too much information and should be split:

(14) [split rule defining the two new components from P(w_i), μ_i, Σ_i, the all-ones column vector E, and the adjustment factor n]

wherein E is a column vector of all ones and n is used to adjust the Gaussian distribution. After splitting, the number of Gaussian components becomes M+1; if several components satisfy the splitting condition, the one with the largest weight coefficient is selected for splitting, and the procedure returns to step 4-3) for the next round of training.
4-6) finishing the AGMM training to obtain the posterior conditional probability matrix P(X|λ), and storing λ.
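The following sketch illustrates this adaptive training loop, using scikit-learn's GaussianMixture for the K-means-initialised EM of steps 4-2/4-3 and wrapping it with the merge/split rules of steps 4-4/4-5. The threshold values and the exact merge (moment matching) and split (mean perturbation) formulas are assumptions standing in for equations (13)-(14), whose images could not be recovered.

```python
# A sketch of adaptive-GMM training (steps 4-1 .. 4-6): EM, then merge a light,
# close pair of components or split a heavy, high-variance one, and retrain.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_agmm(X, M=8, t1=0.02, t2=0.3, D=2.0, sigma=5.0, n_split=0.5, max_rounds=20):
    w = mu = cov = None                          # threshold values above are illustrative only
    for _ in range(max_rounds):
        gmm = GaussianMixture(n_components=M, covariance_type='full', max_iter=200,
                              init_params='kmeans',        # K-means start (step 4-2)
                              weights_init=w, means_init=mu,
                              precisions_init=None if cov is None else np.linalg.inv(cov))
        gmm.fit(X)                                          # EM iterations (step 4-3)
        w, mu, cov = gmm.weights_, gmm.means_, gmm.covariances_

        dist = np.linalg.norm(mu[:, None] - mu[None, :], axis=-1) + 1e9 * np.eye(M)
        i = int(np.argmin(w)); j = int(np.argmin(dist[i]))  # lightest component + neighbour
        big = int(np.argmax(w))                             # heaviest component

        if w[i] < t1 and dist[i, j] < D and M > 2:          # merge rule (step 4-4)
            wm = w[i] + w[j]
            mum = (w[i] * mu[i] + w[j] * mu[j]) / wm
            di, dj = mu[i] - mum, mu[j] - mum               # moment-matched covariance
            covm = (w[i] * (cov[i] + np.outer(di, di)) + w[j] * (cov[j] + np.outer(dj, dj))) / wm
            keep = [k for k in range(M) if k not in (i, j)]
            w = np.append(w[keep], wm)
            mu = np.vstack([mu[keep], mum])
            cov = np.concatenate([cov[keep], covm[None]])
            M -= 1
        elif w[big] > t2 and np.diag(cov[big]).max() > sigma:   # split rule (step 4-5)
            offset = n_split * np.sqrt(np.diag(cov[big]))       # perturb the mean
            mu = np.vstack([mu, mu[big] + offset])
            mu[big] = mu[big] - offset
            cov = np.concatenate([cov, cov[big][None]])
            w = np.append(w, w[big] / 2.0); w[big] /= 2.0
            M += 1
        else:
            break                                            # no rule fires: model is stable
    return gmm   # gmm.predict_proba(.) plays the role of the posterior matrix P(X|lambda)

# agmm = train_agmm(X_source_mfcc)
```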
Step 5, training with the source speech feature parameters X and target speech feature parameters Y obtained in step 3 and the posterior conditional probability matrix P(X|λ) obtained in step 4 to obtain the frequency warping factors and amplitude adjustment factors, thereby constructing the bilinear frequency warping (BLFW) plus amplitude adjustment (AS) speech conversion function, expressed as follows:
F(x) = W_{α(x,λ)} x + s(x, λ)    (15)

α(x, λ) = Σ_{m=1}^{M} p_m(x) α_m    (16)

s(x, λ) = Σ_{m=1}^{M} p_m(x) s_m    (17)

p_m(x) = P(w_m) N(x; μ_m, Σ_m) / Σ_{j=1}^{M} P(w_j) N(x; μ_j, Σ_j)    (18)
establishing a conversion relation between the source speech pitch frequency and the target speech pitch frequency:

log f0Y′ = μ_Y + (σ_Y / σ_X)(log f0X′ − μ_X)    (19)

wherein μ_X, σ_X² and μ_Y, σ_Y² respectively represent the mean and variance of the logarithmic pitch frequency log f0 of the source speech and the target speech.
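As a concrete reading of formula (19), the log-F0 conversion is a simple mean/variance linear transform between the two speakers' log-F0 statistics; the sketch below assumes exactly that form.

```python
# A minimal sketch of the fundamental-frequency conversion function (19):
# log f0_Y' = mu_Y + (sigma_Y / sigma_X) * (log f0_X' - mu_X)
import numpy as np

def convert_logf0(logf0_src, mu_x, sig_x, mu_y, sig_y):
    return mu_y + (sig_y / sig_x) * (np.asarray(logf0_src) - mu_x)

# Training-stage statistics (step 5), e.g. from the extracted voiced-frame log-F0 tracks:
# mu_x, sig_x = log_f0_X.mean(), log_f0_X.std()
# mu_y, sig_y = log_f0_Y.mean(), log_f0_Y.std()
```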
As shown in fig. 1, the conversion part includes the following steps:
step 6, inputting the voice of a source speaker to be converted;
step 7, extracting the 39-order MFCC feature parameters X′ of the source speaker's speech and the source speech log pitch frequency parameters log f0X′ by using the AHOcoder speech analysis model;
Step 8, using λ = {P(w_i), μ_i, Σ_i} obtained in the AGMM training of step 4, substituting the feature parameters X′ extracted in step 7 into formula (5) to obtain the posterior conditional probability matrix P′(X|λ);
step 9, respectively substituting the frequency warping factor α(x, λ) and the amplitude adjustment factor s(x, λ) obtained by the BLFW+AS training in step 5, together with the posterior conditional probability matrix P′(X|λ) obtained in step 8, into formulas (15), (16), (17), and (18) to obtain the MFCC feature parameters Y′ of the converted speech;
step 10, substituting the source speech log pitch frequency parameters log f0X′ obtained in step 7 into formula (19) to obtain the log pitch frequency parameters log f0Y′ of the converted speech;
Step 11, using the AHOdecoder speech synthesis model, taking Y′ from step 9 and log f0Y′ from step 10 as inputs, to obtain the converted speech.
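The conversion stage can be summarised by the sketch below. It assumes the AGMM from the training sketch, per-component warping factors alpha_m and amplitude offsets s_m already estimated by the BLFW+AS training, and the convert_logf0 helper above; the warp_matrix construction is a numerical cosine-basis stand-in for the patent's analytic bilinear-frequency-warping matrix, not a reproduction of it.

```python
# A sketch of the conversion stage (steps 7-11): posterior-weighted BLFW+AS spectral
# conversion following formulas (15)-(18), with a numerically built warping matrix.
import numpy as np

def bilinear_warp(omega, alpha):
    """First-order all-pass frequency warping omega -> omega_hat."""
    return omega + 2.0 * np.arctan(alpha * np.sin(omega) / (1.0 - alpha * np.cos(omega)))

def warp_matrix(alpha, dim=39, grid=512):
    """Numerical stand-in for W_alpha: warp a cosine-basis log-envelope representation."""
    omega = np.linspace(0.0, np.pi, grid)
    k = np.arange(dim)
    B = np.cos(np.outer(omega, k))                            # original cosine basis
    B_warped = np.cos(np.outer(bilinear_warp(omega, alpha), k))
    return np.linalg.pinv(B) @ B_warped                       # c_new = W @ c_old

def convert_frames(X_src, agmm, alpha_m, s_m):
    """Apply F(x) = W_{alpha(x,lambda)} x + s(x,lambda) frame by frame.
    alpha_m: (M,) warping factors; s_m: (M, dim) amplitude offsets per component."""
    P = agmm.predict_proba(X_src)                  # posterior matrix P'(X|lambda), step 8
    Y_conv = np.empty_like(X_src)
    for t, x in enumerate(X_src):
        alpha_x = float(P[t] @ alpha_m)            # weighted warping factor, eq. (16)
        s_x = P[t] @ s_m                           # weighted amplitude term, eq. (17)
        Y_conv[t] = warp_matrix(alpha_x, dim=x.shape[0]) @ x + s_x
    return Y_conv

# X_conv = convert_frames(X_prime, agmm, alpha_m, s_m)             # step 9
# logf0_conv = convert_logf0(logf0_src, mu_x, sig_x, mu_y, sig_y)  # step 10
# Step 11 (resynthesis from X_conv and logf0_conv) is done with AHOdecoder outside Python.
```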
Further, as shown in fig. 3, the spectrogram of the converted speech obtained by the method of the present invention is compared with that of the INCA method for the conversion direction F1-M2 (female voice 1 to male voice 2), which further verifies that the proposed method achieves higher spectral similarity than the INCA method. The INCA method was proposed in: Erro D, Moreno A, Bonafonte A. INCA algorithm for training voice conversion systems from nonparallel corpora [J]. IEEE Transactions on Audio, Speech, and Language Processing, 2010, 18(5): 944-953.
The foregoing is only a partial embodiment of the present invention. It should be noted that those skilled in the art can make various modifications and improvements without departing from the principle of the present invention, and such modifications and improvements shall also fall within the protection scope of the present invention.

Claims (7)

1. A speech conversion method under the condition of non-parallel texts based on adaptive Gaussian clustering is characterized by comprising a training stage and a conversion stage, wherein the training stage comprises the following steps:
step 1, inputting non-parallel training corpora of a source speaker and a target speaker;
step 2, using the AHOcoder speech analysis model, respectively extracting the MFCC feature parameters X of the source speaker's non-parallel training corpus, the MFCC feature parameters Y of the target speaker's non-parallel training corpus, the source speech log fundamental frequency log f0X, and the target speech log fundamental frequency log f0Y;
Step 3, performing speech characteristic parameter alignment and dynamic time warping combining unit selection and vocal tract length normalization on the MFCC characteristic parameters X, Y in the step 2, thereby converting the non-parallel linguistic data into parallel linguistic data;
step 4, performing adaptive Gaussian mixture model (AGMM) training by using the expectation-maximization (EM) algorithm, finishing the AGMM training to obtain the posterior conditional probability matrix P(X|λ), and storing the AGMM parameters λ;
step 5, utilizing the source speech feature parameters X and target speech feature parameters Y obtained in step 3 and the posterior conditional probability matrix P(X|λ) from step 4, carrying out bilinear frequency warping (BLFW) + amplitude adjustment (AS) training to obtain the frequency warping factor α(x, λ) and the amplitude adjustment factor s(x, λ), so as to construct the BLFW+AS conversion function; and utilizing the mean and variance of the log fundamental frequency to establish the fundamental frequency conversion function between the source speech log f0X and the target speech log f0Y;
the conversion phase comprises the following steps:
step 6, inputting the voice of a source speaker to be converted;
step 7, extracting the MFCC feature parameters X′ and the log fundamental frequency log f0X′ of the source speaker's speech by using the AHOcoder speech analysis model;
Step 8, using the parameters λ obtained in the AGMM training of step 4 to obtain the posterior conditional probability matrix P′(X|λ);
step 9, obtaining a converted MFCC characteristic parameter Y' by using the BLFW + AS conversion function obtained in the step 5;
step 10, using the fundamental frequency conversion function obtained in step 5 to convert the log fundamental frequency log f0X′, obtaining the converted log fundamental frequency log f0Y′;
Step 11, using the AHOdecoder speech synthesis model to synthesize the converted speech from the converted MFCC feature parameters Y′ and the converted log fundamental frequency log f0Y′.
2. The speech conversion method according to claim 1, wherein the step 3 comprises the following steps:
3-1) carrying out vocal tract length normalization on the source speech MFCC feature parameters by adopting a bilinear frequency warping method;
3-2) for a given set of N source speech MFCC feature parameter vectors {X_k}, dynamically searching N target speech feature parameter vectors {Y_k} through formula (1) such that the distance cost function value C({Y_k}) is minimized;
C({Y_k}) = C_1({Y_k}) + C_2({Y_k})    (1)
wherein C_1({Y_k}) and C_2({Y_k}) are respectively given by:

C_1({Y_k}) = (1 − γ) Σ_{k=1}^{N} D(X_k, Y_k)    (2)

C_2({Y_k}) = γ Σ_{k=2}^{N} D(Y_k, Y_{k−1})    (3)

wherein the function D(X_k, Y_k) represents the spectral distance between the source speech and target speech feature parameter vectors, the function D(Y_k, Y_{k−1}) represents the spectral distance between the unit-selected target speech feature parameter vectors, and the parameter γ represents the balance coefficient between the accuracy of feature parameter frame alignment and the continuity between frames, with 0 ≤ γ ≤ 1; C_1({Y_k}) represents the spectral-distance cost function between the source speech feature parameter vectors and the target speech feature parameter vectors, and C_2({Y_k}) represents the spectral-distance cost function between the unit-selected target speech feature parameter vectors;
3-3) obtaining, by carrying out multiple linear regression analysis on formula (1), the target speech feature parameter sequence set {Ŷ_k} aligned with the source speech feature parameter vectors, namely:

{Ŷ_k} = argmin_{Y_k} C({Y_k})    (4)
through the above steps, the MFCC features X and Y of the non-parallel corpora are transformed into an aligned feature parameter set analogous to a parallel corpus.
3. The speech conversion method according to claim 2, wherein for the solution of equation (4), a Viterbi search method is used to optimize the execution efficiency of the algorithm.
4. The speech conversion method of claim 1, wherein the training process of step 4 is as follows:
4-1) setting the initial AGMM mixture number M, the Gaussian component weight coefficient thresholds t1 and t2, the Euclidean distance threshold D between feature parameter vectors, and the covariance threshold σ;
4-2) obtaining initial values for EM training by using a K-means iterative algorithm;
4-3) carrying out iterative training by using an EM algorithm; the gaussian mixture model GMM is expressed as follows:
P(X|λ) = Σ_{i=1}^{M} P(w_i) N(X; μ_i, Σ_i)    (5)

wherein X is a P-dimensional speech feature parameter vector, P(w_i) represents the weight coefficient of each Gaussian component, with Σ_{i=1}^{M} P(w_i) = 1, M is the number of Gaussian components, and N(X; μ_i, Σ_i) is the P-dimensional joint Gaussian probability distribution of a component, expressed as follows:

N(X; μ_i, Σ_i) = 1 / ((2π)^{P/2} |Σ_i|^{1/2}) · exp(−(1/2)(X − μ_i)^T Σ_i^{−1} (X − μ_i))    (6)
wherein μ_i is the mean vector and Σ_i is the covariance matrix; λ = {P(w_i), μ_i, Σ_i} is the set of model parameters of the GMM, and the estimation of λ is realized by the maximum likelihood estimation method; for the speech feature parameter vector set X = {x_n, n = 1, 2, ..., N} one has:

P(X|λ) = Π_{n=1}^{N} P(x_n|λ)    (7)
at this time:
λ = argmax_λ P(X|λ)    (8)
equation (8) is solved using the EM algorithm; during the EM computation the iteration condition P(X|λ_k) ≥ P(X|λ_{k−1}) must hold, where k is the iteration index, until the model parameter λ is obtained; in each iteration the weight coefficients P(w_i), mean vectors μ_i, and covariance matrices Σ_i of the Gaussian components are updated as follows:

P(w_i|x_n, λ) = P(w_i) N(x_n; μ_i, Σ_i) / Σ_{j=1}^{M} P(w_j) N(x_n; μ_j, Σ_j)    (9)

P(w_i) = (1/N) Σ_{n=1}^{N} P(w_i|x_n, λ)    (10)

μ_i = Σ_{n=1}^{N} P(w_i|x_n, λ) x_n / Σ_{n=1}^{N} P(w_i|x_n, λ)    (11)

Σ_i = Σ_{n=1}^{N} P(w_i|x_n, λ)(x_n − μ_i)(x_n − μ_i)^T / Σ_{n=1}^{N} P(w_i|x_n, λ)    (12)
4-4) if some Gaussian component N(P(w_i), μ_i, Σ_i) in the trained model has a weight coefficient smaller than t1 and the Euclidean distance to its nearest neighbouring component N(P(w_j), μ_j, Σ_j) is smaller than the threshold D, the two components are merged:

(13) [merge rule giving the weight, mean vector, and covariance matrix of the merged component in terms of P(w_i), P(w_j), μ_i, μ_j, Σ_i, and Σ_j]

at this point the number of Gaussian components becomes M−1 and the procedure returns to step 4-3) for the next round of training; if several Gaussian components satisfy the merging condition, the pair with the smallest distance is selected for merging;
4-5) if some Gaussian component N(P(w_i), μ_i, Σ_i) in the trained model has a weight coefficient greater than t2 and the variance of at least one dimension of its covariance matrix is greater than σ, the component is considered to contain too much information and is split:

(14) [split rule defining the two new components from P(w_i), μ_i, Σ_i, the all-ones column vector E, and the adjustment factor n]

wherein E is a column vector of all ones and n is used to adjust the Gaussian distribution; after splitting, the number of Gaussian components becomes M+1; if several components satisfy the splitting condition, the one with the largest weight coefficient is selected for splitting, and the procedure returns to step 4-3) for the next round of training;
4-6) finishing the AGMM training to obtain the posterior conditional probability matrix P(X|λ), and storing λ.
5. The method of claim 4, wherein P = 39.
6. The speech conversion method of claim 1, wherein the BLFW+AS conversion function constructed in step 5 is expressed as follows:
F(x) = W_{α(x,λ)} x + s(x, λ)    (15)

α(x, λ) = Σ_{m=1}^{M} p_m(x) α_m    (16)

s(x, λ) = Σ_{m=1}^{M} p_m(x) s_m    (17)

p_m(x) = P(w_m) N(x; μ_m, Σ_m) / Σ_{j=1}^{M} P(w_j) N(x; μ_j, Σ_j)    (18)

wherein M is the number of Gaussian components of the Gaussian mixture model in step 4, p_m(x) represents the posterior probability of the m-th Gaussian component for the speech feature vector x in the AGMM model λ, α_m and s_m are respectively the frequency warping factor and the amplitude adjustment factor of the m-th Gaussian component in the AGMM model λ, α(x, λ) represents the weighted combination of the frequency warping factors of all Gaussian components of the AGMM model λ, and s(x, λ) represents the weighted combination of the amplitude adjustment factors of all Gaussian components of the AGMM model λ.
7. The speech conversion method of claim 1, wherein the conversion relationship between the source speech pitch frequency and the target speech pitch frequency is established in step 5:
log f0Y′ = μ_Y + (σ_Y / σ_X)(log f0X′ − μ_X)    (19)

wherein μ_X, σ_X² and μ_Y, σ_Y² respectively represent the mean and variance of the logarithmic pitch frequency log f0 of the source speech and the target speech.
CN201710474281.8A 2017-06-21 2017-06-21 Voice conversion method under non-parallel text condition based on self-adaptive Gaussian clustering Active CN107301859B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710474281.8A CN107301859B (en) 2017-06-21 2017-06-21 Voice conversion method under non-parallel text condition based on self-adaptive Gaussian clustering

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710474281.8A CN107301859B (en) 2017-06-21 2017-06-21 Voice conversion method under non-parallel text condition based on self-adaptive Gaussian clustering

Publications (2)

Publication Number Publication Date
CN107301859A CN107301859A (en) 2017-10-27
CN107301859B true CN107301859B (en) 2020-02-21

Family

ID=60136451

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710474281.8A Active CN107301859B (en) 2017-06-21 2017-06-21 Voice conversion method under non-parallel text condition based on self-adaptive Gaussian clustering

Country Status (1)

Country Link
CN (1) CN107301859B (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107945791B (en) * 2017-12-05 2021-07-20 华南理工大学 Voice recognition method based on deep learning target detection
CN108198566B (en) * 2018-01-24 2021-07-20 咪咕文化科技有限公司 Information processing method and device, electronic device and storage medium
CN108777140B (en) * 2018-04-27 2020-07-28 南京邮电大学 Voice conversion method based on VAE under non-parallel corpus training
CN109671423B (en) * 2018-05-03 2023-06-02 南京邮电大学 Non-parallel text-to-speech conversion method under limited training data
CN110556092A (en) * 2018-05-15 2019-12-10 中兴通讯股份有限公司 Speech synthesis method and device, storage medium and electronic device
US11605371B2 (en) * 2018-06-19 2023-03-14 Georgetown University Method and system for parametric speech synthesis
CN109377978B (en) * 2018-11-12 2021-01-26 南京邮电大学 Many-to-many speaker conversion method based on i vector under non-parallel text condition
CN109326283B (en) * 2018-11-23 2021-01-26 南京邮电大学 Many-to-many voice conversion method based on text encoder under non-parallel text condition
CN109671442B (en) * 2019-01-14 2023-02-28 南京邮电大学 Many-to-many speaker conversion method based on STARGAN and x vectors
CN110164463B (en) * 2019-05-23 2021-09-10 北京达佳互联信息技术有限公司 Voice conversion method and device, electronic equipment and storage medium
CN110782908B (en) * 2019-11-05 2020-06-16 广州欢聊网络科技有限公司 Audio signal processing method and device
CN111640453B (en) * 2020-05-13 2023-06-16 广州国音智能科技有限公司 Spectrogram matching method, device, equipment and computer readable storage medium
US20210383790A1 (en) * 2020-06-05 2021-12-09 Google Llc Training speech synthesis neural networks using energy scores
CN113112999B (en) * 2021-05-28 2022-07-12 宁夏理工学院 Short word and sentence voice recognition method and system based on DTW and GMM

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2003022088A (en) * 2001-07-10 2003-01-24 Sharp Corp Device and method for speaker's features extraction, voice recognition device, and program recording medium
CN101751921A (en) * 2009-12-16 2010-06-23 南京邮电大学 Real-time voice conversion method under conditions of minimal amount of training data
CN102063899A (en) * 2010-10-27 2011-05-18 南京邮电大学 Method for voice conversion under unparallel text condition
CN103280224A (en) * 2013-04-24 2013-09-04 东南大学 Voice conversion method under asymmetric corpus condition on basis of adaptive algorithm

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2003022088A (en) * 2001-07-10 2003-01-24 Sharp Corp Device and method for speaker's features extraction, voice recognition device, and program recording medium
CN101751921A (en) * 2009-12-16 2010-06-23 南京邮电大学 Real-time voice conversion method under conditions of minimal amount of training data
CN102063899A (en) * 2010-10-27 2011-05-18 南京邮电大学 Method for voice conversion under unparallel text condition
CN103280224A (en) * 2013-04-24 2013-09-04 东南大学 Voice conversion method under asymmetric corpus condition on basis of adaptive algorithm

Also Published As

Publication number Publication date
CN107301859A (en) 2017-10-27

Similar Documents

Publication Publication Date Title
CN107301859B (en) Voice conversion method under non-parallel text condition based on self-adaptive Gaussian clustering
Liu et al. Voice Conversion Across Arbitrary Speakers Based on a Single Target-Speaker Utterance.
Sun et al. Personalized, Cross-Lingual TTS Using Phonetic Posteriorgrams.
Chen et al. Cross-Lingual, Multi-Speaker Text-To-Speech Synthesis Using Neural Speaker Embedding.
US7957959B2 (en) Method and apparatus for processing speech data with classification models
Siu et al. Unsupervised training of an HMM-based self-organizing unit recognizer with applications to topic classification and keyword discovery
US9355642B2 (en) Speaker recognition method through emotional model synthesis based on neighbors preserving principle
US8423364B2 (en) Generic framework for large-margin MCE training in speech recognition
CN107103914B (en) High-quality voice conversion method
Wang et al. Accent and speaker disentanglement in many-to-many voice conversion
Thai et al. Synthetic data augmentation for improving low-resource ASR
CN106297769B (en) A kind of distinctive feature extracting method applied to languages identification
Hanilçi et al. Comparison of the impact of some Minkowski metrics on VQ/GMM based speaker recognition
CN107068165B (en) Voice conversion method
Metze et al. The 2010 CMU GALE speech-to-text system.
Miyajima et al. A new approach to designing a feature extractor in speaker identification based on discriminative feature extraction
Dusan Estimation of speaker's height and vocal tract length from speech signal.
Li et al. The Hokkien isolated word recognition system based on FPGA
Gonzalez-Rodriguez Speaker recognition using temporal contours in linguistic units: The case of formant and formant-bandwidth trajectories
Thomson et al. Use of voicing features in HMM-based speech recognition
Ma et al. Speaker cluster based GMM tokenization for speaker recognition.
Laskar et al. HiLAM-state discriminative multi-task deep neural network in dynamic time warping framework for text-dependent speaker verification
CN108510995B (en) Identity information hiding method facing voice communication
Song et al. Experimental study of discriminative adaptive training and MLLR for automatic pronunciation evaluation
Qiao et al. HMM-based sequence-to-frame mapping for voice conversion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant