CN103345920B - Self-adaptation interpolation weighted spectrum model voice conversion and reconstructing method based on Mel-KSVD sparse representation - Google Patents


Info

Publication number
CN103345920B
CN103345920B (granted publication of application CN201310211046.3A)
Authority
CN
China
Prior art keywords
dictionary
model
voice
ksvd
mel
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201310211046.3A
Other languages
Chinese (zh)
Other versions
CN103345920A (en)
Inventor
汤一彬 (Tang Yibin)
沈媛 (Shen Yuan)
朱昌平 (Zhu Changping)
周浩 (Zhou Hao)
高远 (Gao Yuan)
单鸣雷 (Shan Minglei)
姚澄 (Yao Cheng)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Changzhou Campus of Hohai University
Original Assignee
Changzhou Campus of Hohai University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Changzhou Campus of Hohai University filed Critical Changzhou Campus of Hohai University
Priority to CN201310211046.3A priority Critical patent/CN103345920B/en
Publication of CN103345920A publication Critical patent/CN103345920A/en
Application granted granted Critical
Publication of CN103345920B publication Critical patent/CN103345920B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical


Abstract

The invention belongs to the field of speech signal processing and discloses a voice conversion and reconstruction method for an adaptive-interpolation weighted-spectrum model based on Mel-KSVD sparse representation. The method explicitly addresses compression of the model parameters: after the smooth power spectrum is extracted in the speech-analysis stage, the Mel-KSVD method represents the extracted smooth power-spectrum parameters with sparse coefficients, and during this sparse representation the dictionary is continuously updated through an adaptive dictionary-learning strategy while the sparse coefficients are optimized. Simulation results show that, even with fewer sparse coefficients, the synthesized speech quality of the proposed model is overall equal to or better than that of the traditional model, and for male speech it even surpasses the traditional K-SVD sparse-representation model. In addition, the model achieves better speech-synthesis quality than a Mel-frequency cepstral coefficient (MFCC) compression model.

Description

Voice conversion and reconstruction method for an adaptive-interpolation weighted-spectrum model based on Mel-KSVD sparse representation
Technical field
The invention belongs to the field of speech signal processing and relates to a voice conversion and reconstruction model, in particular to a voice conversion and reconstruction method for an adaptive-interpolation weighted-spectrum model based on Mel-KSVD sparse representation.
Background technology
Speech parameterization and reconstruction is an important and challenging problem, and the corresponding speech analysis-synthesis systems are widely used in many fields, such as speech coding and voice conversion.
In a paper on voice conversion and reconstruction based on an adaptive-interpolation weighted-spectrum model, published in April 1999, H. Kawahara et al. presented a model (STRAIGHT) that abandons the glottis/vocal-tract structure of traditional speech models and directly extracts the power spectrum of speech, achieving high-quality synthesis. It has gradually become the mainstream speech analysis-synthesis model and is widely used in speech synthesis, voice conversion, and related applications. Like a vocoder, it characterizes the speech signal with the source-filter paradigm: the signal is regarded as the output of an excitation passed through a time-varying linear filter. After the power spectrum of each frame is obtained by analysis, it is smoothed in the time-frequency domain while being oversampled along both the time and frequency axes, which guarantees high-quality reconstruction of the speech at the synthesis stage.
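As an illustration of the analysis stage only, the sketch below computes per-frame power spectra and applies a crude frequency-domain smoothing. The Hanning window, moving-average kernel, and parameter defaults are assumptions for illustration; STRAIGHT's actual pitch-adaptive time-frequency smoothing is considerably more elaborate.

```python
import numpy as np

def frame_power_spectra(signal, fs=8000, frame_ms=30, shift_ms=1, nfft=1024):
    """Per-frame power spectra with a simple spectral smoothing pass.

    Illustrative only: STRAIGHT uses pitch-adaptive time-frequency
    smoothing; here a Hanning window and a moving average stand in for it.
    """
    frame_len = int(fs * frame_ms / 1000)
    shift = max(1, int(fs * shift_ms / 1000))
    window = np.hanning(frame_len)
    spectra = []
    for start in range(0, len(signal) - frame_len + 1, shift):
        frame = signal[start:start + frame_len] * window
        power = np.abs(np.fft.rfft(frame, nfft)) ** 2
        # crude smoothing along frequency (stand-in for STRAIGHT's smoothing)
        kernel = np.ones(5) / 5.0
        spectra.append(np.convolve(power, kernel, mode='same'))
    return np.array(spectra).T  # columns y_k form the matrix Y

```

With the paper's settings (8 kHz sampling, 30 ms frames, 1 ms shift, 1024-point FFT), each column of the returned matrix is one smooth power-spectrum vector y_k.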
In recent years, sparse representation theory has developed rapidly and has been applied in many areas, such as image denoising, blind source separation, and speech enhancement. In all of these applications the goal is to obtain the sparse coefficients in the sparse domain that characterize the internal features of the signal. The STRAIGHT model itself, however, has some defects: the smooth power-spectrum envelope parameters it extracts contain considerable redundant information, and the model deserves further refinement. Yet few researchers have focused on improving the STRAIGHT model; how to combine it with sparse representation theory and further compress the model parameters has therefore become a key issue limiting the model's further application and development.
Summary of the invention
The object of the invention is to overcome the above problems and provide a voice conversion and reconstruction method for an adaptive-interpolation weighted-spectrum model based on Mel-KSVD sparse representation which, while keeping the synthesized speech quality essentially unchanged, combines the STRAIGHT model with sparse representation theory, further compresses the model output parameters, reduces the number of transmitted parameters and the computational load of the STRAIGHT model, and thereby improves the synthesis quality of the speech.
The technical scheme of the invention starts from the following observation: STRAIGHT is a power-spectrum-based speech model, and its smooth power-spectrum parameter is a power spectrum after time-frequency compensation that contains a certain amount of redundant information. The model's output parameters are therefore compressed by the Mel-KSVD method and sparsely represented, and speech is finally synthesized from the resulting sparse coefficients, which reduces the number of transmitted parameters and the computational load of the STRAIGHT model.
Technical scheme of the present invention is as follows:
A voice conversion and reconstruction method for an adaptive-interpolation weighted-spectrum model based on Mel-KSVD sparse representation, characterized in that the Mel-KSVD method is used to sparsely represent the smooth power-spectrum parameters extracted by the STRAIGHT analysis model, comprising the following steps:
(1) The speech signal to be synthesized is input, and its smooth spectrum is extracted by the STRAIGHT analysis model: a time-frequency compensation method first extracts the power spectrum, which is then given low-band compensation and over-smoothing compensation; finally the silent frames of the power spectrum are processed to obtain the smooth power spectrum, whose parameters form a data matrix Y = [y_1, …, y_M];
(2) The extracted smooth power-spectrum parameters are passed through the Mel filterbank for dictionary training, and the Mel-KSVD algorithm then optimizes the parameters D and X in

min_{D,X} ( ||M(Y − DX)||_F^2 + λ Σ_{i=1}^{M} ||x_i||_0 ), subject to ||M(Y − DX)||_F^2 ≤ ε,

where M is the coefficient matrix of the Mel filterbank, Y = [y_1, …, y_M] is the power-spectrum parameter matrix, D = [d_1, …, d_K] is the target training dictionary, d_i denotes a dictionary atom, x_k is the sparse vector obtained by projecting y_k onto D, X = [x_1, …, x_M], ||·||_F is the Frobenius norm, and ||·||_0 is the 0-norm;
(3) Using the optimized target training dictionary D̂, the Mel filterbank, and the Mel-KSVD algorithm, the smooth-spectrum parameters of the speech to be synthesized obtained by the STRAIGHT analysis model are sparsely represented to yield the sparse vectors x_k; the resulting sparse coefficient matrix X = [x_1, …, x_M] is passed to the STRAIGHT synthesis model, which synthesizes speech from the estimated power-spectrum parameter matrix Ŷ, computed as ŷ_k = D̂ x_k, k = 1, 2, …, M.
In a further technical scheme, the algorithm of step (2) carries out the optimization of D and X in the above formula, subject to ||M(Y − DX)||_F^2 ≤ ε, as follows:
(2a) In the dictionary-training stage the target dictionary D is tied to the reconstruction error; the product MD in the objective function is regarded as a composite dictionary D_eq, and the optimization of the atom d_eq,k in D_eq reduces to

<d_eq,k, δ_k> = argmin_{d_eq,k, δ_k} ||E_eq,k − d_eq,k δ_k||_F^2,

where E_eq,k is the reconstruction error with the contribution of atom k removed, d_eq,k is the k-th column of D_eq, and δ_k is the k-th row of X;
(2b) Singular value decomposition is applied to the above problem:

E_eq,k = U Σ V^T,
d̃_eq,k = U(:,1),
δ̃_k = Σ(1,1) · V(:,1),

where U and V are unitary matrices, Σ is the diagonal matrix of singular values of E_eq,k, U(:,1) and V(:,1) are the first columns of U and V, and Σ(1,1) is the largest singular value;
The best dictionary atom is thus the updated d̃_eq,k, with the corresponding coefficient row updated to δ̃_k;
When, for all k = 1, 2, …, M, the iteration of sparse coding and dictionary updating leaves the reconstruction error essentially unchanged, the optimization of D stops; the dictionary obtained at that point is the best dictionary D̂. The sparse coefficient matrix X = [x_1, …, x_M] and the corresponding dictionary D̂ are passed to step (3); otherwise steps (2a) and (2b) are repeated.
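Step (2b) can be checked numerically: by the Eckart-Young theorem, the rank-1 product of the updated atom and coefficient row is the closest rank-1 matrix to E_eq,k in Frobenius norm. A minimal sketch, using a random matrix as a stand-in for E_eq,k:

```python
import numpy as np

def rank1_update(E):
    """Step (2b): best rank-1 factorization of the error matrix E_eq,k.

    Returns the updated atom d~_eq,k = U(:,1) and coefficient row
    delta~_k = Sigma(1,1) * V(:,1) from the SVD E = U Sigma V^T.
    """
    U, S, Vt = np.linalg.svd(E, full_matrices=False)
    return U[:, 0], S[0] * Vt[0, :]

# random stand-in for the restricted reconstruction error E_eq,k
E = np.random.default_rng(42).standard_normal((6, 9))
d_new, delta_new = rank1_update(E)
S = np.linalg.svd(E, compute_uv=False)
# residual energy equals the sum of the squared remaining singular values
residual = np.linalg.norm(E - np.outer(d_new, delta_new))
```

Because the updated atom is a left singular vector, it is automatically unit-norm, which is the normalization K-SVD requires of dictionary atoms.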
Beneficial effects achieved by the invention:
The invention combines the Mel-KSVD sparse representation method with the adaptive-interpolation weighted-spectrum (STRAIGHT) voice conversion and reconstruction model. By further compressing the smooth power-spectrum parameters extracted by the STRAIGHT analysis model, the sparse coefficients obtained through Mel-KSVD sparse representation are passed to the STRAIGHT synthesis model to reconstruct and synthesize the speech signal. Compared with the traditional K-SVD model, the proposed model achieves broadly comparable synthesized speech quality and is even better on male speech. In addition, because the smooth power-spectrum parameters are compressed after the analysis stage, the number of transmitted parameters is reduced and the computational load of the model is greatly decreased.
Accompanying drawing explanation
Fig. 1 is a block diagram of the voice conversion and reconstruction model of the adaptive-interpolation weighted spectrum based on Mel-KSVD sparse representation according to the invention;
Fig. 2 shows spectrograms of male and female speech synthesized by the invention, the first row for male speech and the second row for female speech;
Fig. 3 is a statistical chart of speech quality comparing the method of the invention with two other methods;
Fig. 4 is a statistical chart of synthesized speech quality for different numbers of dictionary atoms in the invention.
Embodiment
The voice conversion and reconstruction method of the adaptive-interpolation weighted-spectrum model based on Mel-KSVD sparse representation according to the invention is further elaborated below with reference to the accompanying drawings.
As shown in Fig. 1, the method first extracts the power-spectrum parameters of the training speech signal with the STRAIGHT analysis model, then uses the Mel-KSVD method to train the dictionary D adaptively while sparsely representing the power-spectrum parameters, iteratively updating the dictionary D and the sparse vectors x_k until the reconstruction error stabilizes below a given threshold, at which point the target dictionary D̂ and the sparse vectors x_k are output. The obtained target dictionary D̂ and sparse vectors x_k are then passed to the STRAIGHT synthesis model, which synthesizes the speech.
As shown in Fig. 1, the voice conversion and reconstruction method of the adaptive-interpolation weighted-spectrum model based on Mel-KSVD sparse representation comprises the following steps:
(1) The speech signal to be synthesized is input and its smooth spectrum is extracted by the STRAIGHT analysis model: a time-frequency compensation method first extracts the power spectrum, which is then given low-band compensation and over-smoothing compensation; finally its silent frames are processed to obtain the smooth power spectrum, whose parameters form a data matrix Y = [y_1, …, y_M].
(2) After the extracted smooth power spectrum passes through the Mel filterbank, the Mel-KSVD algorithm optimizes the parameters D and X in

min_{D,X} ( ||M(Y − DX)||_F^2 + λ Σ_{i=1}^{M} ||x_i||_0 ), subject to ||M(Y − DX)||_F^2 ≤ ε,

where M is the coefficient matrix of the Mel filterbank, Y = [y_1, …, y_M] is the power-spectrum parameter matrix, D = [d_1, …, d_K] is the target training dictionary, d_i denotes a dictionary atom, x_k is the sparse vector obtained by projecting y_k onto D, X = [x_1, …, x_M], ||·||_F is the Frobenius norm, and ||·||_0 is the 0-norm.
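The coefficient matrix M of the Mel filterbank can be built as a bank of triangular filters on the Mel scale. The patent does not specify the exact filter shapes, so the construction below (a common MFCC-style design) is an assumption:

```python
import numpy as np

def mel_filterbank(n_filters=70, nfft=1024, fs=8000):
    """Triangular Mel filterbank coefficient matrix M (n_filters x bins).

    A standard construction assumed for illustration; 70 filters matches
    the setting used in the experiments.
    """
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    n_bins = nfft // 2 + 1
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bin_pts = np.floor((nfft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    M = np.zeros((n_filters, n_bins))
    for i in range(1, n_filters + 1):
        left, center, right = bin_pts[i - 1], bin_pts[i], bin_pts[i + 1]
        for b in range(left, center):        # rising edge of triangle i
            M[i - 1, b] = (b - left) / max(center - left, 1)
        for b in range(center, right):       # falling edge of triangle i
            M[i - 1, b] = (right - b) / max(right - center, 1)
    return M
```

Because the filters are dense at low frequencies and sparse at high ones, weighting the error by M emphasizes the low band, which is consistent with the low-band behavior reported in the experiments below.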
According to step (2) above, the Mel-KSVD algorithm carries out the optimization of D and X in the formula, subject to ||M(Y − DX)||_F^2 ≤ ε, as follows:
(2a) In the dictionary-training stage the target dictionary D is tied to the reconstruction error. The product MD in the objective function is here regarded as a composite dictionary D_eq, and the optimization of the atom d_eq,k in D_eq can be reduced to

<d_eq,k, δ_k> = argmin_{d_eq,k, δ_k} ||E_eq,k − d_eq,k δ_k||_F^2,

where E_eq,k is the reconstruction error with the contribution of atom k removed, d_eq,k is the k-th column of D_eq, and δ_k is the k-th row of X.
(2b) Singular value decomposition (SVD) is applied to the above problem:

E_eq,k = U Σ V^T,
d̃_eq,k = U(:,1),
δ̃_k = Σ(1,1) · V(:,1),

where U and V are unitary matrices, Σ is the diagonal matrix of singular values of E_eq,k, U(:,1) and V(:,1) are the first columns of U and V, and Σ(1,1) is the largest singular value.
Therefore the best dictionary atom is the updated d̃_eq,k, with the corresponding coefficient row updated to δ̃_k.
When, for all k = 1, 2, …, M, the iteration of sparse coding and dictionary updating leaves the reconstruction error almost unchanged, the optimization of the dictionary D stops; the dictionary obtained at that point is the best dictionary D̂. The sparse coefficient matrix X = [x_1, …, x_M] and the corresponding dictionary D̂ are passed to step (3); otherwise steps (2a) and (2b) are repeated.
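Putting steps (2a)-(2b) together, the loop below sketches the Mel-KSVD alternation: orthogonal matching pursuit (OMP, a standard sparse-coding choice the patent does not name explicitly) over the composite dictionary D_eq = MD, followed by the SVD atom update. Dimensions and hyperparameters are illustrative, not the patent's values:

```python
import numpy as np

def omp(D, y, k_sparse):
    """Orthogonal Matching Pursuit: sparse-code y over dictionary D."""
    residual = y.copy()
    support = []
    x = np.zeros(D.shape[1])
    for _ in range(k_sparse):
        idx = int(np.argmax(np.abs(D.T @ residual)))
        if idx not in support:
            support.append(idx)
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coef
    x[support] = coef
    return x

def mel_ksvd(Y, M, n_atoms=8, k_sparse=3, n_iter=10, seed=0):
    """Sketch of the Mel-KSVD loop of steps (2a)-(2b).

    Treats M @ D as the composite dictionary D_eq and alternates OMP
    sparse coding with rank-1 SVD atom updates.
    """
    rng = np.random.default_rng(seed)
    Z = M @ Y                                   # Mel-weighted observations
    Deq = rng.standard_normal((Z.shape[0], n_atoms))
    Deq /= np.linalg.norm(Deq, axis=0)
    X = np.zeros((n_atoms, Y.shape[1]))
    for _ in range(n_iter):
        # sparse-coding stage
        for j in range(Y.shape[1]):
            X[:, j] = omp(Deq, Z[:, j], k_sparse)
        # dictionary-update stage, steps (2a)-(2b)
        for k in range(n_atoms):
            users = np.nonzero(X[k, :])[0]      # columns that use atom k
            if users.size == 0:
                continue
            E = (Z[:, users] - Deq @ X[:, users]
                 + np.outer(Deq[:, k], X[k, users]))
            U, S, Vt = np.linalg.svd(E, full_matrices=False)
            Deq[:, k] = U[:, 0]                 # d~_eq,k = U(:,1)
            X[k, users] = S[0] * Vt[0, :]       # delta~_k = Sigma(1,1)*V(:,1)
    return Deq, X
```

In practice the loop would terminate on the stabilization of ||M(Y − DX)||_F^2 rather than on a fixed iteration count; a fixed count keeps the sketch short.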
(3) Using the target training dictionary D̂ optimized by the dictionary-training module and the sparse vectors x_k, the sparse coefficient matrix X = [x_1, …, x_M] is passed to the STRAIGHT synthesis model to synthesize the speech. During synthesis the power-spectrum parameter matrix is estimated as Ŷ, with solution formula ŷ_k = D̂ x_k, k = 1, 2, …, M.
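The synthesis-side estimate of step (3) then reduces to a single matrix product, ŷ_k = D̂ x_k. A minimal sketch (the clipping to non-negative values is an added safeguard for power-spectrum validity, not stated in the patent):

```python
import numpy as np

def reconstruct_power_spectra(D_hat, X):
    """Step (3): estimate the power-spectrum matrix as y^_k = D^ x_k.

    Column k of the result is the smooth power spectrum handed to the
    STRAIGHT synthesis model; negative values (possible with an
    unconstrained dictionary) are clipped to zero.
    """
    Y_hat = D_hat @ X
    return np.maximum(Y_hat, 0.0)
```

Only X (and the fixed dictionary D̂) needs to be transmitted, which is the parameter-compression benefit the invention claims.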
The effect of the invention can be further illustrated by the following experiment:
1) experiment condition
In this experiment speech from the TIMIT corpus is used as experimental data; the sampling rate is 8 kHz, the frame length 30 ms, the frame shift 1 ms, and spectral analysis uses a 1024-point fast Fourier transform (FFT). Matlab R2011b serves as the simulation tool, on a computer with an Intel i5-3210 CPU and 4 GB of RAM.
2) experiment content
The experiment synthesizes male and female speech with the Mel-frequency cepstral coefficient (MFCC) compression method, the K-SVD sparse representation algorithm, and the proposed Mel-KSVD sparse representation algorithm, respectively, and compares the spectrograms of the speech synthesized by these methods with the spectrogram of the original speech. The synthesized-speech-quality statistics take the speech synthesized by the original STRAIGHT model as the baseline. Finally, speech-quality statistics and comparisons are also made for the proposed Mel-KSVD algorithm under different numbers of dictionary atoms.
First, the spectrograms of the synthesized male and female speech are compared, with results shown in Fig. 2: Figs. 2(a) and (e) are the original male and female speech, Figs. 2(b) and (f) the speech synthesized by the MFCC compression method, Figs. 2(c) and (g) the speech synthesized by the K-SVD sparse representation algorithm, and Figs. 2(d) and (h) the speech synthesized by the proposed Mel-KSVD sparse representation algorithm; the number of filters for MFCC and Mel-KSVD is set to 70, and the number of dictionary atoms for K-SVD and Mel-KSVD is set to 70.
Second, the quality of male and female speech synthesized by the three methods above is compared, with the number of filters for MFCC and Mel-KSVD set to 70 and the number of dictionary atoms for K-SVD and Mel-KSVD set to 90; the results are shown in Fig. 3.
Finally, the quality of male and female speech synthesized by the proposed Mel-KSVD algorithm under different numbers of dictionary atoms is compared, with the number of Mel-KSVD filters set to 70; the results are shown in Fig. 4.
3) interpretation
As can be seen from Fig. 2, compared with the MFCC compression method, both the proposed Mel-KSVD method and the traditional K-SVD algorithm synthesize low-band speech better, as indicated by the circled regions in the figure. As Fig. 2 also shows, the low-band synthesis of the Mel-KSVD method is comparable to that of the traditional K-SVD algorithm, mainly because the Mel filters are relatively dense in the low band. However, for male speech, whose harmonics are strong and regular, the invention outperforms the traditional K-SVD algorithm in the high band; for female speech, whose harmonics vary more, enhanced harmonics can make the synthesized speech sound mechanical, so the female speech quality produced by the invention is only slightly above that of speech synthesized by the traditional K-SVD algorithm.
In the speech-quality statistics for the different synthesis methods in Fig. 3, Perceptual Evaluation of Speech Quality (PESQ) is used as the objective evaluation index. As Fig. 3 shows, compared with the original STRAIGHT model, the proposed Mel-KSVD algorithm obtains higher PESQ scores for both male and female speech, an improvement of about 0.05 in each case. The smooth spectrum extracted by the STRAIGHT model is an accurate spectrogram in which the noise part is estimated from adjacent harmonics, i.e. noise is introduced into the extracted smooth spectrum; since the goal of sparse representation is to recover the principal components of the signal while ignoring noise-like components, much as in noise reduction, the proposed algorithm handles the noise introduced into the smooth spectrum comparatively well. As shown in Fig. 3, the synthesized speech quality of the invention is also better than that of the MFCC compression method, with PESQ improvements of nearly 0.1 for male speech and nearly 0.05 for female speech. And because the invention introduces auditory-perception-based Mel filters, its synthesized speech quality is slightly better than that of the traditional K-SVD method as well.
As can be seen from Fig. 4, with different numbers of dictionary atoms (30, 50, 70, and 90), the synthesized speech quality of the Mel-KSVD-based STRAIGHT model differs, and differs between male and female speech. For male speech the synthesis quality keeps improving as the number of atoms grows: with 90 dictionary atoms used to update the dictionary and sparsely represent the power-spectrum parameters, the synthesized quality is best, nearly 0.1 higher than with 30 atoms, because the sparse representation becomes more and more accurate as the number of atoms increases. For female speech, however, Fig. 4 shows that the synthesis is best with 70 atoms, and beyond 70 adding atoms actually degrades the synthesized quality, because an insufficiently sparse representation introduces too many noise components.
The above is only a preferred embodiment of the invention. It should be pointed out that those skilled in the art can make improvements and variations without departing from the technical principles of the invention, and such improvements and variations shall also be regarded as falling within the protection scope of the invention.

Claims (1)

1. A voice conversion and reconstruction method for an adaptive-interpolation weighted-spectrum model based on Mel-KSVD sparse representation, characterized in that the Mel-KSVD method is used to sparsely represent the smooth power-spectrum parameters extracted by the STRAIGHT analysis model, comprising the following steps:
(1) the speech signal to be synthesized is input and its smooth spectrum is extracted by the STRAIGHT analysis model: a time-frequency compensation method first extracts the power spectrum, which is then given low-band compensation and over-smoothing compensation; finally the silent frames of the power spectrum are processed to obtain the smooth power spectrum, whose parameters form a data matrix Y = [y_1, …, y_M];
(2) the extracted smooth power-spectrum parameters are passed through the Mel filterbank for dictionary training, and the Mel-KSVD algorithm then optimizes the parameters D and X in

min_{D,X} ( ||M(Y − DX)||_F^2 + λ Σ_{i=1}^{M} ||x_i||_0 ), subject to ||M(Y − DX)||_F^2 ≤ ε,

where M is the coefficient matrix of the Mel filterbank, Y = [y_1, …, y_M] is the power-spectrum parameter matrix, D = [d_1, …, d_K] is the target training dictionary, d_i denotes a dictionary atom, x_k is the sparse vector obtained by projecting y_k onto D, X = [x_1, …, x_M], ε is the reconstruction-error threshold, ||·||_F is the Frobenius norm, and ||·||_0 is the 0-norm;
(3) using the optimized target training dictionary D̂, the Mel filterbank, and the Mel-KSVD algorithm, the smooth-spectrum parameters of the speech to be synthesized obtained by the STRAIGHT analysis model are sparsely represented to yield the sparse vectors x_k; the resulting sparse coefficient matrix X = [x_1, …, x_M] is passed to the STRAIGHT synthesis model, which synthesizes speech from the estimated power-spectrum parameter matrix Ŷ, computed as ŷ_k = D̂ x_k, k = 1, 2, …, M;
the algorithm of step (2) carries out the optimization of D and X in the formula, subject to ||M(Y − DX)||_F^2 ≤ ε, as follows:
(2a) in the dictionary-training stage the target dictionary D is tied to the reconstruction error; the product MD in the objective function is regarded as a composite dictionary D_eq, and the optimization of the atom d_eq,k in D_eq reduces to

<d_eq,k, δ_k> = argmin_{d_eq,k, δ_k} ||E_eq,k − d_eq,k δ_k||_F^2,

where E_eq,k is the reconstruction error with the contribution of atom k removed, d_eq,k is the k-th column of D_eq, and δ_k is the k-th row of X;
(2b) singular value decomposition is applied to the above problem:

E_eq,k = U Σ V^T,
d̃_eq,k = U(:,1),
δ̃_k = Σ(1,1) · V(:,1),

where U and V are unitary matrices, Σ is the diagonal matrix of singular values of E_eq,k, U(:,1) and V(:,1) are the first columns of U and V, and Σ(1,1) is the largest singular value;
the best dictionary atom is thus the updated d̃_eq,k, with the corresponding coefficient row updated to δ̃_k;
when, for all k = 1, 2, …, M, the iteration of sparse coding and dictionary updating leaves the reconstruction error essentially unchanged, the optimization of D stops; the dictionary obtained at that point is the best dictionary D̂; the sparse coefficient matrix X = [x_1, …, x_M] and the corresponding dictionary D̂ are passed to step (3); otherwise steps (2a) and (2b) are repeated.
CN201310211046.3A 2013-05-29 2013-05-29 Self-adaptation interpolation weighted spectrum model voice conversion and reconstructing method based on Mel-KSVD sparse representation Expired - Fee Related CN103345920B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310211046.3A CN103345920B (en) 2013-05-29 2013-05-29 Self-adaptation interpolation weighted spectrum model voice conversion and reconstructing method based on Mel-KSVD sparse representation


Publications (2)

Publication Number Publication Date
CN103345920A CN103345920A (en) 2013-10-09
CN103345920B true CN103345920B (en) 2015-07-15

Family

ID=49280711

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310211046.3A Expired - Fee Related CN103345920B (en) 2013-05-29 2013-05-29 Self-adaptation interpolation weighted spectrum model voice conversion and reconstructing method based on Mel-KSVD sparse representation

Country Status (1)

Country Link
CN (1) CN103345920B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104240717B (en) * 2014-09-17 2017-04-26 河海大学常州校区 Voice enhancement method based on combination of sparse code and ideal binary system mask
CN106782599A (en) * 2016-12-21 2017-05-31 河海大学常州校区 The phonetics transfer method of post filtering is exported based on Gaussian process
CN108766450B (en) * 2018-04-16 2023-02-17 杭州电子科技大学 Voice conversion method based on harmonic impulse decomposition
CN110853679B (en) * 2019-10-23 2022-06-28 百度在线网络技术(北京)有限公司 Speech synthesis evaluation method and device, electronic equipment and readable storage medium
CN111507418B (en) * 2020-04-21 2022-09-06 中国科学技术大学 Encaustic tile quality detection method
CN117459187B (en) * 2023-12-25 2024-03-12 深圳市迈威数字电视器材有限公司 High-speed data transmission method based on optical fiber network

Citations (3)

Publication number Priority date Publication date Assignee Title
CN101950559A (en) * 2010-07-05 2011-01-19 李华东 Method for synthesizing continuous speech with large vocabulary and terminal equipment
CN102664021A (en) * 2012-04-20 2012-09-12 河海大学常州校区 Low-rate speech coding method based on speech power spectrum
CN102930863A (en) * 2012-10-19 2013-02-13 河海大学常州校区 Voice conversion and reconstruction method based on simplified self-adaptive interpolation weighting spectrum model

Family Cites Families (2)

Publication number Priority date Publication date Assignee Title
JP3515039B2 (en) * 2000-03-03 2004-04-05 沖電気工業株式会社 Pitch pattern control method in text-to-speech converter
JP5195652B2 (en) * 2008-06-11 2013-05-08 ソニー株式会社 Signal processing apparatus, signal processing method, and program


Also Published As

Publication number Publication date
CN103345920A (en) 2013-10-09

Similar Documents

Publication Publication Date Title
CN103345920B (en) Self-adaptation interpolation weighted spectrum model voice conversion and reconstructing method based on Mel-KSVD sparse representation
Cooke et al. Intelligibility-enhancing speech modifications: the hurricane challenge.
CN105957537B (en) One kind being based on L1/2The speech de-noising method and system of sparse constraint convolution Non-negative Matrix Factorization
CN108986834A (en) The blind Enhancement Method of bone conduction voice based on codec framework and recurrent neural network
CN110648684B (en) Bone conduction voice enhancement waveform generation method based on WaveNet
CN108447495A (en) A kind of deep learning sound enhancement method based on comprehensive characteristics collection
CN103531205A (en) Asymmetrical voice conversion method based on deep neural network feature mapping
CN1815552B (en) Frequency spectrum modelling and voice reinforcing method based on line spectrum frequency and its interorder differential parameter
CN105023580A (en) Unsupervised noise estimation and speech enhancement method based on separable deep automatic encoding technology
CN104900229A (en) Method for extracting mixed characteristic parameters of voice signals
CN101853661A (en) Noise spectrum estimation and voice mobility detection method based on unsupervised learning
CN102930863B (en) Voice conversion and reconstruction method based on simplified self-adaptive interpolation weighting spectrum model
CN104217730A (en) Artificial speech bandwidth expansion method and device based on K-SVD
CN107274887A Speaker's Further Feature Extraction method based on fusion feature MGFCC
Sadasivan et al. Joint dictionary training for bandwidth extension of speech signals
CN103093757B (en) Conversion method for conversion from narrow-band code stream to wide-band code stream
Takamichi et al. Parameter generation algorithm considering modulation spectrum for HMM-based speech synthesis
CN104240717B (en) Voice enhancement method based on combination of sparse code and ideal binary system mask
CN105679321A (en) Speech recognition method and device and terminal
CN113096680A (en) Far-field speech recognition method
Katsir et al. Evaluation of a speech bandwidth extension algorithm based on vocal tract shape estimation
He et al. Spectrum enhancement with sparse coding for robust speech recognition
Lian et al. Whisper to normal speech based on deep neural networks with MCC and F0 features
Liu et al. Spectral envelope estimation used for audio bandwidth extension based on RBF neural network
Zheng et al. Bandwidth extension WaveNet for bone-conducted speech enhancement

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20150715

Termination date: 20200529
