CN100440314C

CN100440314C - High quality real time sound changing method based on speech sound analysis and synthesis

Info

Publication number: CN100440314C
Application number: CNB2004100623371A
Authority: CN
Inventors: 孟猛; 张树武
Original assignee: Institute of Automation of Chinese Academy of Science
Current assignee: Institute of Automation of Chinese Academy of Science
Priority date: 2004-07-06
Filing date: 2004-07-06
Publication date: 2008-12-03
Anticipated expiration: 2024-07-06
Also published as: CN1719514A

Abstract

The present invention relates to a high-quality real-time voice changing method based on speech analysis and synthesis, and belongs to the field of voice conversion technology. The signal is interpolated or decimated in the time domain according to the required change in duration. It is then transformed from the time domain to the frequency domain, where the amplitude spectrum and the phase spectrum are processed separately so that the fundamental frequency and the formants are separated and adjusted independently; the effect of the duration adjustment on the fundamental frequency and the formants is compensated during this adjustment, and finally the time-domain signal is restored. The time-domain signal is converted to the frequency domain by the fast Fourier transform, the fundamental frequency and the formant positions of the speech are separated and adjusted individually, and the speech is resynthesized, so that duration, pitch, timbre and loudness can be adjusted and voice conversion is achieved. The method can process speech in real time. It can be used directly in entertainment applications such as Internet telephony and voice chat rooms, and in practical applications such as dubbing and music synthesis. It can also be used in speech synthesis to improve the overall quality of the synthesized speech.

Description

High-quality real-time voice changing method based on speech analysis and synthesis

Technical field

The present invention relates to the field of voice conversion technology, and in particular to a high-quality real-time voice changing method based on speech analysis and synthesis.

Background technology

Voice conversion technology is used to change acoustic features of speech such as pitch and speaking rate, so that new features can be produced as required. It has a wide range of practical applications, for example dubbing, music synthesis, Internet chat and voice privacy protection. The technology broadens the scope of speech processing research and makes the applications of speech processing more diverse.

The basic physical features of a voice are pitch, loudness, timbre and duration. Pitch is determined by the vibration frequency of the sounding body: the higher the frequency, the higher the sound, and the lower the frequency, the lower the sound. For example, the vocal cords of women and children are relatively short and thin, so they vibrate at a high frequency when speaking or singing, whereas the vocal cords of men and elderly people are longer and thicker and vibrate at a low frequency, so male and elderly voices sound deeper than female and child voices. Pitch can be changed by changing the fundamental frequency. Loudness corresponds to the strength of the sound and is determined by its amplitude, that is, by the magnitude of the vibration. Timbre, also called tone colour, is the character of a sound. It depends on the form of the acoustic vibration and is the most basic feature by which different sounds are told apart: a human voice, a piano and a violin performing the same tune each sound different. Formants reflect the prominent harmonic components of a sound, so the height, position and number of the formants are considered to affect the timbre. Duration is the length of the sound and is determined by how long the sounding body vibrates.

As basic elements of sound, pitch, loudness, timbre and duration are not independent of one another. In general, changing only one of them changes the others as well. For example, changing the sampling frequency at which a digital speech signal is played back changes the speaking rate, that is, the duration; but the fundamental frequency and the formant positions change at the same time, so what the listener hears is a change not only in speaking rate but also in timbre and pitch, and the speaker's characteristics change beyond recognition. As another example, if only the fundamental frequency is scaled proportionally, then after resynthesis the formant positions move with the fundamental frequency and the timbre changes as well. Voice conversion technology needs to solve these problems.

The present invention clarifies the relationship among these four factors and, by means of separation and compensation, adjusts pitch, loudness, timbre and duration independently. Speaker characteristics such as timbre, pitch and speaking rate can therefore be adjusted flexibly, giving high-quality simulation of multiple speaker identities (elderly person, child, adult male, girl, and so on).

Summary of the invention

The object of the present invention is to provide a high-quality real-time voice changing method based on speech analysis and synthesis.

The method is based on an understanding of the physical attributes of speech. By studying how changes in each attribute affect the speech, a digital signal processing method for changing the speaker identity characteristics of a voice is obtained. The invention is based on time-frequency analysis of digital signals. The length of the speech is changed by interpolation and decimation in the time domain. The time-domain signal is transformed to the frequency domain by the short-time Fourier transform, and the phase spectrum, the amplitude spectrum and the spectral envelope of the amplitude spectrum are adjusted, so that the fundamental frequency and the formant positions of the speech are separated and can be adjusted individually. Finally, the modified features are resynthesized into a speech signal whose characteristics have been changed, which achieves the voice change. The invention adjusts fundamental frequency, formant positions, duration and loudness independently, so that speaker characteristics such as timbre, pitch and speaking rate can be adjusted flexibly, giving high-quality simulation of multiple speaker sex and age characteristics (elderly person, child, adult male, girl, and so on).

A high-quality real-time voice changing method based on speech analysis and synthesis, using Fourier analysis and synthesis techniques, comprises the following steps: the signal is interpolated or decimated in the time domain according to the required change in duration; it is then transformed to the frequency domain, where the amplitude spectrum and the phase spectrum are processed separately; the fundamental frequency and the formants are separated and adjusted independently, and the effect of the duration adjustment on both is compensated during this adjustment; finally, the time-domain signal is restored. The method is fast and gives high-quality results, so it meets the requirements of both real-time operation and practicality.

The fundamental frequency and the formant positions of the signal are adjusted independently in the frequency domain. Because the two are separated, the fundamental frequency of the speech signal and its harmonics can be changed while the formant positions are either kept or adjusted at will, so that timbre and pitch are changed independently.

The duration of the speech signal is changed directly in the time domain: the digital signal is resampled by interpolation or decimation, which lengthens or shortens the time scale of the speech. The resulting changes in fundamental frequency and formant positions are then compensated, so that only the speaking rate is changed.

The energy of the signal is accumulated, and the energy ratio of the input and output signals is adjusted in real time, so that the loudness of the output signal can be changed flexibly.

The amplitude spectrum and the phase spectrum are adjusted separately. The spectral envelope of the amplitude spectrum is extracted and, on that basis, the shape of the spectral envelope of the new amplitude spectrum obtained after the fundamental frequency adjustment is changed, so that the formant positions can be adjusted at will without affecting the fundamental frequency.

The invention is based on speech analysis and synthesis techniques, as shown in Fig. 2.

A speech signal can be regarded as a short-time stationary signal, so it can be transformed to the frequency domain by the short-time Fourier transform for analysis and processing. The time window of the short-time Fourier transform cannot be too short: it should normally contain several fundamental periods. At the same time, because of the short-time stationarity constraint, it cannot be too long either, so that the physical characteristics do not change noticeably within one frame. For speech, the fundamental frequency of a male voice is relatively low, usually about 125 Hz, which corresponds to a fundamental period of about 8 ms; a window length of about 24 ms to 32 ms is therefore usually chosen. In digital signal processing, the length of the window function is the number of data samples in one frame, and it depends on the sampling rate of the speech signal.
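The relation between window duration, sampling rate and frame length can be sketched as follows. The function names and the example sampling rates are illustrative assumptions, not values taken from the patent.

```python
def frame_length(sample_rate_hz: int, window_ms: float) -> int:
    """Number of samples in one analysis frame of the given duration."""
    return round(sample_rate_hz * window_ms / 1000.0)


def periods_per_window(f0_hz: float, window_ms: float) -> float:
    """How many fundamental periods of a voice at f0_hz fit inside the window."""
    return f0_hz * window_ms / 1000.0


# Hypothetical sampling rates; a 125 Hz male voice has an 8 ms period,
# so a 24 ms to 32 ms window spans 3 to 4 periods.
for rate in (8000, 16000):
    print(rate, frame_length(rate, 24), frame_length(rate, 32))
```

At 16 kHz a 32 ms window gives 512 samples, which is also a convenient power-of-two FFT size.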

Performing the short-time Fourier transform is equivalent to first windowing the frame of speech, extending it periodically along the time axis, and then computing the Fourier series of the resulting signal. In other words, the periodically extended frame is expressed as a superposition of a set of complex sinusoids, and the Fourier coefficients obtained from the transform are the amplitudes of these complex sinusoidal components. If the frequency of every complex sinusoidal component is multiplied by the same scale factor p, then the fundamental frequency and the harmonic frequencies of the time-domain speech signal resynthesized by the inverse Fourier transform are also multiplied by p, which changes the fundamental frequency of the original signal.

In practice, the short-time Fourier transform is implemented by windowing followed by the fast Fourier transform (FFT). After the transform, in order to adjust the frequency of each component and resynthesize the time-domain signal with the inverse fast Fourier transform (IFFT), the Fourier coefficients are first converted from rectangular to polar coordinates, which gives the amplitude spectrum and the phase spectrum. This makes it convenient to separate the fundamental frequency from the formant positions, and it also makes the following equivalent implementation easy: instead of changing the frequency of a complex sinusoidal component from $f_1$ to $p f_1$, the amplitude and phase at a fixed frequency $f_2$ are replaced by the amplitude and phase of the component whose original frequency is $f_2/p$, so that the IFFT can be used directly for synthesis.
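For the amplitude spectrum, the equivalence described above amounts to reading the original spectrum at bin n / p when filling bin n. A minimal sketch follows, assuming linear interpolation between bins and zero outside the original range; both choices, and the function names, are assumptions made for illustration.

```python
def sample_linear(values, x):
    """Linearly interpolate `values` at fractional index x; 0.0 outside the range."""
    if x < 0 or x > len(values) - 1:
        return 0.0
    lo = int(x)
    hi = min(lo + 1, len(values) - 1)
    frac = x - lo
    return (1.0 - frac) * values[lo] + frac * values[hi]


def scale_frequency_axis(amplitude, p):
    """Return a spectrum whose bin n holds the original value found at bin n / p."""
    return [sample_linear(amplitude, n / p) for n in range(len(amplitude))]


# A single spectral peak at bin 4 moves to bin 8 for p = 2 and to bin 2 for p = 0.5.
spectrum = [0.0] * 12
spectrum[4] = 1.0
raised = scale_frequency_axis(spectrum, 2.0)
lowered = scale_frequency_axis(spectrum, 0.5)
```

Because every bin is rescaled by the same factor, any spectral envelope carried by the amplitude spectrum moves together with the harmonics.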

For the amplitude spectrum, the above processing only requires proportional interpolation or decimation of the original amplitude spectrum. For the phase spectrum, the phase must first be unwrapped, as shown in Fig. 3. Within a frame, when the frequency $f_1$ of a complex sinusoidal component is adjusted to $p f_1$, the phase change of that component within the frame also becomes p times the original change, and this phase change accumulates frame by frame into the initial phase of the following frames. To apply this adjustment to the phase spectrum, the phase difference of the unwrapped phase spectrum between two adjacent frames (that is, the phase change within the previous frame) is multiplied by p; the initial phase obtained by accumulation then also becomes p times the original.

Phase unwrapping method:

Suppose the time shift between two frames is $t_w$. For the complex sinusoidal component of frequency $f_k$, the theoretical phase change between time $t$ ($t > 1$) and the previous frame is

$$\Delta\Phi_k^{(t)} = 2\pi f_k t_w .$$

The actual difference in initial phase between the two frames is

$$\Delta\theta_k^{(t)} = \theta_k^{(t)} - \theta_k^{(t-1)} .$$

Define

$$\Delta\varphi_k^{(t)} = \left(\Delta\theta_k^{(t)} - \Delta\Phi_k^{(t)}\right) \bmod 2\pi + \Delta\Phi_k^{(t)} ,$$

Wherein

Thus $\Delta\varphi_k^{(t)}$ is the unwrapped phase change between the two adjacent frames at time $t$. Accumulating these changes gives the unwrapped initial phase at time $t$:

$$\begin{cases} \tilde{\theta}_k^{(t)} = \tilde{\theta}_k^{(t-1)} + \Delta\varphi_k^{(t)}, \\ \tilde{\theta}_k^{(1)} = \theta_k^{(1)} . \end{cases}$$
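The unwrapping recursion can be sketched for a single frequency bin as follows. The range used for the modulo operation is not recoverable from the text above, so this sketch assumes the common principal-value convention of mapping the deviation into [-π, π); the function names are illustrative.

```python
import math

TWO_PI = 2.0 * math.pi


def principal_value(x):
    """Map an angle into [-pi, pi). ASSUMPTION: the source's exact MOD range is not given."""
    return (x + math.pi) % TWO_PI - math.pi


def unwrapped_increment(theta_prev, theta_cur, f_k, t_w):
    """Unwrapped phase change between two adjacent frames for a bin centred at f_k."""
    expected = TWO_PI * f_k * t_w      # theoretical phase change
    measured = theta_cur - theta_prev  # difference of the wrapped initial phases
    return principal_value(measured - expected) + expected


def unwrap_track(thetas, f_k, t_w):
    """Accumulate increments; the first unwrapped phase equals the first measured phase."""
    unwrapped = [thetas[0]]
    for prev, cur in zip(thetas, thetas[1:]):
        unwrapped.append(unwrapped[-1] + unwrapped_increment(prev, cur, f_k, t_w))
    return unwrapped
```

With this convention, a component at 103 Hz observed in a bin centred at 100 Hz with a 10 ms frame shift yields an increment of 2π · 1.03 rad per frame, even though each measured phase is known only modulo 2π.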

As stated above, changing the fundamental frequency only requires proportional interpolation or decimation of the original amplitude spectrum. However, this also moves the formant positions by the same ratio. Another method is therefore needed to adjust the formants without affecting the fundamental frequency. The method used here achieves this by extracting the spectral envelope of the amplitude spectrum.

In the formulas below, $e(n)$ is the spectral envelope of the original amplitude spectrum before adjustment. After the fundamental frequency has been raised by a factor of p using the processing above, the spectral envelope becomes $\hat{e}(n)$, where

$$\hat{e}(n) = e\left(\frac{n}{p}\right).$$

Let $\hat{a}(n)$ be the amplitude spectrum after adjustment by interpolation and $\tilde{a}(n)$ the amplitude spectrum after formant compensation. Then

$$\tilde{a}(n) = \frac{e(n)}{\hat{e}(n)}\,\hat{a}(n) = \frac{e(n)}{e\left(\frac{n}{p}\right)}\,\hat{a}(n).$$

The compensated amplitude spectrum $\tilde{a}(n)$ obtained in this way keeps the spectral envelope $e(n)$ of the original amplitude spectrum, so the original formant positions stay where they were, while the frequency adjustment is not affected. By the same reasoning, $e(n)$ in the formula

$$\tilde{a}(n) = \frac{e(n)}{\hat{e}(n)}\,\hat{a}(n)$$

can be replaced by a new spectral envelope whose formants have been adjusted, which changes the formant positions.
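The compensation formula can be sketched as follows, given an amplitude spectrum and an envelope that has already been estimated. Linear interpolation between bins, the zero fill outside the original range, and the optional factor `f` (which swaps in the rescaled envelope e(n / f) as the new target) are assumptions made for illustration.

```python
def sample_linear(values, x):
    """Linearly interpolate `values` at fractional index x; 0.0 outside the range."""
    if x < 0 or x > len(values) - 1:
        return 0.0
    lo = int(x)
    hi = min(lo + 1, len(values) - 1)
    frac = x - lo
    return (1.0 - frac) * values[lo] + frac * values[hi]


def shift_pitch(amplitude, envelope, p, f=1.0):
    """Scale harmonics by p; keep the envelope (f = 1) or move it by f."""
    out = []
    for n in range(len(amplitude)):
        a_hat = sample_linear(amplitude, n / p)    # interpolated spectrum a(n / p)
        e_hat = sample_linear(envelope, n / p)     # envelope dragged along, e(n / p)
        e_target = sample_linear(envelope, n / f)  # envelope the output should carry
        out.append(e_target * a_hat / e_hat if e_hat > 0.0 else 0.0)
    return out
```

With `f = 1.0` the target envelope is e(n) itself, which is the case written out in the formula; any other value moves the formants while the harmonic spacing is still set by `p` alone.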

Common methods for obtaining the spectral envelope include linear predictive coding (LPC), cepstral analysis, low-pass filtering, the discrete cepstrum, and interpolation between local peaks. To meet the real-time requirement, the chosen method must have low complexity while still giving good results. This example uses an improved cepstral analysis method. Experiments show that it is stable, works for many types of voice, and meets practical requirements in both quality and computational cost.
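The improved cepstral method used in the example is not specified in the text. As a point of reference only, the plain cepstral approach (low-quefrency liftering of the log magnitude spectrum) can be sketched as below. The naive DFT is used to keep the sketch dependency-free; a real-time implementation would use an FFT. The function names and the `n_keep` parameter are assumptions.

```python
import cmath
import math


def dft(values, inverse=False):
    """Naive O(N^2) discrete Fourier transform (kept simple for clarity)."""
    n = len(values)
    sign = 1.0 if inverse else -1.0
    out = []
    for k in range(n):
        acc = sum(v * cmath.exp(sign * 2j * math.pi * k * m / n)
                  for m, v in enumerate(values))
        out.append(acc / n if inverse else acc)
    return out


def cepstral_envelope(magnitude, n_keep):
    """Smooth a full-length (symmetric) magnitude spectrum by keeping low quefrencies."""
    n = len(magnitude)
    log_mag = [math.log(max(m, 1e-12)) for m in magnitude]
    cepstrum = [c.real for c in dft(log_mag, inverse=True)]
    # Keep quefrencies 0..n_keep-1 and their mirrored counterparts; zero the rest.
    liftered = [c if (q < n_keep or q > n - n_keep) else 0.0
                for q, c in enumerate(cepstrum)]
    smooth_log = [v.real for v in dft(liftered)]
    return [math.exp(v) for v in smooth_log]
```

A small `n_keep` gives a smoother envelope; too large a value lets the harmonic ripple leak back into the envelope.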

The method above changes the fundamental frequency and the formant positions independently.

On this basis, independent adjustment of the duration is also easy to achieve.

As noted earlier, changing the sampling frequency at which a digital speech signal is played back changes the speaking rate, that is, the duration. The speech data can therefore first be interpolated or decimated in the time domain and then played back at the original sampling rate, which slows down or speeds up the speech. At the same time, however, the fundamental frequency and the formant positions change. If the time-domain signal is interpolated by a scale factor t, the fundamental period becomes t times the original and the fundamental frequency becomes 1/t of the original; the formant positions are likewise scaled by 1/t.

With the method described above for changing the fundamental frequency and the formant positions independently, both can now be compensated by the factor t at the same time, so that only the duration is changed.
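The effect of time-domain interpolation on the fundamental period can be seen in a small sketch. Linear interpolation is assumed here for simplicity; the text does not say which interpolation is used, and the function name is illustrative.

```python
def stretch_time(samples, t):
    """Resample by linear interpolation so the output is about t times as long."""
    n_out = int(round((len(samples) - 1) * t)) + 1
    out = []
    for m in range(n_out):
        x = min(m / t, len(samples) - 1)  # position in the original signal
        lo = int(x)
        hi = min(lo + 1, len(samples) - 1)
        frac = x - lo
        out.append((1.0 - frac) * samples[lo] + frac * samples[hi])
    return out


# A waveform with a period of 8 samples has a period of 16 samples after t = 2,
# so at the same playback rate its fundamental frequency is halved.
periodic = [0.0, 1.0, 2.0, 3.0, 4.0, 3.0, 2.0, 1.0] * 4
slowed = stretch_time(periodic, 2.0)
```

The subsequent frequency-domain step then has to scale the harmonics and the envelope by t to restore the original pitch and formants.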

As the discussion above shows, the three physical characteristics are adjusted in the order duration, fundamental frequency, formant position. Let their scale factors be t, p and f respectively. If the three features are adjusted one after another, the procedure is as follows. First the duration is adjusted by the factor t, and the fundamental frequency and formant positions are compensated by the factor t. Next the fundamental frequency is adjusted by the factor p, and the formant positions are compensated by the factor 1/p. Finally the formant positions are adjusted by the factor f. Overall, this is equivalent to first adjusting the duration by the factor t, then adjusting the fundamental frequency by the factor p*t, and finally adjusting the formant positions at that point by the factor f/p, so that the three physical characteristics are adjusted independently by t, p and f. In practice the formant adjustment can be simplified: the envelope only needs to be scaled by f*t relative to its initial position, as shown in Fig. 1.

All three adjustments are implemented by interpolation and decimation of sample points. To ensure good voice changing quality while still meeting the conversion requirements, each scale factor is limited to the range 0.5 to 2. Experimental results show that most adjustments within this range give satisfactory results. Note that when the formant positions and the fundamental frequency are adjusted, the compensation for the duration adjustment is applied at the same time, which can make these two adjustment ratios very large (well beyond a factor of 2) and cause much information to be lost or blurred. Therefore, when the adjustment ratio of the formant positions or the fundamental frequency is large, the duration should not be adjusted by a large amount at the same time.
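The way the three user-facing factors combine into the factors actually applied in the frequency domain can be sketched as follows. Checking the combined factors t*p and t*f against the same 0.5 to 2 range as the individual factors is an assumption of this sketch; the text only warns that they should not greatly exceed 2. All names are illustrative.

```python
def spectral_factors(t, p, f):
    """Factors applied in the frequency domain after time interpolation by t."""
    return {"harmonics": t * p, "envelope": t * f}


def net_change(t, p, f):
    """Net effect on each characteristic; time interpolation alone scales spectra by 1/t."""
    s = spectral_factors(t, p, f)
    return {"duration": t, "pitch": s["harmonics"] / t, "formants": s["envelope"] / t}


def out_of_range(t, p, f, low=0.5, high=2.0):
    """Names of the individual or combined factors that fall outside [low, high]."""
    values = {"t": t, "p": p, "f": f, "t*p": t * p, "t*f": t * f}
    return sorted(name for name, v in values.items() if not low <= v <= high)
```

For example, t = 1.5 with p = 1.8 keeps each factor inside the allowed range but gives a combined harmonic scaling of 2.7, which is the situation the paragraph above advises against.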

Loudness is adjusted as follows. Let $\Delta E_{i,n}$ and $\Delta E_{o,n}$ denote the energy of the n-th frame at the input (after the duration adjustment and before spectral analysis) and at the output (after the fundamental frequency and formants have been adjusted and the time-domain signal has been resynthesized), and let $E_{i,n}$ and $E_{o,n}$ denote the total energy of the input signal and of the output signal up to the n-th frame. Then

$$E_{i,n} = E_{i,n-1} + \Delta E_{i,n},$$

$$E_{o,n} = E_{o,n-1} + \Delta E_{o,n}.$$

Each data point $D_{n,k}$ of the n-th output frame is then adjusted according to

$$\hat{D}_{n,k} = D_{n,k} \cdot \sqrt{\frac{E_{i,n}}{E_{o,n}}}.$$

This formula ensures that the energy of the signal after the voice conversion is essentially the same as the energy before it, that is, the loudness is unchanged. If the loudness needs to be changed by a given ratio, the result only needs to be scaled further by that ratio.
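The running energy bookkeeping can be sketched as a small stateful helper. Taking the frame energy as the sum of squared samples is an assumption (the text does not define it explicitly), as are the class name and the `extra_gain` parameter used for a deliberate loudness change.

```python
import math


class LoudnessMatcher:
    """Keeps cumulative input and output energy and rescales each output frame."""

    def __init__(self):
        self.e_in = 0.0   # cumulative input energy up to the current frame
        self.e_out = 0.0  # cumulative output energy up to the current frame

    def process(self, frame_in, frame_out, extra_gain=1.0):
        # ASSUMPTION: frame energy is the sum of squared sample values.
        self.e_in += sum(x * x for x in frame_in)
        self.e_out += sum(y * y for y in frame_out)
        if self.e_out == 0.0:
            return list(frame_out)  # nothing to normalise yet
        gain = math.sqrt(self.e_in / self.e_out) * extra_gain
        return [y * gain for y in frame_out]
```

Because the gain is derived from cumulative rather than per-frame energy, it changes slowly from frame to frame, which avoids audible pumping at frame boundaries.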

Description of drawings

Fig. 1 is a detailed flowchart of the adjustment of duration, fundamental frequency and formant positions according to the invention;

Fig. 2 is a schematic flowchart of the signal analysis and synthesis steps according to the invention;

Fig. 3 is a schematic diagram of phase unwrapping according to the invention.

Detailed description

The steps of Fig. 1 for adjusting duration, fundamental frequency and formant positions are as follows:

Step S1-1: in the time domain, interpolate or decimate the data points of a frame according to the adjustment factor t;

Step S1-2: transform to the frequency domain and convert from rectangular to polar coordinates to obtain the phase spectrum I and the amplitude spectrum II;

Step S1-3: extract the envelope of amplitude spectrum II to obtain the envelope spectrum III, and scale III along the frequency axis by the adjustment factor t*f to obtain the envelope spectrum IV with adjusted formant positions;

Step S1-4: divide amplitude spectrum II point by point by envelope spectrum III to obtain spectrum V, scale the abscissa of spectrum V along the frequency axis by the adjustment factor t*p, and then multiply point by point by the adjusted envelope spectrum IV to obtain the adjusted amplitude spectrum VII;

Step S1-5: unwrap phase spectrum I using the phase difference with the adjacent frame to obtain the actual phase change at each frequency between the two frames, multiply this value by the adjustment factor t*p, scale the frequency axis by the adjustment factor t*p, and accumulate the adjusted phase differences to obtain the adjusted phase spectrum VIII of the current frame;

Step S1-6: convert the adjusted amplitude spectrum VII and phase spectrum VIII back to rectangular coordinates and transform back to the time domain.
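Step S1-5 is the least obvious of the steps above, so a literal sketch of it for one frame is given here. Several details are assumptions: bin k is taken to be centred at k times `bin_hz`, the deviation from the expected phase advance is reduced to [-π, π), and the frequency-axis scaling uses linear interpolation of the phase increments between bins. `factor` stands for t*p; all names are illustrative.

```python
import math

TWO_PI = 2.0 * math.pi


def principal_value(x):
    """Map an angle into [-pi, pi) (assumed convention)."""
    return (x + math.pi) % TWO_PI - math.pi


def sample_linear(values, x):
    """Linearly interpolate `values` at fractional index x; 0.0 outside the range."""
    if x < 0 or x > len(values) - 1:
        return 0.0
    lo = int(x)
    hi = min(lo + 1, len(values) - 1)
    frac = x - lo
    return (1.0 - frac) * values[lo] + frac * values[hi]


def adjust_phase_frame(theta_prev, theta_cur, out_prev, bin_hz, t_w, factor):
    """Return the adjusted phase spectrum of the current frame (spectrum VIII)."""
    n_bins = len(theta_cur)
    increments = []
    for k in range(n_bins):
        expected = TWO_PI * k * bin_hz * t_w
        deviation = principal_value(theta_cur[k] - theta_prev[k] - expected)
        increments.append(factor * (deviation + expected))  # scaled unwrapped change
    # Move each scaled increment to its new place on the frequency axis.
    rescaled = [sample_linear(increments, k / factor) for k in range(n_bins)]
    # Accumulate onto the adjusted phase spectrum of the previous frame.
    return [out_prev[k] + rescaled[k] for k in range(n_bins)]
```

The caller keeps `theta_cur` and the returned list as `theta_prev` and `out_prev` for the next frame.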

The speech signal analysis and synthesis of Fig. 2 comprises the following steps:

Step S2-1: process the signal in the time domain, including overlapping framing, interpolation and windowing;

Step S2-2: transform each frame obtained in the time domain to the frequency domain by a time-frequency transform, process the spectrum (including adjustment of the fundamental frequency and the formants), and return to the time domain by the inverse time-frequency transform;

Step S2-3: in the time domain, apply window function compensation to each frame, window it with the synthesis window function, and overlap-add the frames to obtain the complete time-domain signal.
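The framing and overlap-add parts of steps S2-1 and S2-3 can be sketched without any spectral processing in between, which makes it easy to check that the analysis and synthesis chain is transparent. The Hann window for both analysis and synthesis, and compensation by dividing by the accumulated squared window, are assumptions; the text does not name the windows or the compensation rule.

```python
import math


def hann(n):
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n) for i in range(n)]


def analyze(signal, frame_len, hop):
    """Split the signal into overlapping frames and apply the analysis window."""
    window = hann(frame_len)
    return [[signal[start + i] * window[i] for i in range(frame_len)]
            for start in range(0, len(signal) - frame_len + 1, hop)]


def overlap_add(frames, frame_len, hop, total_len):
    """Apply the synthesis window, overlap-add, and compensate for the windows."""
    window = hann(frame_len)
    out = [0.0] * total_len
    norm = [0.0] * total_len
    for j, frame in enumerate(frames):
        start = j * hop
        for i in range(frame_len):
            out[start + i] += frame[i] * window[i]
            norm[start + i] += window[i] * window[i]
    return [o / w if w > 1e-8 else 0.0 for o, w in zip(out, norm)]
```

In the full method the spectral processing of step S2-2 would be applied to each frame between `analyze` and `overlap_add`.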

For the phase unwrapping of Fig. 3, the unwrapping procedure is as described in detail above.

To simulate male, female, child and elderly voices and to convert between them, the invention adjusts each physical characteristic on the following basis.

In ordinary speech, the fundamental frequency of a child's voice is generally considered the highest, reaching about 300 Hz; a female voice averages roughly 220 Hz, and a male voice averages about 125 Hz. This gives the approximate ratios between the fundamental frequencies of male, female and child voices. In practice, a female-to-male fundamental frequency ratio between 1.5 and 1.8 usually gives good results, while the child-to-male ratio must be 1.8 or more. To simulate an elderly voice, the fundamental frequency usually has to be lowered to a ratio of about 0.6 to 0.9 to sound realistic.

As for the formants, the formants of male, female and child voices are roughly in the simple ratio 6 : 7 : 8. In reality, the ratio between male, female and child voices is not linear across formants of different frequency: the lower-frequency formants usually differ more, and the higher-frequency ones differ little. Under ordinary application conditions this can be ignored. An elderly voice can be regarded as leaning towards a male timbre, so a formant adjustment ratio below 1 is chosen for it.

When converting between male and female voices, the speaking rate is usually left unchanged; for elderly and child voices the speaking rate can be slowed slightly, which matches reality.
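The guidance above could be collected into a table of presets such as the following. The specific numbers are illustrative picks inside the stated ranges, not values from the patent: the pitch factors 1.65, 1.8 and 0.75, the elderly formant factor 0.95 and the slow-down factor 1.1 are all assumptions, as is the idea of inverting a preset by taking reciprocals.

```python
# t = duration factor, p = fundamental frequency factor, f = formant factor,
# all relative to an adult male source voice.
PRESETS = {
    "male_to_female": {"t": 1.0, "p": 1.65, "f": 7.0 / 6.0},
    "male_to_child": {"t": 1.1, "p": 1.8, "f": 8.0 / 6.0},
    "male_to_elderly": {"t": 1.1, "p": 0.75, "f": 0.95},
}


def invert(preset):
    """ASSUMPTION: the reverse conversion uses the reciprocal of each factor."""
    return {name: 1.0 / value for name, value in preset.items()}


female_to_male = invert(PRESETS["male_to_female"])
```

The child preset uses the smallest admissible pitch factor so that the combined factor t*p stays just below 2.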

Example

Following the method proposed by the invention, a demonstration program was implemented on a PC platform. It can record, play back the original sound, and process and play back simulated male, female, elderly or child voices in real time. The program preprocesses each buffer in the playback buffer queue, adjusting the duration, fundamental frequency and formants of the speech data in that buffer by preset ratios to simulate male, female, elderly and child voices respectively. The ratios of these three features can also be adjusted manually to obtain a more satisfactory result. The program runs in real time: adjustments take effect immediately during playback.

On a test platform with a P4 2.4 GHz CPU and 256 MB of memory, CPU usage was 2% when idle, about 10% when playing the original sound, and about 22% during real-time voice-changed playback. The processor requirements of the method are therefore entirely acceptable, and the sound quality is satisfactory.

Claims

1. A real-time voice changing method based on speech analysis and synthesis, using Fourier analysis and synthesis techniques, characterized in that it comprises the following steps: interpolating or decimating the signal in the time domain according to the required change in duration; transforming to the frequency domain and processing the amplitude spectrum and the phase spectrum separately; separating the fundamental frequency and the formants and adjusting them independently, the effect of the duration adjustment on both being compensated during the adjustment; and finally restoring the time-domain signal; wherein the fundamental frequency and the formant positions are adjusted as follows:

Step S1-3: extract the envelope of amplitude spectrum II to obtain the envelope spectrum III, and scale III along the frequency axis by the adjustment factor t*f to obtain the envelope spectrum IV with adjusted formant positions, where f denotes the formant position adjustment factor;

Step S1-4: divide amplitude spectrum II point by point by envelope spectrum III to obtain spectrum V, scale the abscissa of spectrum V along the frequency axis by the adjustment factor t*p, and then multiply point by point by the adjusted envelope spectrum IV to obtain the adjusted amplitude spectrum VII, where p denotes the fundamental frequency adjustment factor;

2. The real-time voice changing method based on speech analysis and synthesis according to claim 1, characterized in that the fundamental frequency and the formant positions of the signal are adjusted independently in the frequency domain; the fundamental frequency and the formant positions are separated, so that the fundamental frequency of the speech signal and its harmonics can be changed while the formant positions are either kept or adjusted at will, whereby timbre and pitch are changed independently.

3. The real-time voice changing method based on speech analysis and synthesis according to claim 2, characterized in that the duration of the speech signal is changed directly in the time domain: the digital signal is resampled by interpolation or decimation, which lengthens or shortens the time scale of the speech, and the resulting changes in fundamental frequency and formant positions are compensated, so that only the speaking rate is changed.

4. The real-time voice changing method based on speech analysis and synthesis according to claim 1, characterized in that the spectral envelope of the amplitude spectrum is extracted and, on that basis, the shape of the spectral envelope of the new amplitude spectrum obtained after the fundamental frequency adjustment is changed, so that the formant positions can be adjusted at will without affecting the fundamental frequency.

5. The real-time voice changing method based on speech analysis and synthesis according to claim 1 or 2, characterized in that the speech analysis and synthesis comprises the following steps:

Step S2-1: process the signal in the time domain, including overlapping framing, interpolation and windowing;

Step S2-2: transform each frame obtained in the time domain to the frequency domain by a time-frequency transform, process the spectrum, including adjustment of the fundamental frequency and the formants, and return to the time domain by the inverse time-frequency transform;