CN1189664A - Sub-voice discrimination method of voice coding - Google Patents

Sub-voice discrimination method of voice coding

Info

Publication number
CN1189664A
CN1189664A CN97100494A
Authority
CN
China
Prior art keywords
consonant
value
frame
sound
energy
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN97100494A
Other languages
Chinese (zh)
Inventor
林进灯
林信安
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
HETAI SEMICONDUCTOR CO Ltd
Original Assignee
HETAI SEMICONDUCTOR CO Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by HETAI SEMICONDUCTOR CO Ltd filed Critical HETAI SEMICONDUCTOR CO Ltd
Priority to CN97100494A priority Critical patent/CN1189664A/en
Publication of CN1189664A publication Critical patent/CN1189664A/en
Pending legal-status Critical Current

Links

Images

Landscapes

  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

A sub-frame voiced/unvoiced discrimination method for speech coding includes dividing an input speech frame into four sub-frames and performing a voiced/unvoiced decision on each sub-frame. The decision process includes comparing the normalized cross-correlation value associated with each sub-frame with an upper threshold and a lower threshold; performing a stationary/non-stationary decision on the energy value and the line spectrum pair (LSP) coefficients of the sub-frame; if the energy value and the LSP value are greater than preset thresholds, performing a decision on the low-to-high band energy ratio (LOH) of the sub-frame; and judging the sub-frame to be voiced speech if the LOH value is greater than a threshold, or unvoiced speech otherwise.

Description

Sub-frame voiced/unvoiced discrimination method for speech coding
The present invention relates to a speech coding method, and in particular to a voiced/unvoiced sub-frame discrimination method used in speech coding.
Speech synthesis commonly employs the linear predictive coding vocoder (LPC vocoder). Among linear predictive speech coding methods, the LPC-10 vocoder is widely used for low-bit-rate speech compression. For an LPC vocoder, correctly deciding whether the input speech signal is voiced or unvoiced is a critical problem, because this voiced/unvoiced decision strongly affects the quality of the synthesized speech.
Fig. 1 shows a block diagram of a conventional speech coding system. As shown, it comprises an impulse train generator 11, a random noise generator 12, a voiced/unvoiced switch 13, a gain unit 14, an LPC filter 15, and an LPC filter control parameter setting unit 16.
The periodic impulse train produced by the impulse train generator 11, or the white noise produced by the random noise generator 12, is selected by the voiced/unvoiced switch 13 according to the type of the input signal. The selected excitation is first scaled by the gain unit 14 to adjust its signal level, then filtered by the LPC filter 15 according to the preset LPC parameters supplied by the LPC filter control parameter setting unit 16, and the speech signal S(n) is finally output at the output terminal of the LPC filter 15.
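For reference only (this code is not part of the patent), the following Python sketch implements the classic excitation-plus-filter vocoder structure of Fig. 1; the LPC coefficients, gain, and pitch period in the example call are illustrative placeholders:

```python
import numpy as np
from scipy.signal import lfilter

def synthesize_frame(lpc_a, gain, voiced, pitch_period, frame_len=160, rng=None):
    """Synthesize one speech frame with the classic LPC vocoder model.

    lpc_a: prediction coefficients a_1..a_p of the all-pole filter 1/A(z),
           with A(z) = 1 - sum_k a_k z^{-k}.
    """
    rng = np.random.default_rng() if rng is None else rng
    if voiced:
        # Periodic impulse train at the pitch period.
        excitation = np.zeros(frame_len)
        excitation[::pitch_period] = 1.0
    else:
        # White noise excitation for unvoiced frames.
        excitation = rng.standard_normal(frame_len)
    # Apply gain, then the all-pole LPC synthesis filter.
    denom = np.concatenate(([1.0], -np.asarray(lpc_a, dtype=float)))
    return lfilter([1.0], denom, gain * excitation)

# Illustrative call with placeholder parameters.
s = synthesize_frame(lpc_a=[0.5, -0.1], gain=0.8, voiced=True, pitch_period=40)
```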
During the above discrimination procedure, the coder updates the voiced/unvoiced decision, the pitch period, the LPC filter parameters, and the gain value for every frame of the input speech, in order to track changes in the input signal. In a typical prior-art system, each speech frame contains 160 samples, i.e. one frame is taken every 0.02 second.
In the conventional voiced/unvoiced decision used in the above speech coders, the discrimination is based on the strength of the correlation over the relevant lag. For example, if the normalized cross-correlation value (abbreviated NC value) is greater than a predefined threshold, for example 0.4, the frame is judged to be a normal voiced speech signal, and the speech synthesizer excites the LPC filter with a periodic pulse train. Conversely, if the NC value is less than the threshold 0.4, the frame is judged to be an unvoiced signal, and the speech synthesizer excites the LPC filter with the random noise generator. The NC value is defined as:
$$NC = \frac{\sum_{n=0}^{N-1} s(n)\,s(n-t)}{\sqrt{\sum_{n=0}^{N-1} s(n)\,s(n)\;\sum_{n=0}^{N-1} s(n-t)\,s(n-t)}}$$
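As a minimal sketch (not from the patent), the NC value of a frame at a candidate lag t can be computed as follows; the overlap handling at the frame boundary is an assumption:

```python
import numpy as np

def nc_value(s, t):
    """Normalized cross-correlation of a frame with itself at lag t; values near 1 indicate strong periodicity."""
    s = np.asarray(s, dtype=float)
    x = s[t:]            # s(n)
    y = s[:len(s) - t]   # s(n - t)
    denom = np.sqrt(np.dot(x, x) * np.dot(y, y))
    return float(np.dot(x, y) / denom) if denom > 0.0 else 0.0

def voiced_by_nc(s, t, threshold=0.4):
    """Traditional single-threshold decision described above."""
    return nc_value(s, t) > threshold
```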
However, for an unstable speech signal (i.e. one lying near the threshold, where an exact boundary is difficult to fix), the NC value may be small and fall below the threshold 0.4. In that case the simple decision method described above cannot accurately determine whether the signal is voiced or unvoiced, so misjudgments may occur in practical applications.
To overcome the above problem and improve the discrimination accuracy, the prior art therefore supplements the NC-value decision with an additional discrimination based on the speech signal energy, so that a reasonably accurate result can be reached.
Accordingly, another, improved voiced/unvoiced discrimination method was developed in the prior art. According to this second prior art, the discrimination based on the speech signal energy considers the following two measures:
A. Speech energy
Generally, the energy of a noise signal is lower than that of a voiced signal. The root-mean-square (RMS) value of the energy is (a code sketch of both measures A and B is given after the ZC definition below):
$$RMS = \sqrt{\frac{\sum_{n=0}^{N-1} s(n)\,s(n)}{N}}$$
where N is the number of samples in the whole frame of the input speech signal.
B. Zero-crossing rate (ZC)
It is defined as the number of zero crossings over the whole frame:
$$ZC = \frac{1}{2}\sum_{i=2}^{N} \bigl|\,\mathrm{sgn}\bigl(s(i-1)\bigr) - \mathrm{sgn}\bigl(s(i)\bigr)\bigr|$$
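A minimal sketch of both prior-art measures, assuming a NumPy array s holding one frame of samples:

```python
import numpy as np

def rms_energy(s):
    """Root-mean-square energy of a frame."""
    s = np.asarray(s, dtype=float)
    return float(np.sqrt(np.dot(s, s) / len(s)))

def zero_crossings(s):
    """Number of sign changes across the frame (the ZC measure above)."""
    s = np.asarray(s, dtype=float)
    signs = np.sign(s)
    return int(0.5 * np.sum(np.abs(signs[1:] - signs[:-1])))
```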
In the above speech coding technique, each frame contains 160 samples and is encoded with 34 bits of LPC parameters, 6 bits of pitch, 1 bit for the voiced/unvoiced decision, and 7 bits for the gain value, for a total of 48 bits.
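For orientation (this arithmetic is not stated in the patent but follows from the figures above), the bit allocation and the 0.02 s frame length imply a coding rate of
$$\frac{(34 + 6 + 1 + 7)\ \text{bits}}{0.02\ \text{s}} = \frac{48\ \text{bits}}{0.02\ \text{s}} = 2400\ \text{bit/s},$$
which matches the nominal 2.4 kbit/s rate commonly quoted for the LPC-10 coder.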
As mentioned above, correctly deciding whether the input speech signal is voiced or unvoiced is a very important problem in speech coding, because this process strongly affects the quality of the synthesized speech. If, in the voiced/unvoiced decision, an unvoiced segment is misjudged as voiced, the synthesized output tends to sound like a dull buzzing; if a voiced segment is misjudged as unvoiced, the synthesized output tends to sound harsh and raspy. The conventional techniques described above cannot effectively solve this problem.
Moreover, in the second prior art described above, a single bit decides whether a whole frame is voiced or unvoiced, including frames that lie in the transition region between voiced and unvoiced speech. Because a whole frame in this transition region must be judged either entirely voiced or entirely unvoiced, the synthesized output often sounds noisy.
The primary object of the present invention is to provide a voiced/unvoiced sub-frame discrimination method for accurate speech coding.
Another object of the present invention is to provide a method that accurately discriminates voiced from unvoiced speech; with the discrimination method of the present invention, whether a frame of the input speech signal is voiced or unvoiced can be determined accurately.
A third object of the present invention is to provide a quarter voiced/unvoiced decision scheme, which divides each frame of the input speech signal into four sub-frames and then, for each sub-frame, decides from its associated parameters whether it is voiced or unvoiced, so that the discrimination result yields an accurate and natural speech signal at the speech synthesis output.
To achieve the above objects, the present invention adopts the following scheme:
In the method of the present invention, the frame of the input speech signal is first divided into four sub-frames. It is then determined in turn whether the NC value (normalized cross-correlation value) of these four sub-frames is greater than or equal to an upper threshold (e.g. 0.7), and then whether the NC value is less than a lower threshold (e.g. 0.4). These two steps identify the signals that clearly belong to voiced and to unvoiced speech. The next stage discriminates the signals lying between the clearly voiced and clearly unvoiced cases, and comprises the following steps: if the NC value is determined in the preceding steps to be not less than the lower threshold, a stationary/non-stationary decision is performed, examining the energy value of the sub-frame and the line spectrum pair (LSP) coefficient value respectively. If the energy value and the LSP value are not greater than preset thresholds, the speech signal is judged to be stationary, and the attributes of the four sub-frames are all set to the same voiced/unvoiced state as the last sub-frame of the previous frame. If the energy value and the LSP value are determined to be greater than the preset thresholds, the low-to-high band energy ratio (LOH) of the sub-frame is examined: if the LOH value of a sub-frame is greater than a threshold, that sub-frame is judged to be voiced; otherwise it is judged to be unvoiced. The next sub-frame is then examined, until all four sub-frames have been discriminated. A sketch of this decision flow is given below.
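The following minimal Python sketch (not the patent's reference implementation; the measurements are passed in as precomputed values and the thresholds are the example values above) mirrors this quarter voiced/unvoiced decision:

```python
def quarter_vuv_decision(nc, dis_lsp, dis_energy, loh_values, prev_last_state,
                         nc_high=0.7, nc_low=0.4,
                         lsp_th=0.45, energy_th=0.4, loh_th=1.0):
    """Four-sub-frame voiced/unvoiced decision; returns four True/False flags (True = voiced).

    nc             : normalized cross-correlation of the current frame
    dis_lsp        : difference between the past average LSP and the current LSP
    dis_energy     : difference between the previous and current energy
    loh_values     : low-to-high band energy ratios of the four sub-frames
    prev_last_state: voiced/unvoiced state of the last sub-frame of the previous frame
    """
    if nc >= nc_high:
        return [True] * 4                      # clearly voiced frame
    if nc < nc_low:
        return [False] * 4                     # clearly unvoiced frame
    # Transition region: stationary/non-stationary test on the two differences.
    if not (dis_lsp >= lsp_th and dis_energy >= energy_th):
        return [prev_last_state] * 4           # stationary: reuse previous state
    # Non-stationary: classify each sub-frame by its LOH value.
    return [v > loh_th for v in loh_values]

# Illustrative call with placeholder measurements.
flags = quarter_vuv_decision(nc=0.55, dis_lsp=0.5, dis_energy=0.6,
                             loh_values=[1.4, 0.8, 1.2, 0.3],
                             prev_last_state=True)
```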
The present invention is described in detail below in conjunction with the drawings and an embodiment.
Brief Description Of Drawings:
Fig. 1 is a basic block diagram of a conventional speech coding system.
Fig. 2 is a flow chart of the discrimination method of the present invention.
Fig. 3 is the coding table of the present invention in which the four sub-frames are encoded with 3 bits.
In the discrimination method of the present invention, each frame of the input speech signal is divided into four sub-frames, and for each sub-frame a voiced/unvoiced decision is made from its associated parameters. These parameters include the NC value, the energy, the line spectrum pair (LSP) coefficients, and the low-to-high band energy ratio (LOH).
The discrimination steps of the present invention are described below with reference to Fig. 2, the flow chart of the invention. After the flow starts at step 101, step 102 is first executed to obtain the current frame data. Step 103 then determines whether the NC value (defined above) is greater than or equal to an upper threshold 0.7. If so, step 104 is executed: the four sub-frames of the current input frame are all judged to be voiced, and the discrimination procedure ends.
If, in step 103, the NC value is determined not to be greater than or equal to the upper threshold 0.7, step 105 determines whether the NC value is less than a lower threshold 0.4. If so, the four sub-frames of this frame are all judged to be unvoiced, and the discrimination procedure ends.
After the discriminations of steps 103 and 105, the signals that clearly belong to voiced and to unvoiced speech have been determined. What remains is to discriminate the signals lying between the clearly voiced and clearly unvoiced cases. In this unstable, transient region, the NC-value decisions of steps 103 and 105 alone cannot accurately distinguish voiced from unvoiced speech, so the following discrimination method is needed to achieve the intended purpose of the present invention; the following steps are therefore the most critical steps of the invention.
If, in step 105, the NC value is determined not to be less than 0.4, a stationary/non-stationary decision (S/NS decision) is performed in step 107. This step includes two discrimination items. One is the energy discrimination, which examines the difference between the previous energy and the current energy, dis(PrEng, CuEng). To further increase the accuracy of the S/NS decision, this step also includes a discrimination of the LSP coefficients, which are derived from the LPC analysis. In the LSP discrimination, the difference between the past average LSP and the current LSP, dis(PaLSP, CuLSP), is obtained.
In the S/NS decision of step 107, the following conditions are tested:
a. dis(PaLSP, CuLSP) ≥ 0.45; and
b. dis(PrEng, CuEng) ≥ 0.4.
If the result is no, the speech signal is stationary, and step 108 is executed: the attributes of the four sub-frames are all set to the same voiced/unvoiced state as the last sub-frame of the previous frame. A sketch of this test appears after this paragraph.
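The patent does not spell out the distance measures dis(·, ·); the sketch below is purely an assumption, using an absolute difference for the energy and a mean absolute difference for the LSP vectors, and is meant only to make the step-107 test concrete:

```python
import numpy as np

def sns_is_nonstationary(past_avg_lsp, cur_lsp, prev_energy, cur_energy,
                         lsp_th=0.45, energy_th=0.4):
    """Stationary/non-stationary test of step 107 under assumed distance measures."""
    dis_lsp = float(np.mean(np.abs(np.asarray(past_avg_lsp) - np.asarray(cur_lsp))))
    dis_energy = abs(cur_energy - prev_energy)
    # Non-stationary only when both differences exceed their thresholds (conditions a and b).
    return dis_lsp >= lsp_th and dis_energy >= energy_th
```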
Conversely, if the result of the difference discrimination in step 107 is yes (i.e. the energy or the LSP coefficients are changing rapidly), the LOH discrimination (steps 109 to 113) is performed, and each sub-frame is classified as voiced or unvoiced so as to obtain a more accurate result. The LOH discrimination is defined as:
$$LOH(i) = \frac{\dfrac{1}{W}\displaystyle\sum_{k=-W/2}^{W/2-1} s_{lp1k}^{2}\bigl(k + d_{offset}(i)\bigr)}{\dfrac{1}{W}\displaystyle\sum_{k=-W/2}^{W/2-1} s_{hp1k}^{2}\bigl(k + d_{offset}(i)\bigr) + T_{sil}}$$
where i denotes the i-th sub-frame; s_lp1k denotes the signal obtained after the original speech signal has passed through a 1 kHz low-pass filter, and s_hp1k the signal after a 1 kHz high-pass filter, so that LOH is the ratio, over a window of length W, of the energy below 1 kHz to the energy above 1 kHz. The window length W is defined as follows:
W = pitch, if pitch ≥ N_subframe;
W = 2·pitch, if N_subframe/2 ≤ pitch < N_subframe,
where pitch is the pitch period and N_subframe is the number of samples in a sub-frame.
In addition, the LOH definition includes a silence threshold Tsil, selected with respect to the maximum speech level of the current frame; this Tsil value is added to the energy of the speech signal after 1 kHz high-pass filtering, so that low-energy voiced-like signals tend to be classified as unvoiced.
d_offset(j) is the center of the j-th sub-frame and is defined as:
d_offset(j) = N_subframe · (j − 1/2), j = 1–4,
where j is the index of the sub-frame. A sketch of the LOH computation is given below.
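A hedged sketch of the LOH computation; the Butterworth band-split filters, the edge clamping, and the Tsil value are illustrative assumptions rather than details given in the patent (an 8 kHz sampling rate and N_subframe = 40 are assumed from the 160-sample frame):

```python
import numpy as np
from scipy.signal import butter, lfilter

def band_split_1khz(s, fs=8000):
    """Split a signal into <1 kHz and >1 kHz bands with simple Butterworth filters (illustrative)."""
    b_lo, a_lo = butter(4, 1000, btype="low", fs=fs)
    b_hi, a_hi = butter(4, 1000, btype="high", fs=fs)
    return lfilter(b_lo, a_lo, s), lfilter(b_hi, a_hi, s)

def loh(s_lp, s_hp, i, pitch, n_subframe=40, t_sil=1e-4):
    """Low-to-high band energy ratio of the i-th sub-frame (i = 1..4)."""
    # Window length W chosen from the pitch period as in the definition above
    # (pitch values below N_subframe/2 are folded into the second branch as a simplification).
    w = pitch if pitch >= n_subframe else 2 * pitch
    center = int(n_subframe * (i - 0.5))          # d_offset(i)
    lo, hi = center - w // 2, center + w // 2     # k = -W/2 .. W/2 - 1
    lo, hi = max(lo, 0), min(hi, len(s_lp))       # clamp at the frame edges (assumption)
    low_energy = float(np.mean(s_lp[lo:hi] ** 2))
    high_energy = float(np.mean(s_hp[lo:hi] ** 2))
    return low_energy / (high_energy + t_sil)
```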
In the LOH discrimination flow of the present invention, step 110 first determines whether the LOH of the first sub-frame (see the definition above) is greater than 1. If it is greater than 1, step 112 is executed and the sub-frame is judged to be voiced; if not, step 111 is executed and the sub-frame is judged to be unvoiced. The loop of steps 113 and 109 then moves on to the next sub-frame, until all four sub-frames have been discriminated. That is, after the LOH discrimination, a sub-frame whose LOH value is greater than the threshold is judged to be voiced, and otherwise unvoiced. After all four sub-frames of a frame have been judged, the result is encoded accordingly. In the present invention, only 3 bits are needed to encode the four sub-frames, as shown in Fig. 3, where '1' denotes voiced and '0' denotes unvoiced.
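Fig. 3's coding table is not reproduced in this text. One plausible construction, offered only as an assumption, is that the 3-bit index enumerates the eight voiced/unvoiced patterns having at most one transition within the frame:

```python
# Hypothetical 3-bit coding table: '1' = voiced sub-frame, '0' = unvoiced sub-frame.
# Only patterns with at most one voiced/unvoiced transition are assumed representable.
VUV_PATTERNS = ["0000", "0001", "0011", "0111", "1111", "1110", "1100", "1000"]

def encode_vuv(flags):
    """Map four sub-frame flags to a 3-bit index, falling back to the nearest allowed pattern."""
    pattern = "".join("1" if f else "0" for f in flags)
    if pattern in VUV_PATTERNS:
        return VUV_PATTERNS.index(pattern)
    # Fall back to the allowed pattern with the fewest differing sub-frames (assumption).
    return min(range(len(VUV_PATTERNS)),
               key=lambda i: sum(a != b for a, b in zip(VUV_PATTERNS[i], pattern)))

def decode_vuv(index):
    """Recover the four voiced/unvoiced flags from the 3-bit index."""
    return [c == "1" for c in VUV_PATTERNS[index]]
```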
After the index value shown in Fig. 3 is obtained, the corresponding value is stored and the encoding process is complete. In practical applications, the result can then be decoded by known speech synthesis techniques to produce the required synthesized speech.
In summary, the present invention has the following effect:
Because a stationary/non-stationary decision step is added to the discrimination procedure of the present invention, and the precision of the speech signal energy discrimination is thereby increased, the accuracy of speech signal coding is improved.

Claims (8)

1. A voiced/unvoiced sub-frame discrimination method for speech coding, for identifying the voiced/unvoiced attribute of a frame of input speech, characterized in that the method comprises the following steps:
a. obtaining the current input speech frame data and dividing it into four sub-frames;
b. determining in turn whether the NC value of the four sub-frames is greater than or equal to an upper threshold and, if so, judging that the four sub-frames of the current input frame are all voiced;
c. if, in step b, the NC value of the sub-frames is determined not to be greater than or equal to the upper threshold, determining whether the NC value is less than a lower threshold and, if so, judging that the four sub-frames of the frame are all unvoiced;
d. if, in step c, the NC value is determined not to be less than the lower threshold, performing a stationary/non-stationary decision, examining the energy value of the sub-frame and the line spectrum pair coefficient value respectively;
e. if the energy value and the LSP value are not greater than preset thresholds, judging the speech signal to be stationary and setting the attributes of the four sub-frames to the same voiced/unvoiced state as the last sub-frame of the previous frame;
f. in step e, if the energy value and the LSP value are greater than the preset thresholds, performing the low-to-high band energy ratio (LOH) discrimination for the sub-frame, determining whether the LOH value of each sub-frame is greater than a threshold; if it is greater than the threshold, judging the sub-frame to be voiced; if not, judging the sub-frame to be unvoiced; and proceeding to the next sub-frame until all four sub-frames have been discriminated.
2. The voiced/unvoiced sub-frame discrimination method for speech coding according to claim 1, characterized in that, when discriminating the NC value of the sub-frames in step b, the upper threshold is set to 0.7.
3. The voiced/unvoiced sub-frame discrimination method for speech coding according to claim 1, characterized in that, when discriminating the NC value of the sub-frames in step c, the lower threshold is set to 0.4.
4. The voiced/unvoiced sub-frame discrimination method for speech coding according to claim 1, characterized in that, in the stationary/non-stationary decision of step d, the discrimination of the energy value of the sub-frame is a judgment of whether the difference between the previous energy and the current energy is greater than or equal to a preset threshold.
5. The voiced/unvoiced sub-frame discrimination method for speech coding according to claim 4, characterized in that, in the discrimination of the energy value of the frame, the threshold is set to 0.45.
6. The voiced/unvoiced sub-frame discrimination method for speech coding according to claim 1, characterized in that, in the stationary/non-stationary decision of step d, the discrimination of the LSP coefficients of the sub-frame is a judgment of the difference between the previous average LSP coefficients and the current LSP coefficient value.
7. The voiced/unvoiced sub-frame discrimination method for speech coding according to claim 6, characterized in that, in the discrimination of the LSP coefficient value of the frame, the threshold is set to 0.4.
8. The voiced/unvoiced sub-frame discrimination method for speech coding according to claim 1, characterized in that, in the low-to-high band energy ratio (LOH) discrimination of the sub-frame in step f, the LOH is defined as:
$$LOH(i) = \frac{\dfrac{1}{W}\displaystyle\sum_{k=-W/2}^{W/2-1} s_{lp1k}^{2}\bigl(k + d_{offset}(i)\bigr)}{\dfrac{1}{W}\displaystyle\sum_{k=-W/2}^{W/2-1} s_{hp1k}^{2}\bigl(k + d_{offset}(i)\bigr) + T_{sil}}$$
where i denotes the i-th sub-frame; s_lp1k denotes the signal obtained after the original speech signal has passed through a 1 kHz low-pass filter, and s_hp1k the signal after a 1 kHz high-pass filter, so that LOH is the ratio, over a window of length W, of the energy below 1 kHz to the energy above 1 kHz; the window length W is defined as follows:
W = pitch, if pitch ≥ N_subframe;
W = 2·pitch, if N_subframe/2 ≤ pitch < N_subframe,
where pitch is the pitch period and N_subframe is the number of samples in a sub-frame;
the silence threshold Tsil is selected with respect to the maximum speech level of the current frame and is added to the energy of the speech signal after 1 kHz high-pass filtering, so that low-energy voiced-like signals tend to be classified as unvoiced; d_offset(j) is the center of the j-th sub-frame and is defined as:
d_offset(j) = N_subframe · (j − 1/2), j = 1–4,
where j is the index of the sub-frame.
CN97100494A 1997-01-29 1997-01-29 Sub-voice discrimination method of voice coding Pending CN1189664A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN97100494A CN1189664A (en) 1997-01-29 1997-01-29 Sub-voice discrimination method of voice coding

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN97100494A CN1189664A (en) 1997-01-29 1997-01-29 Sub-voice discrimination method of voice coding

Publications (1)

Publication Number Publication Date
CN1189664A true CN1189664A (en) 1998-08-05

Family

ID=5165093

Family Applications (1)

Application Number Title Priority Date Filing Date
CN97100494A Pending CN1189664A (en) 1997-01-29 1997-01-29 Sub-voice discrimination method of voice coding

Country Status (1)

Country Link
CN (1) CN1189664A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100380441C (en) * 2001-05-11 2008-04-09 Koninklijke Philips Electronics N.V. Estimating signal power in compressed audio
CN105989834A (en) * 2015-02-05 2016-10-05 Acer Inc. Voice recognition apparatus and voice recognition method
CN108461090A (en) * 2017-02-21 2018-08-28 Acer Inc. Speech signal processing device and audio signal processing method
CN108461090B (en) * 2017-02-21 2021-07-06 Acer Inc. Speech signal processing apparatus and speech signal processing method
CN112130495A (en) * 2020-09-22 2020-12-25 Hubei University Digital configurable acoustic signal filtering device and filtering method

Similar Documents

Publication Publication Date Title
CN102089803B (en) Method and discriminator for classifying different segments of a signal
McCree et al. A mixed excitation LPC vocoder model for low bit rate speech coding
CN101131817B (en) Method and apparatus for robust speech classification
KR100883656B1 (en) Method and apparatus for discriminating audio signal, and method and apparatus for encoding/decoding audio signal using it
Kubin et al. Performance of noise excitation for unvoiced speech
KR100383377B1 (en) Method and apparatus for pitch estimation using perception based analysis by synthesis
CN104115220A (en) Very short pitch detection and coding
KR100795727B1 (en) A method and apparatus that searches a fixed codebook in speech coder based on CELP
Wang et al. Phonetically-based vector excitation coding of speech at 3.6 kbps
US6985857B2 (en) Method and apparatus for speech coding using training and quantizing
CN105359211A (en) Unvoiced/voiced decision for speech processing
CN104254886A (en) Adaptively encoding pitch lag for voiced speech
KR100546758B1 (en) Apparatus and method for determining transmission rate in speech code transcoding
CN1189664A (en) Sub-voice discrimination method of voice coding
Thomson et al. Selective modeling of the LPC residual during unvoiced frames: White noise or pulse excitation
CN101145343B (en) Encoding and decoding method for audio frequency processing frame
KR100291584B1 (en) Speech waveform compressing method by similarity of fundamental frequency/first formant frequency ratio per pitch interval
CN1190773A (en) Method estimating wave shape gain for phoneme coding
CN114913844A (en) Broadcast language identification method for pitch normalization reconstruction
CN1343967A (en) Speech identification system
Ojala Toll quality variable-rate speech codec
JPH10222194A (en) Discriminating method for voice sound and voiceless sound in voice coding
KR100399057B1 (en) Apparatus for Voice Activity Detection in Mobile Communication System and Method Thereof
Anselam et al. QUALITY EVALUATION OF LPC BASED LOW BIT RATE SPEECH CODERS
Li et al. Analysis-by-synthesis low-rate multimode harmonic speech coding.

Legal Events

Date Code Title Description
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C06 Publication
PB01 Publication
C53 Correction of patent for invention or patent application
CB02 Change of applicant information

Applicant after: Shengqun Semiconductor Co., Ltd.

Applicant before: Hetai Semiconductor Co., Ltd.

COR Change of bibliographic data

Free format text: CORRECT: APPLICANT; FROM: HETAI SEMICONDUCTOR CO., LTD. TO: SHENGQUN SEMICONDUCTOR CO., LTD.

C01 Deemed withdrawal of patent application (patent law 1993)
WD01 Invention patent application deemed withdrawn after publication