CN102982801B - Phonetic feature extracting method for robust voice recognition - Google Patents

Phonetic feature extracting method for robust voice recognition

Info

Publication number
CN102982801B
CN102982801B (granted); application CN201210449436.XA; publication CN102982801A
Authority
CN
China
Prior art keywords
speech
power spectrum
spectrum
short
formula
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201210449436.XA
Other languages
Chinese (zh)
Other versions
CN102982801A (en)
Inventor
徐波 (Xu Bo)
范利春 (Fan Lichun)
柯登峰 (Ke Dengfeng)
孟猛 (Meng Meng)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science filed Critical Institute of Automation of Chinese Academy of Science
Priority to CN201210449436.XA
Publication of CN102982801A
Application granted
Publication of CN102982801B
Legal status: Active; anticipated expiration status tracked.

Landscapes

  • Filters That Use Time-Delay Elements (AREA)

Abstract

The invention discloses a speech feature extraction method for robust speech recognition. The method comprises: obtaining the power spectrum of the speech signal; passing the power spectrum through a filter bank; obtaining a medium-duration power spectrum by frame averaging; applying asymmetric filtering and a masking process to the medium-duration power spectrum to obtain the medium-duration power spectrum of clean speech; channel-averaging the ratio of the clean-speech power spectrum to the noisy-speech power spectrum to smooth it; multiplying the smoothed ratio by the power spectrum output by the filter bank to obtain the short-time power spectrum of clean speech; applying energy normalization to the short-time power spectrum to eliminate multiplicative noise; applying equal-loudness pre-emphasis to the power spectrum; applying an exponential operation to the power spectrum; taking its inverse Fourier transform to obtain the cepstral coefficients of the signal; and applying mean normalization to the cepstral coefficients. Feature extraction with this method is fast and can run online. Acoustic models trained on features extracted by this method resist noise well, making the method of considerable practical value.

Description

A speech feature extraction method for robust speech recognition
Technical field
The present invention relates to the field of speech recognition, and in particular to a speech feature extraction method that markedly suppresses both stationary and non-stationary noise in speech recognition.
Background technology
The sharp drop in recognition performance under complex acoustic environments is one of the most important problems in speech recognition. For example, when a user queries a geographic location by mobile phone while on the road, the acoustic environment is complex and changes rapidly, which severely affects the speech recognition system. A system that achieves good recognition in noise-free conditions will, in real applications, suffer degraded performance due to time-varying, unpredictable environmental noise and channel effects, speaker differences, variation in the content of speech, and other factors. Improving the robustness of speech recognition under mismatched training and testing conditions has therefore become key to speech recognition technology.
In recent years, many improved techniques and algorithms have been proposed for the environmental robustness of speech recognition, with some success. Following the speech recognition pipeline, robust speech recognition can be divided into four classes: noise suppression in the time-frequency domain; noise compensation in the feature domain; noise adaptation in the model domain; and adaptation in the decoding domain. The earliest techniques performed noise suppression in the time-frequency domain, for example spectral subtraction and Wiener filtering, including the classic two-stage Wiener filtering of ETSI. Feature-domain noise suppression normally compensates for noise during feature extraction; because PLP and MFCC features have long dominated, most feature-domain noise suppression is carried out on these two feature types, for example vector Taylor series methods. The third class adapts in the model domain, including multi-style speech models and HMMs with shared variable parameters. The fourth class is noise adaptation at the decoding level, including uncertainty decoding and replacing uncertainty decoding with sub-band re-estimation.
Essentially, all of these methods seek, under some criterion, an optimal compensation for the mismatch between training and testing environments. Under a series of assumptions (additive noise with a Gaussian distribution, independence between noise and speech and between different noises, the roll-off characteristic of the channel, and so on), these methods have made useful explorations of and contributions to the robustness of speech recognition, and achieve good noise suppression under stationary noise in particular. A large gap nevertheless remains between this and the requirements of speech recognition systems in real noisy environments, and these methods are of little help in more complex environments, such as those with burst noise.
Summary of the invention
(1) Technical problem to be solved
To address the low speech recognition rate in complex environments described above, and the weakness of common feature extraction methods at suppressing non-stationary noise, the present invention proposes a feature extraction method that improves the recognition rate. Its objective is to improve the recognition rate for speech affected by additive noise such as burst noise and music noise, without lowering the recognition rate in clean conditions.
(2) technical scheme
The speech feature extraction method for robust speech recognition on which the present invention is based comprises the following steps:
Step 1: obtain the power spectrum of the speech signal.
Step 2: pass the power spectrum through a filter bank to obtain the short-time power spectrum of the noisy speech.
Step 3: from the short-time power spectrum of the noisy speech, compute the medium-duration power spectrum of the noisy speech by frame averaging.
Step 4: apply asymmetric filtering and masking-based noise suppression to the medium-duration power spectrum of the noisy speech, to obtain the medium-duration power spectrum of clean speech.
Step 5: obtain the short-time power spectrum of clean speech from the medium-duration power spectrum of clean speech, the medium-duration power spectrum of noisy speech, and the short-time power spectrum of noisy speech.
Step 6: apply energy normalization to the short-time power spectrum of clean speech, to eliminate multiplicative noise.
Step 7: apply equal-loudness pre-emphasis to the short-time power spectrum of clean speech after the multiplicative noise has been removed.
Step 8: apply the exponential nonlinearity to the short-time power spectrum of clean speech after equal-loudness pre-emphasis.
Step 9: take the inverse Fourier transform of the short-time power spectrum after the exponential nonlinearity, compute the cepstral coefficients, and apply mean normalization to the cepstral coefficients, finally obtaining the speech features.
The present invention starts from traditional speech feature extraction methods and, to address their weak noise resistance, introduces several improvements, finally forming a new speech feature extraction method. Exploiting the fact that noise changes more slowly than speech, the invention converts the short-time power spectrum into a medium-duration power spectrum by frame averaging, for use in noise estimation; uses asymmetric filtering to estimate the spectral envelopes of the noise and of the speech in the noisy signal separately; on top of the asymmetric filtering, estimates a signal-to-noise ratio by masking, processes it, and converts it into a gain on the short-time power spectrum for noise suppression; and further processes the power spectrum with energy normalization and an exponential nonlinearity. The proposed speech feature extraction method for robust speech recognition not only estimates the noise more accurately, it also makes the speech features conform better to the auditory properties of the human ear. The features it extracts therefore suppress noise well.
(3) beneficial effect
The present invention starts from traditional speech feature extraction and adds noise-suppression processing and transformations that conform to human auditory perception, so that the method not only resists various additive noises but also achieves a higher recognition rate in clean conditions than traditional feature extraction methods.
Brief description of the drawings
Fig. 1 is the overall flow diagram of the speech feature extraction method for robust speech recognition of the present invention;
Fig. 2 is the structural flow chart of the asymmetric low-pass filtering noise-suppression module with masking;
Fig. 3 is the structural flow chart of the masking block in Fig. 2.
Embodiment
To make the objectives, technical solutions and advantages of the present invention clearer, the present invention is described in more detail below with reference to specific embodiments and the accompanying drawings.
Fig. 1 is the overall flow diagram of the speech feature extraction method for robust speech recognition of the present invention. As shown in Fig. 1, the proposed method consists of the following steps: pre-emphasize the speech signal; window the speech and compute its spectrum by short-time Fourier transform; square the spectrum to obtain the power spectrum; process the power spectrum with a filter bank to obtain the short-time power spectrum of the noisy speech; compute the medium-duration power spectrum of the noisy speech by frame averaging; apply asymmetric low-pass filtering to the medium-duration power spectrum to track the noise in the speech, while applying masking, to obtain the medium-duration power spectrum of clean speech; channel-average the ratio of the clean and noisy power spectra to smooth it; multiply the smoothed ratio by the short-time power spectrum of the noisy speech output by the filter bank, obtaining the short-time power spectrum of clean speech; apply energy normalization to the short-time power spectrum of clean speech to eliminate multiplicative noise; apply equal-loudness pre-emphasis to the normalized short-time power spectrum so that it conforms to human auditory perception; convert intensity to loudness by applying an exponential operation to the equal-loudness-weighted power spectrum, matching human physiological characteristics; take the inverse Fourier transform of the power spectrum after the intensity-to-loudness conversion; compute the cepstral coefficients from the result of the inverse Fourier transform; and finally apply mean normalization to the cepstral coefficients, obtaining the speech features of the present method. Each step of the invention is elaborated below.
One, voice signal is carried out to pre-emphasis
The purpose of pre-emphasis is to weaken low-frequency interference and emphasize the main components of the high-frequency signal. The following formula is usually applied to the speech samples:
$$y_t = \begin{cases} x_t - \alpha\, x_{t-1}, & \text{if } t > 0 \\ (1-\alpha)\, x_t, & \text{if } t = 0 \end{cases} \qquad (1)$$
where α is the pre-emphasis factor, x is the speech sample, y is the pre-emphasized speech sample, and t is the sample index.
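As a minimal sketch, formula (1) can be implemented as follows (NumPy-based; the function name and the example input are illustrative, not from the patent):

```python
import numpy as np

def pre_emphasis(x, alpha=0.97):
    """Pre-emphasis per formula (1): y_t = x_t - alpha*x_{t-1} for t > 0,
    and y_0 = (1 - alpha)*x_0."""
    x = np.asarray(x, dtype=float)
    y = np.empty_like(x)
    y[0] = (1.0 - alpha) * x[0]      # t = 0 branch
    y[1:] = x[1:] - alpha * x[:-1]   # t > 0 branch
    return y

# A constant (zero-frequency) signal is attenuated to (1 - alpha) of its
# level, illustrating how low-frequency content is weakened.
y = pre_emphasis(np.ones(5), alpha=0.97)
```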
Two, window the pre-emphasized speech signal and compute the spectrum by short-time Fourier transform
A speech signal is a continuous, time-varying signal. To analyze it, a short segment is usually taken within which the speech is assumed stationary; this segment is called a frame. To reduce truncation effects, the segment is usually multiplied by a window, commonly a Hanning or Hamming window. Taking the short-time Fourier transform of a windowed frame yields the spectrum of that frame. Specifically: divide the speech into frames, with a frame length of 20 ms to 30 ms and a frame shift of 10 ms to 15 ms; window each frame with a Hanning or Hamming window; take the short-time Fourier transform of the windowed speech, either by the original Fourier transform formula or, after zero-padding the windowed speech to a power of two, by the fast Fourier transform, to obtain the speech spectrum.
Three, square the spectrum to obtain the power spectrum
To obtain the power spectrum P(w) of the speech signal, the real and imaginary parts after the short-time Fourier transform are squared and summed. The formula is:
$$P(w) = \mathrm{Re}[S(w)]^2 + \mathrm{Im}[S(w)]^2 \qquad (2)$$
where S(w) is the short-time Fourier spectrum, and Re[S(w)] and Im[S(w)] are its real and imaginary parts respectively.
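Steps two and three can be sketched together, assuming the parameters of the worked example later in the text (25 ms frames, 10 ms shift, Hamming window, zero-padding to 512 points); the function name is illustrative:

```python
import numpy as np

def framed_power_spectrum(x, fs=16000, frame_ms=25, hop_ms=10, nfft=512):
    """Steps two and three: frame, apply a Hamming window, zero-pad to
    nfft points, FFT, then P(w) = Re[S(w)]**2 + Im[S(w)]**2 (formula 2)."""
    frame_len = int(fs * frame_ms / 1000)   # 400 samples at 16 kHz
    hop = int(fs * hop_ms / 1000)           # 160 samples
    window = np.hamming(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    P = np.empty((n_frames, nfft // 2 + 1))
    for m in range(n_frames):
        frame = x[m * hop : m * hop + frame_len] * window
        S = np.fft.rfft(frame, n=nfft)      # rfft zero-pads to nfft
        P[m] = S.real ** 2 + S.imag ** 2    # formula (2)
    return P

# One second of a 1 kHz tone at 16 kHz: energy should peak at FFT bin
# 1000 / (16000 / 512) = 32.
x = np.sin(2 * np.pi * 1000 * np.arange(16000) / 16000)
P = framed_power_spectrum(x)
```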
Four, process the power spectrum with a filter bank
The human ear perceives speech differently at different frequencies: experiments show that below 1000 Hz perception is linear in frequency, while above 1000 Hz it is logarithmic. To simulate this perceptual characteristic of the human ear, a filter bank is usually used to transform the linear spectrum. The filter bank may be a Mel filter bank or a Gamma-tone filter bank, and the number of channels can be chosen according to the filter type.
In a preferred embodiment of the invention, a Gamma-tone filter bank is used. It has a number of channels whose centre frequencies are distributed linearly on the equivalent rectangular bandwidth (ERB) scale.
The short-time power spectrum of the noisy speech obtained by summing over the Gamma-tone filter bank is then:
$$P[m,l] = \sum_{k=0}^{K/2-1} \left| X[m, e^{jw_k}]\, H_l(e^{jw_k}) \right|^2 \qquad (3)$$
where m and l index the frame and the channel respectively, K is the size of the Fourier transform, and w_k = 2πk/K is the k-th discrete frequency, F_s being the sampling frequency of the speech signal. X[m, e^{jw_k}] is the spectral amplitude of frame m at frequency w_k, and H_l(e^{jw_k}) is the Gamma-tone filter value of channel l at that frequency.
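The following sketch applies formula (3) in matrix form. True Gamma-tone magnitude responses are more involved, so triangular weights centred at ERB-spaced frequencies stand in for |H_l|^2 here; that substitution, and all names and parameter values, are assumptions for illustration only:

```python
import numpy as np

def erb_space(low_hz, high_hz, n_points):
    """Frequencies spaced linearly on the ERB-number scale
    (Glasberg & Moore), as the text prescribes for the channel
    centre frequencies."""
    erb = lambda f: 21.4 * np.log10(1.0 + 0.00437 * f)
    inv = lambda e: (10.0 ** (e / 21.4) - 1.0) / 0.00437
    return inv(np.linspace(erb(low_hz), erb(high_hz), n_points))

def filterbank_power(P, fs=16000, n_channels=40):
    """Formula (3): fb[m, l] = sum_k P[m, k] * |H_l(k)|^2, with
    triangular weights standing in for the Gamma-tone responses."""
    n_bins = P.shape[1]
    freqs = np.linspace(0.0, fs / 2.0, n_bins)
    pts = erb_space(50.0, fs / 2.0, n_channels + 2)  # edges and centres
    H = np.zeros((n_channels, n_bins))
    for l in range(n_channels):
        lo, c, hi = pts[l], pts[l + 1], pts[l + 2]
        H[l] = np.clip(np.minimum((freqs - lo) / (c - lo),
                                  (hi - freqs) / (hi - c)), 0.0, None)
    return P @ H.T

P = np.ones((3, 257))        # flat power spectrum, 3 frames
fb = filterbank_power(P)     # (3, 40) channel powers
```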
Five, compute the medium-duration power spectrum by frame averaging
Because noise usually varies more slowly than speech, noise estimation requires a window longer than the usual analysis window. In the feature extraction method of the present invention, the average over several windows, obtained by frame averaging, stands in for one longer window. Such a long window cannot be used for all of the speech, however, because too long a window reduces the recognition rate. The frame-averaging formula for the medium-duration power spectrum of the noisy speech is:
$$Q[m,l] = \frac{1}{2M+1} \sum_{m'=m-M}^{m+M} P[m',l] \qquad (4)$$
where m and l index the frame and the channel respectively, and M is the number of frames taken on each side (before and after the current frame) when computing the medium-duration spectrum.
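A sketch of formula (4); the patent does not specify how edge frames are handled, so averaging over whichever neighbours exist is an assumption here:

```python
import numpy as np

def medium_duration(P, M=2):
    """Formula (4): Q[m, l] is the mean of P over frames m-M .. m+M;
    at the edges, only the frames that exist are averaged (assumed)."""
    n_frames = P.shape[0]
    Q = np.empty_like(P, dtype=float)
    for m in range(n_frames):
        lo, hi = max(0, m - M), min(n_frames, m + M + 1)
        Q[m] = P[lo:hi].mean(axis=0)
    return Q

P = np.arange(20, dtype=float).reshape(10, 2)
Q = medium_duration(P, M=2)
```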
Six, apply asymmetric filtering and masking-based noise suppression to the medium-duration power spectrum of the noisy speech
Because noise can vary quickly at some frequencies, accurate noise tracking requires treating the noise in different channels differently; an asymmetric low-pass filtering noise-suppression module with masking is therefore introduced here. The flow is shown in Fig. 2.
Fig. 2 is the structural flow chart of the asymmetric low-pass filtering noise-suppression module with masking in the present invention.
In Fig. 2, the first asymmetric low-pass filter is described by:
$$Q_{le}[m,l] = \begin{cases} \lambda_a\, Q_{le}[m-1,l] + (1-\lambda_a)\, Q[m,l], & \text{if } Q[m,l] \ge Q_{le}[m-1,l] \\ \lambda_b\, Q_{le}[m-1,l] + (1-\lambda_b)\, Q[m,l], & \text{if } Q[m,l] < Q_{le}[m-1,l] \end{cases} \qquad (5)$$
where λ_a and λ_b are adjustable parameters with values in (0, 1). The Q_le[m,l] obtained from this formula is subtracted from Q[m,l] and passed through a half-wave rectification block to give Q_o[m,l]. The subtraction and rectification are performed as in formula (6):
$$Q_o[m,l] = \max\big(Q[m,l] - Q_{le}[m,l],\; 0\big) \qquad (6)$$
Q_o[m,l] is fed both to the masking block and to a second asymmetric low-pass filter. The second filter is identical in structure to the first and is still given by formula (5), except that its input is Q_o[m,l] instead of Q[m,l] and its output is Q_f[m,l] instead of Q_le[m,l]. The value Q_f[m,l] produced by the second filter serves as the spectral floor, i.e. the minimum value of the power spectrum; its purpose is to prevent the combined effect of the asymmetric filtering and the masking from producing output values so small that they introduce unnecessary musical noise. Meanwhile, Q_o[m,l] passes through the masking block to give Q_tm[m,l]; this step is described in detail below. Q_tm[m,l] and Q_f[m,l] are fed together into a maximum block, which computes R_sp[m,l] by:
$$R_{sp}[m,l] = \max\big(Q_{tm}[m,l],\; Q_f[m,l]\big) \qquad (7)$$
Finally, a selection switch determines the output value R[m,l]. The switch is governed by:
$$R[m,l] = \begin{cases} R_{sp}[m,l], & \text{if } Q[m,l] \ge c\, Q_{le}[m,l] \\ Q_f[m,l], & \text{if } Q[m,l] < c\, Q_{le}[m,l] \end{cases} \qquad (8)$$
where c is an adjustable parameter, e.g. c = 2. The meaning of this formula is: if the medium-duration power of a speech segment is not at least c = 2 times its own spectral floor, the segment is taken to be silence, and the output should be the spectral floor.
The processing described above is carried out by the asymmetric low-pass filtering noise-suppression module with masking shown in Fig. 2. The masking block of Fig. 2 is now described in detail; its structure is shown in Fig. 3. First, the input Q_o[m,l] passes through a MAX block to give Q_p[m,l]:
$$Q_p[m,l] = \max\big(\lambda_t\, Q_p[m-1,l],\; Q_o[m,l]\big) \qquad (9)$$
where λ_t is a forgetting factor with values in (0, 1). The final output Q_tm[m,l] of the masking block is determined by a selection switch described by:
$$Q_{tm}[m,l] = \begin{cases} Q_o[m,l], & \text{if } Q_o[m,l] \ge \lambda_t\, Q_p[m-1,l] \\ \mu_t\, Q_p[m-1,l], & \text{if } Q_o[m,l] < \lambda_t\, Q_p[m-1,l] \end{cases} \qquad (10)$$
where μ_t is the corresponding parameter, with values in (0, 1). The output Q_tm[m,l] of the masking block and the output Q_f[m,l] of the second asymmetric filter pass through the maximum block of formula (7) to give R_sp[m,l]; finally, R_sp[m,l] and the spectral floor Q_f[m,l] pass through the selection switch of formula (8), yielding R[m,l], the result of applying asymmetric filtering and masking-based noise suppression to the medium-duration power spectrum of the noisy speech.
The output R[m,l] computed as described above, after asymmetric filtering and masking, represents the medium-duration power spectrum of clean speech. Its ratio to the frame-averaged medium-duration power spectrum Q[m,l] of the noisy speech describes the proportion of clean speech power within the noisy speech power spectrum; we denote it H[m,l]:
$$H[m,l] = \frac{R[m,l]}{Q[m,l]} \qquad (11)$$
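The whole module of formulas (5) through (11) can be sketched as follows, using the parameter values from the worked example later in the text; the initial filter and peak states are assumptions, since the patent does not specify them:

```python
import numpy as np

def asym_lowpass(Q, lam_a, lam_b):
    """Formula (5): first-order low-pass over frames that rises slowly
    (lam_a) and falls quickly (lam_b). The first frame initialises the
    state (assumed)."""
    out = np.empty_like(Q, dtype=float)
    out[0] = Q[0]
    for m in range(1, len(Q)):
        rise = Q[m] >= out[m - 1]
        out[m] = np.where(rise,
                          lam_a * out[m - 1] + (1 - lam_a) * Q[m],
                          lam_b * out[m - 1] + (1 - lam_b) * Q[m])
    return out

def anti_noise_gain(Q, lam_a=0.999, lam_b=0.5, lam_t=0.85, mu_t=0.2, c=2.0):
    """Formulas (5)-(11): noise-floor tracking, half-wave rectification,
    temporal masking, speech/silence switching, and the final ratio
    H[m, l] = R[m, l] / Q[m, l]."""
    Q_le = asym_lowpass(Q, lam_a, lam_b)            # (5) noise estimate
    Q_o = np.maximum(Q - Q_le, 0.0)                 # (6) rectification
    Q_f = asym_lowpass(Q_o, lam_a, lam_b)           # spectral floor
    Q_tm = np.empty_like(Q_o)
    Q_tm[0] = Q_o[0]
    Q_p = Q_o[0].copy()                             # peak state (assumed init)
    for m in range(1, len(Q_o)):
        thresh = lam_t * Q_p                        # lam_t * Q_p[m-1]
        Q_tm[m] = np.where(Q_o[m] >= thresh, Q_o[m], mu_t * Q_p)  # (10)
        Q_p = np.maximum(lam_t * Q_p, Q_o[m])       # (9)
    R_sp = np.maximum(Q_tm, Q_f)                    # (7)
    R = np.where(Q >= c * Q_le, R_sp, Q_f)          # (8)
    return R / np.maximum(Q, 1e-20)                 # (11)

Q = np.ones((10, 1))
Q[5] = 100.0        # a loud burst over a constant background
H = anti_noise_gain(Q)
```

On this toy input the constant background is classified as silence (gain near zero) while the burst passes through almost unchanged.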
Seven, channel averaging and integration of the noise suppression
Because thresholds differ from channel to channel, while processing is usually based on a whole speech segment, smoothing across channels is necessary. We perform channel averaging with the following formula to obtain the channel-averaged gain H_s[m,l]:
$$H_s[m,l] = \frac{1}{l_2 - l_1 + 1} \sum_{l'=l_1}^{l_2} H[m,l'] \qquad (12)$$
where l_2 = min(l+N, L) and l_1 = max(l−N, 1), L is the number of filter channels, and N is the number of channels looked at on each side when averaging. The channel-averaged gain H_s[m,l] is used to modulate the short-time power spectrum of the noisy speech to obtain the short-time power spectrum of clean speech:
$$T[m,l] = P[m,l]\, H_s[m,l] \qquad (13)$$
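Formulas (12) and (13) in a minimal sketch (0-indexed channels here, while the text indexes channels from 1):

```python
import numpy as np

def channel_smooth(H, N=4):
    """Formula (12): average the gain over channels l-N .. l+N,
    clipped to the valid channel range."""
    L = H.shape[1]
    Hs = np.empty_like(H)
    for l in range(L):
        lo, hi = max(0, l - N), min(L, l + N + 1)
        Hs[:, l] = H[:, lo:hi].mean(axis=1)
    return Hs

H = np.arange(10.0).reshape(1, 10)   # a gain rising across 10 channels
Hs = channel_smooth(H, N=4)
# Formula (13) is then simply T = P * Hs.
```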
Eight, apply energy normalization to the short-time power spectrum of clean speech, to eliminate multiplicative noise
Traditional feature extraction algorithms such as MFCC use a logarithmic operation to match human physiological properties; this turns the noise introduced by multiplicative operations in the algorithm into additive information, which mean normalization can eventually remove. The feature extraction method of the present invention, however, uses an exponential operation to match human physiology, so the noise introduced by multiplicative operations cannot be removed by mean normalization; this step is added precisely to eliminate that multiplicative noise.
Because the feature extraction method of the present invention operates online, the mean over all frames is not available. The present invention therefore replaces the mean over the whole utterance with a dynamically updated mean:
$$\mu[m] = \lambda_\mu\, \mu[m-1] + \frac{1-\lambda_\mu}{L} \sum_{l=0}^{L-1} T[m,l] \qquad (14)$$
where L is the number of filter channels and λ_μ is a forgetting factor with values in (0, 1). Normalizing the clean-speech short-time power spectrum of each channel by this mean removes the effect of multiplicative noise. The formula for this step is:
$$U[m,l] = k\, \frac{T[m,l]}{\mu[m]} \qquad (15)$$
where k is an arbitrary constant. With this online processing, the online features achieve the effect of offline processing.
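Formulas (14) and (15) as a sketch; the worked example says the initial mean comes from data-set statistics, so the default here (the first frame's channel mean) and the value of the forgetting factor are assumptions:

```python
import numpy as np

def energy_normalise(T, lam_mu=0.999, k=1.0, mu0=None):
    """Formula (14): running mean over the L channels with forgetting
    factor lam_mu; formula (15): U = k * T / mu."""
    n_frames, L = T.shape
    mu = T[0].mean() if mu0 is None else mu0
    U = np.empty_like(T)
    for m in range(n_frames):
        mu = lam_mu * mu + (1.0 - lam_mu) * T[m].mean()  # (14)
        U[m] = k * T[m] / mu                             # (15)
    return U

T = 5.0 * np.ones((3, 4))   # stationary input
U = energy_normalise(T)
```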
Nine, apply equal-loudness pre-emphasis to the energy-normalized short-time power spectrum of clean speech
Sounds of different frequencies that are equally loud differ in sound pressure. To compensate for this frequency-dependent deviation of the human ear, equal-loudness pre-emphasis is applied to the power spectrum. Each channel is usually compensated at its centre frequency, and many compensation formulas exist; the equal-loudness weighting adopted in the present invention is:
$$E(w_l) = \frac{(w_l^2 + 1.44\times 10^6)\; w_l^4}{(w_l^2 + 1.6\times 10^5)^2\, (w_l^2 + 9.61\times 10^6)} \qquad (16)$$
where w denotes frequency, l the channel index, and w_l the frequency of channel l, i.e. the centre frequency of channel l.
The equal-loudness pre-emphasis of the energy-normalized clean-speech short-time power spectrum uses:
$$O[m,l] = U[m,l]\, E(w_l) \qquad (17)$$
where m and l are the frame and channel indices respectively.
Ten, apply the exponential operation to the clean-speech short-time power spectrum after equal-loudness pre-emphasis
To better match the human auditory model and convert intensity to loudness, the power spectrum must be compressed nonlinearly. Traditional PLP features use a cube-root nonlinearity, and traditional MFCC a logarithmic one; the feature extraction method of the present invention uses an exponential (power-law) nonlinearity:
$$L[m,l] = O[m,l]^{\theta} \qquad (18)$$
where θ is the exponent of the nonlinearity.
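Formulas (16) through (18) combine into a short sketch (θ = 1/15 follows the worked example; the text does not state whether w_l is in Hz or rad/s, so treating it as given is an assumption):

```python
import numpy as np

def equal_loudness(w):
    """Formula (16): equal-loudness weight at channel frequency w."""
    w2 = w ** 2
    return ((w2 + 1.44e6) * w ** 4) / ((w2 + 1.6e5) ** 2 * (w2 + 9.61e6))

def loudness_compress(U, centre_freqs, theta=1.0 / 15.0):
    """Formula (17): weight each channel by E(w_l);
    formula (18): apply the power-law nonlinearity O**theta."""
    O = U * equal_loudness(np.asarray(centre_freqs, dtype=float))
    return O ** theta

# At very high frequencies E(w) tends to 1, so a unit spectrum maps to ~1.
Lm = loudness_compress(np.ones((2, 3)), [1e9, 1e9, 1e9])
```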
Eleven, take the inverse Fourier transform of the clean-speech short-time power spectrum after the exponential nonlinearity
The inverse Fourier transform of the clean-speech short-time power spectrum after the exponential nonlinearity is taken in order to compute the cepstral coefficients of the speech signal, and thereby the speech features. The inverse transform here uses the basic inverse Fourier transform method.
Twelve, compute the cepstral coefficients of the signal
To obtain the cepstral coefficients, the method of the present invention first computes the linear prediction coefficients with the Durbin recursion, then obtains the corresponding cepstral coefficients from the linear prediction coefficients by the recursion:
$$c_n = \begin{cases} a_n + \dfrac{1}{n}\displaystyle\sum_{m=1}^{n-1} m\, c_m\, a_{n-m}, & \text{if } 1 \le n \le p \\[2ex] \dfrac{1}{n}\displaystyle\sum_{m=n-p}^{n-1} m\, c_m\, a_{n-m}, & \text{if } n > p \end{cases} \qquad (19)$$
where the a_n are the linear prediction coefficients, computed by the Durbin recursion (along with the reflection coefficients) from the autocorrelation given by the inverse Fourier transform of step eleven; n is the cepstral coefficient index and p is the model order.
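A sketch of step twelve under the standard formulations: the Levinson-Durbin recursion from the autocorrelation, followed by the LPC-to-cepstrum recursion of formula (19). The exact conventions (sign of the predictor, index ranges) are assumptions where the text is ambiguous:

```python
import numpy as np

def levinson_durbin(r, p):
    """Levinson-Durbin recursion: autocorrelation r[0..p] -> LPC
    coefficients a[1..p] in the convention x[n] ~ sum_m a_m x[n-m]."""
    a = np.zeros(p + 1)
    e = r[0]
    for i in range(1, p + 1):
        k = (r[i] - np.dot(a[1:i], r[i - 1:0:-1])) / e   # reflection coeff.
        a_new = a.copy()
        a_new[i] = k
        a_new[1:i] = a[1:i] - k * a[i - 1:0:-1]
        a = a_new
        e *= (1.0 - k * k)
    return a[1:]

def lpc_to_cepstrum(a, n_ceps):
    """Formula (19): cepstral coefficients c_1..c_n from the LPC a_1..a_p."""
    p = len(a)
    c = np.zeros(n_ceps + 1)
    for n in range(1, n_ceps + 1):
        acc = a[n - 1] if n <= p else 0.0
        for m in range(max(1, n - p), n):
            acc += (m / n) * c[m] * a[n - m - 1]
        c[n] = acc
    return c[1:]

# AR(1)-like autocorrelation r[i] = 0.5**i: the true cepstrum of
# 1 / (1 - 0.5 z^-1) is c_n = 0.5**n / n.
r = np.array([1.0, 0.5, 0.25, 0.125])
a = levinson_durbin(r, 2)      # ~ [0.5, 0.0]
c = lpc_to_cepstrum(a, 3)      # ~ [0.5, 0.125, 0.5**3 / 3]
```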
Thirteen, apply mean normalization to the cepstral coefficients
Although energy normalization was performed in step 8, mean normalization is still necessary; at the very least it does no harm. Mean normalization computes, for each dimension of the cepstral coefficients, the mean over all frames, and subtracts the mean of the corresponding dimension from that dimension of every frame's cepstral coefficients. Because the feature extraction method of the present invention is online, the mean is taken over all frames up to the current frame.
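Step thirteen, the online cepstral mean normalization, can be sketched as a cumulative mean (each frame subtracts the mean of all frames up to and including itself; including the current frame is an assumption):

```python
import numpy as np

def online_cmn(C):
    """Online mean normalization: subtract from each frame the running
    mean of every cepstral dimension over frames 0..m."""
    cum = np.cumsum(C, axis=0)
    counts = np.arange(1, len(C) + 1)[:, None]
    return C - cum / counts

C = np.array([[2.0, 0.0],
              [4.0, 2.0]])
Cn = online_cmn(C)
```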
An example of the speech feature extraction method for robust speech recognition of the present invention is now described with reference to the drawings, for speech sampled at 16 kHz:
1. Pre-emphasize the speech signal, with pre-emphasis factor α = 0.97. The system function is as in formula (1).
2. Use a frame length of 25 ms and a frame shift of 10 ms, apply a Hamming window, zero-pad each 400-sample frame to 512 points, and compute the speech spectrum by fast Fourier transform.
3. From the spectrum, compute the speech power spectrum according to formula (2).
4. Process the power spectrum with a Gamma-tone filter bank of 40 channels, using formula (3).
5. Compute the medium-duration power spectrum by frame averaging, using formula (4) with M = 2: the medium-duration power of the noisy speech replaces each single frame's power with the average over the current frame, the two frames before it and the two frames after it, giving a duration of [(2M+1)−1] × 10 ms + 25 ms = 65 ms.
6. Apply asymmetric filtering to the power spectrum to track the noise in the speech, while applying masking, to obtain the clean-speech power spectrum. In this step the asymmetric low-pass filtering noise-suppression module with masking computes according to formulas (5) through (11) of the implementation, with the following parameter values:
λ_a = 0.999, λ_b = 0.5
c = 2
λ_t = 0.85, μ_t = 0.2
7. Channel-average the ratio of the clean and noisy power spectra to smooth it, using formula (12) with N = 4 (looking 4 channels on each side, i.e. smoothing over 9 channel values). Multiply the smoothed ratio by the short-time power spectrum of the noisy speech output by the filter bank, i.e. formula (13), to obtain the short-time power spectrum of clean speech.
8. Apply energy normalization to the short-time power spectrum of clean speech to eliminate multiplicative noise, as in formula (15). The mean is estimated dynamically, as in formula (14), with the initial value obtained from statistics over the data set.
9. Apply equal-loudness pre-emphasis to the power spectrum so that it conforms to human auditory perception.
10. Apply the exponential operation to the power spectrum so that it matches human physiological characteristics; the exponent θ is chosen as 1/15 here.
11. Take the inverse Fourier transform of the power spectrum; the basic inverse Fourier transform formula can be used, and since the number of points is small, the computation is light.
12. Compute the cepstral coefficients of the signal, with 12 linear prediction coefficients and 12 cepstral coefficients, using formula (19).
13. Apply mean normalization to the cepstral coefficients, finally obtaining the speech features of the present method.
Comparison of the proposed feature extraction method with common feature extraction methods:
The proposed feature extraction method is used to extract speech features on the 863 desktop speech corpus; features are also extracted from the same corpus with the PLP feature extraction method and with the Advanced Front-End (AFE) of the European Telecommunications Standards Institute (ETSI). With these three feature sets, acoustic models are trained under identical conditions using the HTK toolkit. Then 1000 clean read recordings are selected, simulated white noise is added, and features are extracted with the three methods above. In addition, a set of random spontaneous-speech recordings is annotated, giving 7072 clean recordings and 360 noisy recordings, from which speech features are again extracted with the three methods above.
Speech recognition is carried out with the above acoustic models and their corresponding features; the language model is the same 3-gram model in all cases, and the recognizer is the decoder in the HTK toolkit. The word error rate (WER) is used to assess recognition performance; PNPLP denotes the feature extraction algorithm of the present invention. WER is computed as WER = (S + D + I) / N, where S, D and I are the numbers of substituted, deleted and inserted words and N is the number of words in the reference transcription.
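The WER used in this comparison is the standard edit-distance word error rate; a minimal Python sketch, not part of the patent itself:

```python
def wer(ref, hyp):
    """Word error rate: (substitutions + deletions + insertions) / reference
    length, computed as the word-level edit distance between ref and hyp."""
    ref, hyp = ref.split(), hyp.split()
    # Dynamic-programming edit distance over word sequences.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                           # deleting all reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j                           # inserting all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)
```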
Under the simulated-white-noise test condition, the performance of the various features is shown in Table 1. Table 1 shows that on clean, noise-free speech the PLP features perform well, but as the noise increases the PLP performance gradually degrades. The ETSI anti-noise features (AFE) show some benefit under noise, but the noise robustness of the feature extraction method of the present invention is far superior to the ETSI anti-noise algorithm.
Table 2 gives the experimental results of the various feature extraction algorithms on the real test set. The table shows that the noise robustness of the feature extraction method of the present invention is outstanding, much better than that of ETSI. On the clean-speech set the proposed anti-noise algorithm declines slightly relative to the classical PLP algorithm, but compared with the ETSI anti-noise algorithm it is still much better.
Table 1
Table 2

WER     PLP      AFE      PNPLP
clean   11.64%   13.68%   12.07%
noisy   35.89%   34.21%   33.36%
The specific embodiments described above further explain the objectives, technical solutions and beneficial effects of the present invention. It should be understood that the foregoing are merely specific embodiments of the invention and do not limit it; any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (9)

1. A speech feature extraction method for robust speech recognition, characterized in that the method comprises:
Step 1, obtain the power spectrum of the speech signal;
Step 2, pass the obtained power spectrum through a filter bank to obtain the short-time power spectrum of the noisy speech;
Step 3, from the short-time power spectrum of the noisy speech, compute the medium-duration power spectrum of the noisy speech by frame averaging;
Step 4, apply asymmetric filtering and masking-based noise suppression to the medium-duration power spectrum of the noisy speech to obtain the medium-duration power spectrum of the clean speech;
Step 5, obtain the short-time power spectrum of the clean speech from the medium-duration power spectrum of the clean speech, the medium-duration power spectrum of the noisy speech and the short-time power spectrum of the noisy speech;
Step 6, apply energy normalization to the short-time power spectrum of the clean speech to eliminate multiplicative noise;
Step 7, apply equal-loudness pre-emphasis to the short-time power spectrum of the clean speech from which multiplicative noise has been removed;
Step 8, apply an exponential nonlinear operation to the short-time power spectrum of the clean speech after equal-loudness pre-emphasis;
Step 9, apply an inverse Fourier transform to the short-time power spectrum of the clean speech after the exponential nonlinear operation to compute the cepstral coefficients, apply mean normalization to the cepstral coefficients, and finally obtain the speech features;
wherein obtaining the power spectrum of the speech signal in step 1 further comprises the following:
Step 11, apply pre-emphasis to the speech signal using formula (1):

y_t = x_t − α·x_{t−1},   if t > 0
y_t = (1 − α)·x_t,       if t = 0        (1)

where α is the pre-emphasis coefficient, x is a speech sample, y is the sample value after pre-emphasis, and t is the sample index;
Step 12, apply a window (a Hanning or Hamming window) to each frame of the pre-emphasized speech, then take the short-time Fourier transform of the windowed speech, either using the original Fourier-transform formula or by zero-padding the windowed speech to an integer power of 2 and using the fast Fourier transform;
Step 13, square and sum the real and imaginary parts after the short-time Fourier transform to obtain the power spectrum of the speech signal, as shown in formula (2):

P(w) = Re[S(w)]² + Im[S(w)]²        (2)

where P(w) is the power spectrum of the speech signal, S(w) is the short-time Fourier spectrum, and Re[S(w)] and Im[S(w)] are the real and imaginary parts of the short-time Fourier spectrum respectively.
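The pre-emphasis of formula (1) and the power spectrum of formula (2) can be sketched in NumPy as follows; the Hamming window, α = 0.97 and the 512-point zero-padded FFT are illustrative choices rather than values fixed by the claim:

```python
import numpy as np

def preemphasize(x, alpha=0.97):
    """Formula (1): y_t = x_t - alpha*x_{t-1} for t > 0, y_0 = (1-alpha)*x_0.
    alpha = 0.97 is a typical choice, not a value fixed by the patent."""
    x = np.asarray(x, dtype=float)
    y = np.empty_like(x)
    y[0] = (1.0 - alpha) * x[0]
    y[1:] = x[1:] - alpha * x[:-1]
    return y

def frame_power_spectrum(frame, nfft=512):
    """Formula (2): P(w) = Re[S(w)]^2 + Im[S(w)]^2, where S(w) comes from a
    zero-padded FFT of the windowed frame (Hamming window assumed)."""
    windowed = np.asarray(frame, dtype=float) * np.hamming(len(frame))
    S = np.fft.rfft(windowed, n=nfft)   # short-time Fourier spectrum
    return S.real ** 2 + S.imag ** 2
```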
2. The speech feature extraction method for robust speech recognition according to claim 1, characterized in that the filter bank adopted in step 2 is a Mel filter bank or a Gammatone filter bank, the number of channels being chosen according to the filter type; the short-time power spectrum of the noisy speech is obtained by summation over the Gammatone filter bank, as shown in formula (3):

P[m,l] = Σ_{k=0}^{K/2−1} |X[m, e^{jw_k}]·H_l(e^{jw_k})|²        (3)

where P[m,l] is the short-time power spectrum of the noisy speech, m and l denote the frame and filter-bank-channel indices respectively, K is the number of Fourier-transform points, w_k = 2πk/K is the k-th discrete frequency (F_s denotes the sampling frequency of the speech signal), X[m, e^{jw_k}] is the magnitude of the m-th frame of the speech signal at frequency w_k, and H_l(e^{jw_k}) is the Gammatone filter value of the l-th channel at frequency w_k.
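Once the Gammatone magnitude responses are precomputed, the summation of formula (3) reduces to a weighted sum over FFT bins; a minimal sketch in which the response matrix H is assumed to be given (its design is outside this claim):

```python
import numpy as np

def filterbank_power(power_spec_frames, H):
    """Formula (3): P[m,l] = sum_k |X[m,k] * H_l(k)|^2.
    power_spec_frames holds |X[m,k]|^2 per frame (shape: frames x bins);
    H is an assumed precomputed Gammatone magnitude response
    (shape: channels x bins)."""
    # Weight every frame's power spectrum by each channel's squared
    # magnitude response and sum over the frequency bins k.
    return power_spec_frames @ (H ** 2).T    # shape: frames x channels
```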
3. The speech feature extraction method for robust speech recognition according to claim 1, characterized in that in step 3 the medium-duration power spectrum of the noisy speech is computed by frame averaging, as shown in formula (4):

Q[m,l] = (1 / (2M+1)) · Σ_{m'=m−M}^{m+M} P[m',l]        (4)

where Q[m,l] is the medium-duration power spectrum of the noisy speech, m and l denote the frame and filter-bank-channel indices respectively, M is the number of frames taken forward and backward when computing the medium-duration spectrum, and P[m',l] is the short-time power spectrum of the noisy speech at frame m'.
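The frame averaging of formula (4) is a moving average along the frame axis; a sketch in which edge frames are averaged over the available neighbours, a boundary-handling assumption the claim leaves open:

```python
import numpy as np

def medium_duration_power(P, M=2):
    """Formula (4): Q[m,l] = mean of P over frames m-M .. m+M.
    The claim divides by 2M+1; here edge frames are averaged over the
    available neighbours only (a boundary-handling assumption)."""
    num_frames = P.shape[0]
    Q = np.empty_like(P, dtype=float)
    for m in range(num_frames):
        lo, hi = max(0, m - M), min(num_frames, m + M + 1)
        Q[m] = P[lo:hi].mean(axis=0)
    return Q
```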
4. The speech feature extraction method for robust speech recognition according to claim 1, characterized in that applying asymmetric filtering and masking-based noise suppression to the medium-duration power spectrum of the noisy speech in step 4 specifically comprises the following steps:
Step 41, filter the obtained medium-duration power spectrum of the noisy speech with a first asymmetric low-pass filter, and subtract the output of the first asymmetric low-pass filter from the medium-duration power spectrum of the noisy speech; the first asymmetric low-pass filter is expressed by formula (5):

Q_le[m,l] = λ_a·Q_le[m−1,l] + (1 − λ_a)·Q[m,l],   if Q[m,l] ≥ Q_le[m−1,l]
Q_le[m,l] = λ_b·Q_le[m−1,l] + (1 − λ_b)·Q[m,l],   if Q[m,l] < Q_le[m−1,l]        (5)

where m and l denote the frame and filter-bank-channel indices respectively, Q_le[m,l] is the output of the first asymmetric low-pass filter, Q[m,l] is the medium-duration power spectrum of the noisy speech, and λ_a and λ_b are adjustable parameters with values in (0, 1);
Step 42, pass the subtraction result through a half-wave rectification module to obtain Q_o[m,l], and send Q_o[m,l] both to a masking module and to a second asymmetric low-pass filter for processing; the second asymmetric low-pass filter is identical to the first asymmetric low-pass filter, and the output of the second asymmetric filter serves as the spectral floor power; the half-wave rectification module obtains Q_o[m,l] by the formula:

Q_o[m,l] = max(Q[m,l] − Q_le[m,l], 0)        (6)
Step 43, the Q_o[m,l] obtained from the half-wave rectification module yields the result Q_tm[m,l] after processing by the masking module, and yields the result Q_f[m,l] after processing by the second asymmetric low-pass filter; Q_tm[m,l] and Q_f[m,l] are then input to a maximum module to obtain the result R_sp[m,l]; the maximum module is as shown in formula (7):

R_sp[m,l] = max(Q_tm[m,l], Q_f[m,l])        (7)
Step 44, determine the medium-duration power spectrum of the clean speech R[m,l] by a first selection switch, as shown in formula (8):

R[m,l] = R_sp[m,l],   if Q[m,l] ≥ c·Q_le[m,l]
R[m,l] = Q_f[m,l],    if Q[m,l] < c·Q_le[m,l]        (8)

where c is an adjustable parameter.
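The first asymmetric low-pass filter of formula (5) and the half-wave rectification of formula (6) can be sketched as below; the initialization of Q_le at the first frame is an assumption, since the claim does not fix a starting value:

```python
import numpy as np

def asymmetric_lowpass(Q, lam_a=0.999, lam_b=0.5):
    """Formula (5): first-order recursion along the frame axis that rises
    slowly (lam_a) and falls quickly (lam_b), so the output tracks the
    noise floor. Q has shape (frames, channels)."""
    Qle = np.empty_like(Q, dtype=float)
    Qle[0] = 0.9 * Q[0]   # first-frame initialization: an assumption
    for m in range(1, Q.shape[0]):
        rising = Q[m] >= Qle[m - 1]
        lam = np.where(rising, lam_a, lam_b)
        Qle[m] = lam * Qle[m - 1] + (1.0 - lam) * Q[m]
    return Qle

def half_wave_rectify(Q, Qle):
    """Formula (6): Q_o[m,l] = max(Q[m,l] - Q_le[m,l], 0)."""
    return np.maximum(Q - Qle, 0.0)
```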
5. The speech feature extraction method for robust speech recognition according to claim 4, characterized in that the operation of the masking module comprises the following:
The Q_o[m,l] obtained from the half-wave rectification module passes through the MAX module of the masking module to obtain Q_p[m,l], as shown in formula (9):

Q_p[m,l] = max(λ_t·Q_p[m−1,l], Q_o[m,l])        (9)
where λ_t is a forgetting factor with value in (0, 1); the final output Q_tm[m,l] of the masking module is determined by a second selection switch, as shown in formula (10):

Q_tm[m,l] = Q_o[m,l],         if Q_o[m,l] ≥ λ_t·Q_p[m−1,l]
Q_tm[m,l] = μ_t·Q_p[m−1,l],   if Q_o[m,l] < λ_t·Q_p[m−1,l]        (10)

where μ_t is the corresponding parameter with value in (0, 1); after the output Q_tm[m,l] of the masking module passes through the selection switch described by formula (8), the final result of applying asymmetric filtering and masking-based noise suppression to the medium-duration power spectrum of the noisy speech is R[m,l].
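The peak tracking of formula (9) and the masking switch of formula (10) form a short per-channel recursion; a sketch that assumes the peak tracker starts from zero:

```python
import numpy as np

def temporal_masking(Qo, lam_t=0.85, mu_t=0.2):
    """Formulas (9)-(10): online peak tracking Q_p and the masking switch
    Q_tm. Onsets are kept, decaying tails are attenuated to mu_t * Q_p.
    Qo has shape (frames, channels)."""
    Qp = np.zeros(Qo.shape[1])           # peak tracker Q_p, start-from-zero assumption
    Qtm = np.empty_like(Qo, dtype=float)
    for m in range(Qo.shape[0]):
        above = Qo[m] >= lam_t * Qp      # formula (10): compare with previous peak
        Qtm[m] = np.where(above, Qo[m], mu_t * Qp)
        Qp = np.maximum(lam_t * Qp, Qo[m])   # formula (9): update the peak
    return Qtm
```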
6. The speech feature extraction method for robust speech recognition according to claim 1, characterized in that obtaining the short-time power spectrum of the clean speech in step 5 comprises the following:
Step 51, compute the ratio H[m,l] of the medium-duration power spectrum of the clean speech to the medium-duration power spectrum of the noisy speech, as shown in formula (11):

H[m,l] = R[m,l] / Q[m,l]        (11)

where R[m,l] is the medium-duration power spectrum of the clean speech and Q[m,l] is the medium-duration power spectrum of the noisy speech;
Step 52, carry out channel averaging to obtain the channel-averaged weight H_s[m,l], as shown in formula (12):

H_s[m,l] = (1 / (l_2 − l_1 + 1)) · Σ_{l'=l_1}^{l_2} H[m,l']        (12)

where l_2 = min(l+N, L), l_1 = max(l−N, 1), L is the number of filter channels, and N is the number of channels looked at forward and backward in the channel averaging;
Step 53, modulate the short-time power spectrum of the noisy speech P[m,l] with the channel-averaged weight H_s[m,l] to obtain the short-time power spectrum of the clean speech T[m,l], as shown in formula (13):

T[m,l] = P[m,l]·H_s[m,l]        (13).
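Formulas (11)–(13) can be sketched as follows; the 0-based channel indexing and the small eps guarding against division by zero are implementation assumptions:

```python
import numpy as np

def smooth_gain(R, Q, N=4, eps=1e-20):
    """Formulas (11)-(12): per-channel gain H = R/Q, averaged over the
    2N+1 neighbouring channels (window clipped at the band edges).
    The claim indexes channels from 1; 0-based indexing is used here."""
    H = R / np.maximum(Q, eps)   # eps avoids division by zero
    L = H.shape[1]
    Hs = np.empty_like(H)
    for l in range(L):
        l1, l2 = max(0, l - N), min(L - 1, l + N)
        Hs[:, l] = H[:, l1:l2 + 1].mean(axis=1)
    return Hs

def clean_short_time_power(P, Hs):
    """Formula (13): T[m,l] = P[m,l] * H_s[m,l]."""
    return P * Hs
```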
7. The speech feature extraction method for robust speech recognition according to claim 1, characterized in that the energy normalization of the short-time power spectrum of the clean speech in step 6 is as shown in formula (15):

U[m,l] = k·T[m,l] / μ[m]        (15)

where k is an arbitrary constant, T[m,l] is the short-time power spectrum of the clean speech, and μ[m] is as shown in formula (14):

μ[m] = λ_μ·μ[m−1] + ((1 − λ_μ) / L)·Σ_{l=0}^{L−1} T[m,l]        (14)

where L is the number of filter channels and λ_μ is a forgetting factor with value in (0, 1).
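The running-mean normalization of formulas (14)–(15) in a minimal sketch; the initial mean mu0 stands in for the training-data statistic mentioned in the description:

```python
import numpy as np

def power_normalize(T, lam_mu=0.999, mu0=1.0, k=1.0):
    """Formulas (14)-(15): running mean mu[m] of the per-frame channel
    average, then U[m,l] = k * T[m,l] / mu[m]. mu0 is a placeholder for
    the initial value estimated from training data."""
    mu = mu0
    U = np.empty_like(T, dtype=float)
    for m in range(T.shape[0]):
        mu = lam_mu * mu + (1.0 - lam_mu) * T[m].mean()  # formula (14)
        U[m] = k * T[m] / mu                             # formula (15)
    return U
```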
8. The speech feature extraction method for robust speech recognition according to claim 1, characterized in that in step 7 the equal-loudness pre-emphasis of the short-time power spectrum of the clean speech is as shown in formula (17):

O[m,l] = U[m,l]·E(w_l)        (17)

where m and l are the frame and channel indices respectively, U[m,l] is the short-time power spectrum of the speech after noise suppression, and E(w_l) is as shown in formula (16):

E(w_l) = ((w_l² + 1.44×10⁶)·w_l⁴) / ((w_l² + 1.6×10⁵)²·(w_l² + 9.61×10⁶))        (16)

where w denotes frequency and w_l is the centre frequency of the l-th channel.
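The equal-loudness weighting of formulas (16)–(17), using the constants exactly as given in the claim:

```python
import numpy as np

def equal_loudness_weight(w):
    """Formula (16): the equal-loudness curve E(w) with the constants given
    in the claim; w is the centre frequency of a channel."""
    w2 = w ** 2
    return ((w2 + 1.44e6) * w ** 4) / ((w2 + 1.6e5) ** 2 * (w2 + 9.61e6))

def apply_equal_loudness(U, w_centres):
    """Formula (17): O[m,l] = U[m,l] * E(w_l), applied per channel."""
    return U * equal_loudness_weight(np.asarray(w_centres, dtype=float))
```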
9. The speech feature extraction method for robust speech recognition according to claim 1, characterized in that in step 8 the exponential nonlinear operation on the short-time power spectrum of the clean speech after equal-loudness pre-emphasis is as shown in formula (18):

L[m,l] = O[m,l]^θ        (18)

where θ is the exponent nonlinearity parameter and O[m,l] is the short-time power spectrum of the clean speech after equal-loudness pre-emphasis.
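The power-law nonlinearity of formula (18), with θ = 1/15 as in the described embodiment:

```python
import numpy as np

def power_law(O, theta=1.0 / 15.0):
    """Formula (18): L[m,l] = O[m,l]**theta, the power-law nonlinearity
    (theta = 1/15 in the described embodiment)."""
    return O ** theta
```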
CN201210449436.XA 2012-11-12 2012-11-12 Phonetic feature extracting method for robust voice recognition Active CN102982801B (en)


Publications (2)

Publication Number Publication Date
CN102982801A CN102982801A (en) 2013-03-20
CN102982801B (en) 2014-12-10

