CN108172214A - Wavelet-based speech recognition feature parameter extraction method in the Mel domain - Google Patents
Wavelet-based speech recognition feature parameter extraction method in the Mel domain
- Publication number
- CN108172214A CN108172214A CN201711439300.XA CN201711439300A CN108172214A CN 108172214 A CN108172214 A CN 108172214A CN 201711439300 A CN201711439300 A CN 201711439300A CN 108172214 A CN108172214 A CN 108172214A
- Authority
- CN
- China
- Prior art keywords
- voice signal
- window
- signal
- wavelet
- frame
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
- G10L15/04—Segmentation; Word boundary detection
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
- G10L15/08—Speech classification or search
- G10L15/10—Speech classification or search using distance or distortion measures between unknown speech and reference templates
- G10L2015/025—Phonemes, fenemes or fenones being the recognition units
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/18—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
- G10L25/21—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being power information
Abstract
The invention discloses a wavelet-based speech recognition feature parameter extraction method in the Mel domain. The input speech signal is first pre-processed; feature vectors reflecting the signal characteristics are then extracted; a reference model library of trained speech is established; candidate recognition results are obtained by comparison; and the candidates are finally processed with linguistic knowledge to obtain the recognition result. The invention proposes the parameter WPCC, in which wavelet filters replace the Mel filters and the discrete wavelet transform replaces the discrete cosine transform; the parameter gives good results for consonant and vowel recognition.
Description
Technical field
The present invention relates to the field of speech parameter generation methods, and in particular to a wavelet-based speech recognition feature parameter extraction method in the Mel domain.
Background technology
In speech recognition, signal processing has generally relied on the Fourier transform. The Fourier transform has an intuitive physical meaning and is simple to compute, so it is widely used for spectral analysis. It has a serious shortcoming, however: it describes only the statistical properties of the spectrum. As an integral over the whole time domain, the spectrum characterizes the overall strength of each frequency component in the signal but cannot show when those components occur; it has no capability for local analysis and carries no transient information. When analyzing time-varying or non-stationary speech signals (consonants in particular), one wants to know the frequency-domain behavior of the signal near each instant. The one-dimensional time-domain signal is therefore mapped onto a two-dimensional time-frequency plane to observe its time-frequency characteristics, i.e., the phase space of the signal is constructed, forming a time-frequency analysis of the signal. The wavelet transform adapts its sampling step on the time-frequency plane to the frequency content: the step is small at high frequencies and large at low frequencies. Because the wavelet transform has local analysis capability in both time and frequency, it offers a significant advantage in speech signal processing.
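The multi-resolution behaviour described above can be illustrated with a minimal pure-Python Haar decomposition (an illustrative sketch only; the Haar wavelet and the toy signal are assumptions, not taken from the patent):

```python
def haar_step(x):
    """One orthonormal Haar analysis step: returns (approximation, detail),
    each half the input length."""
    a = [(x[2 * i] + x[2 * i + 1]) / 2 ** 0.5 for i in range(len(x) // 2)]
    d = [(x[2 * i] - x[2 * i + 1]) / 2 ** 0.5 for i in range(len(x) // 2)]
    return a, d

signal = [float(i % 4) for i in range(16)]  # toy 16-sample signal

# Recursing on the approximation halves the time resolution of the
# low-frequency band at each level, while the first detail band keeps
# the finest time resolution for the high frequencies.
a1, d1 = haar_step(signal)  # d1: 8 high-frequency coefficients, short time step
a2, d2 = haar_step(a1)      # d2: 4 mid-frequency coefficients
a3, d3 = haar_step(a2)      # a3, d3: 2 low-frequency coefficients, long time step
print(len(d1), len(d2), len(a3))  # 8 4 2
```

Because the step is orthonormal, signal energy is preserved across levels, which is what makes sub-band energies meaningful features.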
The Fourier transform handles stationary signals well but performs poorly on non-stationary ones. Consonants are rapidly changing signals in the time-frequency domain, for which the wavelet transform is the better choice. Farooq et al. [1] used wavelet packets to obtain local frequency-band features: the wavelet packet partitions the spectrum into multiple sub-bands and the sub-band energies serve as feature parameters; in plosive recognition the recognition rate was 10 percentage points higher than with the MFCC parameters. Noisy speech is clean speech with interference superimposed on the time-frequency plane; subtracting a constant during feature extraction, a value close to the white-noise spectral level of the clean-speech features, compensates for it [2]. Farooq [3] later partitioned the local frequency bands with the discrete wavelet transform, giving the low-frequency part a finer division, and obtained the best vowel recognition rate in phoneme recognition. Physiological studies show that the basilar membrane of the cochlea, which plays a crucial role in hearing, acts as a bank of constant-Q band-pass frequency analyzers built on standing-wave vibration of the membrane. Decomposed physiological signals show that high-frequency components have a short duration and low-frequency components a long one, which matches the properties of wavelet analysis. Accordingly, Zhang Xueying et al. [4] proposed a Bark-domain wavelet packet decomposition applied to speech recognition, with a recognition rate in noise 10 percentage points higher than MFCC. Wavelet packet decomposition operates in both the wavelet space and the scale space and yields many frequency bands; from a signal-processing standpoint, one wants as few coefficients as possible to carry as much information as possible, which requires optimizing the wavelet packet tree. Jorge Silva [5] proposed a lowest-cost tree-pruning algorithm for wavelet packet decomposition and obtained good results in phoneme recognition. P.K. Sahu et al. proposed replacing the cochlear band-pass filter bank with a Bark-domain wavelet packet decomposition before extracting parameters [6][7], with good recognition results in isolated word recognition, especially in noisy environments.
The final step of MFCC extraction is a cepstral operation that includes a discrete cosine transform. The discrete cosine transform is the real part of the Fourier transform; like the Fourier transform it captures the statistical properties of the signal as an integral over its entire time domain, so when one frequency band is corrupted by noise the whole range is affected, and the Fourier transform also suffers severe spectral leakage at high frequencies. The discrete wavelet transform, by contrast, has strong local analysis capability and can characterize local features of the signal. Using the discrete wavelet transform in place of the discrete cosine transform in the cepstral operation, and extracting the low-frequency coefficients (noise concentrates in the high-frequency coefficients [8]), achieves a denoising effect; applied to feature extraction for speaker recognition [9] and for speech recognition [10], it gives better recognition rates in noisy speech.
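The substitution described here, a DWT in place of the DCT at the cepstral step, can be sketched as follows (a naive pure-Python illustration using an assumed Haar wavelet and synthetic sub-band energies; it is not the exact transform pair of the cited works):

```python
import math

def dct2(v):
    """Naive DCT-II, the transform the standard MFCC cepstral step applies
    to the log filter-bank energies."""
    n_pts = len(v)
    return [sum(v[n] * math.cos(math.pi * k * (n + 0.5) / n_pts)
                for n in range(n_pts)) for k in range(n_pts)]

def haar_dwt(v, levels):
    """Multi-level orthonormal Haar DWT; returns [approximation, detail_L,
    ..., detail_1]. Noise tends to concentrate in the detail (high-frequency)
    coefficients, so keeping the low-frequency part acts as denoising."""
    a, details = list(v), []
    for _ in range(levels):
        d = [(a[2*i] - a[2*i+1]) / math.sqrt(2) for i in range(len(a) // 2)]
        a = [(a[2*i] + a[2*i+1]) / math.sqrt(2) for i in range(len(a) // 2)]
        details.insert(0, d)
    return [a] + details

# Stand-in log energies of 24 sub-bands (hypothetical values).
log_energies = [math.log(1.0 + b) for b in range(24)]
mfcc_like = dct2(log_energies)[:13]    # classic cepstrum: DCT, keep 13
wpcc_like = haar_dwt(log_energies, 3)  # DWT replaces the DCT (3 levels)
print([len(c) for c in wpcc_like])     # [3, 3, 6, 12]
```

The DCT mixes every band into every coefficient, while the DWT keeps the band structure local, which is the localization argument made above.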
One frame of speech may contain two phonemes. If the first phoneme is a consonant and the second a vowel, the low- and high-frequency content of the first phoneme is affected by that of the second; MFCC extraction processes the whole frequency range and cannot overcome the influence of adjacent phonemes, whereas the discrete wavelet transform captures phoneme-transition information, which may be present only in certain local frequency bands. Nehe N.S. [11] divided the signal spectrum with the discrete wavelet transform and computed LPCC (Linear Predictive Cepstral Coefficients) per sub-band, with good speech recognition results. Weaam Alkhaldi applied the same idea to Arabic digit recognition [12] and telephone speech recognition [13], and Malik [14] to speaker recognition. Mangesh S. Deshpande [15] divided the bands with wavelet packet decomposition and Jian-Da Wu [16] with an irregular wavelet packet decomposition, both with good results in speaker identification.
References
[1] Farooq O. and Datta S., Robust features for speech recognition based on admissible wavelet packets, Electronics Letters, Vol. 37, No. 25, 6 December 2001, pp. 1554-1556.
[2] Farooq O. and Datta S., Wavelet based robust sub-band features for phoneme recognition, IEE Proc.-Vis. Image Signal Process., Vol. 151, No. 3, June 2004, pp. 187-193.
[3] Farooq O. and Datta S., Phoneme recognition using wavelet based features, Information Sciences 150 (2003), pp. 5-15.
[4] Xue-ying Zhang, The Speech Recognition System Based On Bark Wavelet MFCC, 8th International Conference on Signal Processing, 2007.
[5] P.K. Sahu and Astik Biswas, Hindi phoneme classification using Wiener filtered wavelet packet decomposed periodic and aperiodic acoustic feature, Computers and Electrical Engineering 42, 2015, pp. 12-22.
[6] P.K. Sahu and Astik Biswas, Admissible wavelet packet features based on human inner ear frequency response for Hindi consonant recognition, Computers and Electrical Engineering 40, 2014, pp. 1111-1122.
[7] Jorge Silva and Shrikanth S. Narayanan, Discriminative Wavelet Packet Filter Bank Selection for Pattern Recognition, IEEE Transactions on Signal Processing, Vol. 57, No. 5, May 2009, pp. 1796-1810.
[8] Tufekci Z. and Gowdy J.N., Feature extraction using discrete wavelet transform for speech recognition, Conference Proceedings - IEEE Southeastcon, 2000, pp. 116-123.
[9] Tufekci Z., Noise Robust Speaker Verification Using Mel-Frequency Discrete Wavelet Coefficients and Parallel Model Compensation, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, v I, 2005, pp. 1657-1660.
[10] Tufekci Z. and Gowdy John N., Applied mel-frequency discrete wavelet coefficients and parallel model compensation for noise-robust speech recognition, Speech Communication 48, 2006, pp. 1295-1307.
[11] Nehe N.S., New Robust Subband Cepstral Feature for Isolated Word Recognition, International Conference on Advances in Computing, Communication and Control, Mumbai, Maharashtra, India, January 23-24, 2009, pp. 326-330.
[12] Weaam Alkhaldi, Waleed Fakhr and Nadder Hamdy, Multi-band based recognition of spoken arabic numerals using wavelet transform, Proceedings of The 19th National Radio Science Conference, Alexandria, Egypt, March 2002, pp. 224-229.
[13] Alkhaldi W., Automatic Speech/Speaker Recognition In Noisy Environments Using Wavelet Transform, The 45th Midwest Symposium on Circuits and Systems, 2002, pp. 463-466.
[14] Malik S., Wavelet Transform Based Automatic Speaker Recognition, IEEE 13th International Multitopic Conference, 2009.
[15] Mangesh S. Deshpande, Speaker Identification Using Admissible Wavelet Packet Based Decomposition, International Journal of Signal Processing 6:1, 2010, pp. 20-23.
[16] Jian-Da Wu, Speaker identification using discrete wavelet packet transform technique with irregular decomposition, Expert Systems with Applications 36, 2009, pp. 3136-3143.
Summary of the invention
The object of the present invention is to provide a wavelet-based speech recognition feature parameter extraction method in the Mel domain, so as to solve the problems of the prior art in which the Fourier transform is used to process speech signals.
To achieve the above object, the technical solution adopted by the present invention is:
A wavelet-based speech recognition feature parameter extraction method in the Mel domain, characterized by comprising the following steps:
(1) inputting a speech signal;
(2) pre-processing the input speech signal;
(3) after pre-processing, extracting feature vectors reflecting the signal characteristics from the speech signal based on the wavelet transform;
(4) establishing a reference model library of training speech from the extracted feature vectors;
(5) comparing the feature vectors of the input speech signal with the models in the reference model library, and outputting the model with the highest similarity as the candidate recognition result;
(6) processing the candidate recognition result of step (5) with linguistic knowledge to obtain the final recognition result.
In the above method, the process in step (3) is as follows:
(1) pre-process the input speech signal: framing and windowing;
(2) apply the wavelet packet transform to each windowed frame of the speech signal to obtain sub-bands;
(3) extract the energy spectrum of each sub-band;
(4) apply the discrete wavelet transform to the energy spectrum to obtain 13-dimensional coefficients.
A Chinese vowel or consonant signal x(t) is input, t being the time variable.
Sampling: the input speech signal is sampled at a sampling frequency fs of 8 kHz, giving the sampled signal x'(t). Pre-emphasis 1 - 0.98z^(-1) is then applied; its time-domain form is h(t) = δ(t) - 0.98δ(t - 1), where δ(t) is the unit impulse function, so the pre-emphasized speech signal is a(t) = x'(t) - 0.98x'(t - 1).
The speech signal is windowed with a Hamming window of length 32 ms and shift 16 ms. Framing uses overlapped segmentation, the overlap between the previous frame and the next being the frame shift; it is implemented by weighting with a sliding finite-length window, i.e., the pre-emphasized speech signal a(t) is multiplied by the window function w'(t) to form the windowed speech signal b(t), b(t) = a(t) × w'(t).
The window function is the Hamming window:
w'(t) = 0.54 - 0.46 cos(2πt/(N - 1)), 0 ≤ t ≤ N - 1
where N is the window length, equal to the frame length. The i-th frame obtained after windowing and framing is
xi(t) = w'(t)b(t), 0 ≤ t ≤ N - 1
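The pre-processing above (8 kHz sampling, 1 - 0.98z^(-1) pre-emphasis, 32 ms Hamming windows shifted by 16 ms) can be sketched in Python as follows; the test tone is a hypothetical input, and the handling of the first sample in the pre-emphasis is an assumption:

```python
import math

FS = 8000                      # sampling frequency: 8 kHz
FRAME_LEN = int(0.032 * FS)    # 32 ms window -> 256 samples (= frame length N)
FRAME_SHIFT = int(0.016 * FS)  # 16 ms shift  -> 128 samples (50% overlap)

def preemphasize(x, alpha=0.98):
    """a(t) = x(t) - 0.98*x(t-1), the time-domain form of 1 - 0.98*z^-1.
    The first sample is passed through unchanged (an assumption)."""
    return [x[0]] + [x[t] - alpha * x[t - 1] for t in range(1, len(x))]

def hamming(n_pts):
    """Hamming window w'(t) = 0.54 - 0.46*cos(2*pi*t/(N-1))."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * t / (n_pts - 1))
            for t in range(n_pts)]

def windowed_frames(x):
    """Overlapped segmentation: a 32 ms frame every 16 ms, each frame
    multiplied by the Hamming window, i.e. b(t) = a(t) * w'(t) per frame."""
    w = hamming(FRAME_LEN)
    return [[x[s + t] * w[t] for t in range(FRAME_LEN)]
            for s in range(0, len(x) - FRAME_LEN + 1, FRAME_SHIFT)]

tone = [math.sin(2 * math.pi * 440 * t / FS) for t in range(FS)]  # 1 s, 440 Hz
frames = windowed_frames(preemphasize(tone))
print(len(frames), len(frames[0]))  # 61 256
```

One second of 8 kHz speech yields 61 overlapping frames of 256 samples each.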
Feature parameter extraction stage:
Each pre-processed frame of the speech signal is decomposed into the 24 frequency bands of Fig. 3 by wavelet packet decomposition; the energy spectrum of each band is taken, and a 3-level discrete wavelet transform is then applied to these parameters to obtain the parameter WPCC.
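The extraction stage can be sketched end to end as follows. This is an illustrative sketch only: a uniform 16-band Haar wavelet-packet tree stands in for the patent's non-uniform 24-band tree of Fig. 3 (which is not reproduced here), and the truncation to 13 coefficients is likewise an assumption:

```python
import math

def haar_split(x):
    """One Haar analysis step: (low-pass half, high-pass half)."""
    a = [(x[2*i] + x[2*i+1]) / math.sqrt(2) for i in range(len(x) // 2)]
    d = [(x[2*i] - x[2*i+1]) / math.sqrt(2) for i in range(len(x) // 2)]
    return a, d

def wavelet_packet_bands(frame, depth):
    """Full wavelet-packet tree: every node is split again, giving
    2**depth frequency bands (the patent uses a non-uniform 24-band tree)."""
    nodes = [frame]
    for _ in range(depth):
        nodes = [half for node in nodes for half in haar_split(node)]
    return nodes

def wpcc(frame, depth=4, keep=13):
    # 1) wavelet packet decomposition into sub-bands
    bands = wavelet_packet_bands(frame, depth)
    # 2) log energy of each sub-band
    log_e = [math.log(sum(c * c for c in b) + 1e-12) for b in bands]
    # 3) 3-level DWT of the log energies, replacing the DCT of MFCC
    a, details = log_e, []
    for _ in range(3):
        a, d = haar_split(a)
        details = d + details
    return (a + details)[:keep]   # 13-dimensional feature vector

frame = [math.sin(2 * math.pi * 700 * t / 8000) for t in range(256)]
print(len(wpcc(frame)))  # 13
```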
Compared with the prior art, the beneficial effects of the present invention are as follows:
The wavelet transform can extract local signal features in both the time and frequency domains. The present invention proposes the parameter WPCC, in which wavelet filters replace the Mel filters and the discrete wavelet transform replaces the discrete cosine transform. Applied to consonant and vowel recognition, the parameter achieves a higher recognition rate on rapidly changing consonants (plosives, fricatives, affricates). The wavelet high-pass and low-pass filters are close to ideal filters, the correlation between frequency bands is small, and the sidelobe components of the discrete wavelet spectrum are smaller than those of the discrete cosine spectrum, so the discrete wavelet transform is more noise-robust than the discrete cosine transform. The parameters also give good results in isolated word recognition.
Description of the drawings
Fig. 1 is the flow chart of the present invention.
Fig. 2 is the hardware architecture diagram of the present invention.
Fig. 3 is the WAVELET PACKET DECOMPOSITION figure of the present invention.
Table 1 gives the centre frequencies and bandwidths of the wavelet packet decomposition of the present invention together with the Mel-domain centre frequencies and bandwidths.
Specific embodiment
As shown in Fig. 1 to Fig. 3, a wavelet-based speech recognition feature parameter extraction method in the Mel domain comprises the following steps:
(1) inputting a speech signal;
(2) pre-processing the input speech signal;
(3) after pre-processing, extracting feature vectors reflecting the signal characteristics from the speech signal based on the wavelet transform;
(4) establishing a reference model library of training speech from the extracted feature vectors;
(5) comparing the feature vectors of the input speech signal with the models in the reference model library, and outputting the model with the highest similarity as the candidate recognition result;
(6) processing the candidate recognition result of step (5) with linguistic knowledge to obtain the final recognition result.
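The comparison of step (5) can be sketched as a minimal template matcher (illustrative only: the patent does not fix a distance measure, real systems typically use DTW or HMMs, and the two-frame templates and word labels below are hypothetical):

```python
import math

def seq_distance(a, b):
    """Frame-wise Euclidean distance between two equal-length feature
    sequences. (A real system would use DTW or an HMM; this is a stand-in.)"""
    return sum(math.dist(fa, fb) for fa, fb in zip(a, b)) / len(a)

def recognize(features, template_library):
    """Return the word whose stored template is most similar
    (smallest distance) to the input feature sequence."""
    return min(template_library,
               key=lambda w: seq_distance(features, template_library[w]))

library = {                      # hypothetical 2-frame, 3-dim templates
    "yi": [[1.0, 0.2, 0.1], [0.9, 0.3, 0.1]],
    "er": [[0.1, 1.0, 0.5], [0.2, 0.9, 0.6]],
}
print(recognize([[0.95, 0.25, 0.1], [0.9, 0.3, 0.15]], library))  # yi
```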
Further, the process in step (3) is as follows:
(1) pre-process the input speech signal: framing and windowing;
(2) apply the wavelet packet transform to each windowed frame of the speech signal to obtain sub-bands;
(3) extract the energy spectrum of each sub-band;
(4) apply the discrete wavelet transform to the energy spectrum to obtain 13-dimensional coefficients.
Specifically, the steps are as follows:
A Chinese vowel or consonant signal x(t) is input, t being the time variable.
Sampling: the input speech signal is sampled at a sampling frequency fs of 8 kHz, giving the sampled signal x'(t). Pre-emphasis 1 - 0.98z^(-1) is then applied; its time-domain form is h(t) = δ(t) - 0.98δ(t - 1), where δ(t) is the unit impulse function, so the pre-emphasized speech signal is a(t) = x'(t) - 0.98x'(t - 1).
The speech signal is windowed with a Hamming window of length 32 ms and shift 16 ms. Framing uses overlapped segmentation, the overlap between the previous frame and the next being the frame shift; it is implemented by weighting with a sliding finite-length window, i.e., the pre-emphasized speech signal a(t) is multiplied by the window function w'(t) to form the windowed speech signal b(t), b(t) = a(t) × w'(t).
The window function is the Hamming window:
w'(t) = 0.54 - 0.46 cos(2πt/(N - 1)), 0 ≤ t ≤ N - 1
where N is the window length, equal to the frame length. The i-th frame obtained after windowing and framing is
xi(t) = w'(t)b(t), 0 ≤ t ≤ N - 1
Feature parameter extraction stage:
Each pre-processed frame of the speech signal is decomposed into the 24 frequency bands of Fig. 3 by wavelet packet decomposition; the energy spectrum of each band is taken, and a 3-level discrete wavelet transform is then applied to obtain the parameter WPCC, as shown in Table 1:
Table 1: centre frequencies and bandwidths of the 24 frequency bands
Claims (3)
1. A wavelet-based speech recognition feature parameter extraction method in the Mel domain, characterized by comprising the following steps:
(1) inputting a speech signal;
(2) pre-processing the input speech signal;
(3) after pre-processing, extracting feature vectors reflecting the signal characteristics from the speech signal based on the wavelet transform;
(4) establishing a reference model library of training speech from the extracted feature vectors;
(5) comparing the feature vectors of the input speech signal with the models in the reference model library, and outputting the model with the highest similarity as the candidate recognition result;
(6) processing the candidate recognition result of step (5) with linguistic knowledge to obtain the final recognition result.
2. The wavelet-based speech recognition feature parameter extraction method in the Mel domain according to claim 1, characterized in that the process in step (3) is as follows:
(1) pre-processing the input speech signal: framing and windowing;
(2) applying the wavelet packet transform to each windowed frame of the speech signal to obtain sub-bands;
(3) extracting the energy spectrum of each sub-band;
(4) applying the discrete wavelet transform to the energy spectrum to obtain 13-dimensional coefficients.
3. The wavelet-based speech recognition feature parameter extraction method in the Mel domain according to claim 2, wherein the specific steps are as follows:
A Chinese vowel or consonant signal x(t) is input, t being the time variable.
Sampling: the input speech signal is sampled at a sampling frequency fs of 8 kHz, giving the sampled signal x'(t). Pre-emphasis 1 - 0.98z^(-1) is then applied; its time-domain form is h(t) = δ(t) - 0.98δ(t - 1), where δ(t) is the unit impulse function, so the pre-emphasized speech signal is a(t) = x'(t) - 0.98x'(t - 1).
The speech signal is windowed with a Hamming window of length 32 ms and shift 16 ms. Framing uses overlapped segmentation, the overlap between the previous frame and the next being the frame shift; it is implemented by weighting with a sliding finite-length window, i.e., the pre-emphasized speech signal a(t) is multiplied by the window function w'(t) to form the windowed speech signal b(t), b(t) = a(t) × w'(t).
The window function is the Hamming window:
w'(t) = 0.54 - 0.46 cos(2πt/(N - 1)), 0 ≤ t ≤ N - 1
where N is the window length, equal to the frame length. The i-th frame obtained after windowing and framing is
xi(t) = w'(t)b(t), 0 ≤ t ≤ N - 1
Feature parameter extraction stage:
Each pre-processed frame of the speech signal is decomposed into the 24 frequency bands of Fig. 3 by wavelet packet decomposition, the centre frequency and bandwidth of each band being shown in Table 1; the energy spectrum of each band is taken, and a 3-level discrete wavelet transform is then applied to obtain the parameter WPCC.
Taking the word as the recognition unit, recognition is performed by template matching: in the training stage, the feature-vector time series extracted from each word in the training data is stored in the template library as a template; in the recognition stage, the feature-vector time series of the speech to be recognized is compared for similarity with each template in the template library in turn, and the template with the highest similarity is output as the recognition result.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711439300.XA CN108172214A (en) | 2017-12-27 | 2017-12-27 | Wavelet-based speech recognition feature parameter extraction method in the Mel domain |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711439300.XA CN108172214A (en) | 2017-12-27 | 2017-12-27 | Wavelet-based speech recognition feature parameter extraction method in the Mel domain |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108172214A true CN108172214A (en) | 2018-06-15 |
Family
ID=62521723
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711439300.XA Pending CN108172214A (en) | 2017-12-27 | 2017-12-27 | Wavelet-based speech recognition feature parameter extraction method in the Mel domain |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108172214A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20040083094A1 (en) * | 2002-10-29 | 2004-04-29 | Texas Instruments Incorporated | Wavelet-based compression and decompression of audio sample sets |
CN101188107A (en) * | 2007-09-28 | 2008-05-28 | 中国民航大学 | A voice recognition method based on wavelet decomposition and mixed Gauss model estimation |
CN101944359A (en) * | 2010-07-23 | 2011-01-12 | 杭州网豆数字技术有限公司 | Voice recognition method facing specific crowd |
CN104523268A (en) * | 2015-01-15 | 2015-04-22 | 江南大学 | Electroencephalogram signal recognition fuzzy system and method with transfer learning ability |
- 2017-12-27: application CN201711439300.XA filed in China; patent CN108172214A (en); status Pending
Non-Patent Citations (4)
Title |
---|
Yang Likun, Xu Yang: Weighted speech feature parameters based on the wavelet packet transform, Computer Applications and Software * |
Yang Kaifeng, Mou Li, Xu Liang: Speaker recognition based on the discrete wavelet transform and RBF neural networks, Journal of Xi'an University of Technology * |
Wang Zheng, Lian Han, Wang Jianjun: A new method of feature parameter extraction in speaker recognition, Journal of Fudan University (Natural Science) * |
Chen Ruozhu, Zeng Fan, Li Zhanming: Speaker recognition based on a new feature parameter, Journal of Lanzhou University of Technology * |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109300486A (en) * | 2018-07-30 | 2019-02-01 | Sichuan University | Automatic identification method for pharyngeal fricatives in cleft palate speech based on PICGTFs and SSMC enhancement |
CN109300486B (en) * | 2018-07-30 | 2021-06-25 | 四川大学 | PICGTFs and SSMC enhanced cleft palate speech pharynx fricative automatic identification method |
CN111292753A (en) * | 2020-02-28 | 2020-06-16 | 广州国音智能科技有限公司 | Offline voice recognition method, device and equipment |
CN111563451A (en) * | 2020-05-06 | 2020-08-21 | 浙江工业大学 | Mechanical ventilation ineffective inspiration effort identification method based on multi-scale wavelet features |
CN111563451B (en) * | 2020-05-06 | 2023-09-12 | 浙江工业大学 | Mechanical ventilation ineffective inhalation effort identification method based on multi-scale wavelet characteristics |
CN111951783A (en) * | 2020-08-12 | 2020-11-17 | 北京工业大学 | Speaker recognition method based on phoneme filtering |
CN111951783B (en) * | 2020-08-12 | 2023-08-18 | 北京工业大学 | Speaker recognition method based on phoneme filtering |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Bhat et al. | A real-time convolutional neural network based speech enhancement for hearing impaired listeners using smartphone | |
CN108198545B (en) | Speech recognition method based on wavelet transformation | |
CN109256138B (en) | Identity verification method, terminal device and computer readable storage medium | |
Dişken et al. | A review on feature extraction for speaker recognition under degraded conditions | |
CN108172214A (en) | Wavelet-based speech recognition feature parameter extraction method in the Mel domain | |
WO2022141868A1 (en) | Method and apparatus for extracting speech features, terminal, and storage medium | |
CN108564956B (en) | Voiceprint recognition method and device, server and storage medium | |
Abdalla et al. | DWT and MFCCs based feature extraction methods for isolated word recognition | |
CN108922561A (en) | Speech differentiation method, apparatus, computer equipment and storage medium | |
CN105679321B (en) | Voice recognition method, device and terminal | |
Manurung et al. | Speaker recognition for digital forensic audio analysis using learning vector quantization method | |
Krishnan et al. | Features of wavelet packet decomposition and discrete wavelet transform for malayalam speech recognition | |
WO2021152566A1 (en) | System and method for shielding speaker voice print in audio signals | |
Amelia et al. | DWT-MFCC Method for Speaker Recognition System with Noise | |
Adam et al. | Wavelet cesptral coefficients for isolated speech recognition | |
CN110176243A (en) | Sound enhancement method, model training method, device and computer equipment | |
Gaafar et al. | An improved method for speech/speaker recognition | |
Joy et al. | Deep Scattering Power Spectrum Features for Robust Speech Recognition. | |
Jawarkar et al. | Effect of nonlinear compression function on the performance of the speaker identification system under noisy conditions | |
Adam et al. | Wavelet based Cepstral Coefficients for neural network speech recognition | |
Chandra et al. | Spectral-subtraction based features for speaker identification | |
Singh et al. | A comparative study of recognition of speech using improved MFCC algorithms and Rasta filters | |
Ahmad et al. | The impact of low-pass filter in speaker identification | |
Indumathi et al. | An efficient speaker recognition system by employing BWT and ELM | |
Skariah et al. | Review of speech enhancement methods using generative adversarial networks |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20180615 |