CN103778914A

CN103778914A - Anti-noise speech recognition method and device based on signal-to-noise-ratio-weighted template feature matching

Info

Publication number: CN103778914A
Application number: CN201410040474.9A
Authority: CN
Inventors: 宁更新; 吴丽菲; 宁小娟
Original assignee: South China University of Technology SCUT
Current assignee: South China University of Technology SCUT
Priority date: 2014-01-27
Filing date: 2014-01-27
Publication date: 2014-05-07
Anticipated expiration: 2034-01-27
Also published as: CN103778914B

Abstract

The invention discloses an anti-noise speech recognition method and device based on signal-to-noise-ratio (SNR) weighted template feature matching. The method comprises the following steps: (1) the input speech signal is preprocessed and a phase coefficient is obtained; (2) the feature of the input speech, namely the phase MFCC, is calculated; (3) SNR-based template feature matching is carried out. The invention further discloses a device implementing the method. The device comprises a power supply module, a display module, a storage module, a DSP/ARM digital signal processing module, a microphone, an A/D converter and a USB interface. The method and device have the advantages of a wide range of application, high accuracy, low cost, convenient use and strong adaptability.

Description

Anti-noise speech recognition method and device based on signal-to-noise-ratio-weighted template feature matching

Technical field

The present invention relates to audio signal processing technology, and in particular to an anti-noise speech recognition method and device based on signal-to-noise-ratio-weighted template feature matching.

Background technology

Speech recognition is used very widely and touches almost every aspect of daily life, for example in voice dialing systems, ticket reservation systems, medical services, banking services, dictation machines, computer control, industrial control and voice communication systems. Speech recognition technology is profoundly changing daily life in fields such as industry, household appliances, communications, medical care and home services. Practical environments now place increasingly high demands on the noise robustness of speech recognition, so extracting feature vectors that are robust and highly discriminative is of great importance to a speech recognition system.

The features currently used for speech recognition are all based on the power spectrum of the speech signal, which represents the distribution of the signal's energy over the frequency domain. When external noise is present, this energy distribution also contains the energy of the noise. The corresponding feature vectors are therefore very sensitive to external noise, and speech recognition systems perform poorly in noisy environments.

Methods for addressing the noise sensitivity of feature vectors fall into two categories: feature-based and model-based. Feature-based methods work at the front end of the speech recognition system and make the generated feature vectors as independent of noise as possible. Model-based methods work at the back end: they adjust the model parameters using a small amount of adaptation data from the test environment, gradually transforming the model parameters to the actual environment and thereby improving the recognition rate. Feature-based solutions include spectral subtraction and RASTA processing. Model-based methods include parallel model combination (PMC), adaptation based on vector Taylor series (VTS) and signal decomposition.

At present, two kinds of speech feature parameters are mainly extracted for speech recognition: linear prediction cepstral coefficients (LPCC) and Mel-frequency cepstral coefficients (MFCC). LPCC parameters represent speech effectively and are fast to compute, but they do not take into account how the human auditory system processes speech. The Mel frequency band division is an engineering approximation of human hearing characteristics, and MFCC models, to a certain extent, how the human ear processes speech.

However, whether MFCC or LPCC is used, existing speech recognition features do not perform well at low signal-to-noise ratios. To overcome this weakness, the present invention first proposes a new feature that is more robust at low signal-to-noise ratios by changing the correlation measure: the angle between two time-delayed signal vectors is used as the correlation measure. Because this angle is a nonlinear transformation of the conventional autocorrelation coefficient (a scalar product), it emphasizes spectral peaks in the frequency domain through the phase, and peaks are comparatively robust to noise. Then, since the conventional feature suits high signal-to-noise ratios and the new feature suits low signal-to-noise ratios, a template matching calculation method weighted by the signal-to-noise ratio is proposed, and finally a corresponding device is proposed.

Summary of the invention

The primary object of the present invention is to overcome the shortcomings and deficiencies of the prior art by providing an anti-noise speech recognition method based on signal-to-noise-ratio-weighted template feature matching; the method has a wide range of application and high accuracy.

Another object of the present invention is to overcome the shortcomings and deficiencies of the prior art by providing a device that implements the anti-noise speech recognition method based on signal-to-noise-ratio-weighted template feature matching. The device runs on a DSP/ARM7 chip and can be implemented with TI's TMS320C6711 or Samsung's ARM7 S3C44B0.

The primary object of the present invention is achieved through the following technical solution: an anti-noise speech recognition method based on signal-to-noise-ratio-weighted template feature matching, comprising the following steps:

Step 1: preprocess the input speech signal and obtain the phase coefficient;

The digitized speech signal s[n] is divided into frames, and a Hamming window is applied to each frame. The signal is divided into T frames,

\{ s_{0}[n], s_{1}[n], \ldots, s_{t}[n], \ldots, s_{T-1}[n] \}

where

s_{t}[n] = \{ s[Kt], s[Kt+1], \ldots, s[Kt+N-1] \}

K is the frame shift, N is the frame length, and s_t[n] is the frame signal sequence at time t.
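The framing step above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `hamming` and `split_frames` are hypothetical, the window uses the common 0.54/0.46 Hamming coefficients (the text does not give them), and any samples left over after the last full frame are simply dropped.

```python
import math


def hamming(n_len):
    """Hamming window of length n_len (assumed 0.54/0.46 coefficients)."""
    if n_len == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (n_len - 1))
            for n in range(n_len)]


def split_frames(s, frame_len, frame_shift):
    """Return T windowed frames, frame t covering s[K*t] .. s[K*t + N - 1].

    frame_len is N and frame_shift is K in the notation of the text.
    """
    window = hamming(frame_len)
    num_frames = 0
    if len(s) >= frame_len:
        num_frames = 1 + (len(s) - frame_len) // frame_shift
    return [[window[n] * s[frame_shift * t + n] for n in range(frame_len)]
            for t in range(num_frames)]
```

With the settings given later in Embodiment 1 (8 kHz sampling, 20 ms frames, 10 ms shift), the call would be `split_frames(samples, 160, 80)`.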

A speech signal is short-time stationary, so each frame can be treated as stationary. Periodically extending the resulting frame signal gives the autocorrelation function

R[k] = \sum_{n=0}^{N-1} \tilde{s}_{t}[n] \, \tilde{s}_{t}[n+k], \quad k = 0, 1, \ldots, N-1.

As the formula above shows, R[k] is the dot product of two N-dimensional vectors,

x_{0} = \{ \tilde{s}_{t}[0], \tilde{s}_{t}[1], \ldots, \tilde{s}_{t}[N-1] \},

x_{k} = \{ \tilde{s}_{t}[k], \ldots, \tilde{s}_{t}[N-1], \tilde{s}_{t}[0], \ldots, \tilde{s}_{t}[k-1] \},

R[k] = x_{0}^{T} x_{k} = \|x\|^{2} \cos(\theta_{k}),

(formula 1)

where ||x||² = ||x_0||² = ||x_k||² is the frame energy and θ_k is the angle between the vectors x_0 and x_k in N-dimensional space.

Applying the nonlinear arccosine transformation to the normalized autocorrelation coefficient gives the phase coefficient

P[k] = \theta_{k} = \cos^{-1}\left( \frac{R[k]}{\|x\|^{2}} \right),

(formula 2)

P[k] takes values between 0 and π. Normalizing it to the range 0 to 1 gives the normalized phase autocorrelation function

P_{n}[k] = \frac{P[k]}{\pi} = \frac{\cos^{-1}(R_{n}[k])}{\pi},

(formula 3)

where R_n[k] = R[k]/||x||² is the normalized autocorrelation coefficient. P_n[k] improves robustness at low signal-to-noise ratios, but at high signal-to-noise ratios, and especially for clean speech, it performs worse than R_n[k].
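Formulas 1 to 3 translate directly into code. The sketch below is illustrative only: the function name `phase_autocorrelation` is hypothetical, the periodic extension is realised with a modulo index, and the clamp to [-1, 1] is an added safeguard against floating-point rounding rather than part of the described method.

```python
import math


def phase_autocorrelation(frame):
    """Return (R_n, P_n) for one windowed frame.

    R_n[k] is the normalized circular autocorrelation and
    P_n[k] = arccos(R_n[k]) / pi is the normalized phase coefficient.
    """
    n_len = len(frame)
    energy = sum(v * v for v in frame)            # ||x||^2, the frame energy
    if energy == 0.0:
        raise ValueError("frame has zero energy; the angle is undefined")
    r_n, p_n = [], []
    for k in range(n_len):
        # Dot product of x_0 with x_k (the frame circularly shifted by k).
        r = sum(frame[n] * frame[(n + k) % n_len] for n in range(n_len))
        c = max(-1.0, min(1.0, r / energy))       # guard against rounding
        r_n.append(c)
        p_n.append(math.acos(c) / math.pi)        # formula 3
    return r_n, p_n
```

For the four-sample frame `[1.0, 0.0, -1.0, 0.0]`, the shifted vectors are identical, orthogonal, opposite and orthogonal to the original, so P_n takes the values 0, 0.5, 1 and 0.5.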

Step 2: calculate the feature of the input speech, namely the phase MFCC;

Apply the DFT to P_n[k] to obtain the phase power spectrum S_p[l]:

S_{p}[l] = \sum_{k=0}^{N-1} P_{n}[k] \exp\left( -j \frac{2\pi}{N} k l \right),

(formula 4)

Here S_p[l] is called the phase power spectrum, and the Mel-frequency cepstral coefficients derived from it are called the phase MFCC: the spectrum is filtered by a Mel-frequency-scale filter bank and the logarithm is then taken. Once the information in each frequency band has been separated out, the frequency-domain features are transformed to the time domain with the discrete cosine transform (DCT), giving the phase MFCC parameters.

The phase MFCC parameters consist of the L-order static cepstral coefficients together with their first-order and second-order derivatives, 3L dimensions in total.
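The chain from P_n[k] to the static phase MFCC coefficients might look like the following sketch. Several details are assumptions, because the text does not fix them: the magnitude of S_p[l] is used so that the logarithm is defined, the filters are triangular with centres spaced by the common 2595·log10(1 + f/700) Mel mapping, a small floor protects the logarithm, and the DCT is an unnormalised DCT-II. All function names and default parameter values are hypothetical, and the first-order and second-order derivatives are not computed here.

```python
import cmath
import math


def dft_magnitude(p_n):
    """|S_p[l]| for l = 0 .. N/2, following formula 4."""
    n_len = len(p_n)
    return [abs(sum(p_n[k] * cmath.exp(-2j * math.pi * k * l / n_len)
                    for k in range(n_len)))
            for l in range(n_len // 2 + 1)]


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(num_filters, num_bins, sample_rate):
    """Triangular filters whose edges are equally spaced on the Mel scale."""
    nyquist = sample_rate / 2.0
    top = hz_to_mel(nyquist)
    edges = [mel_to_hz(top * i / (num_filters + 1))
             for i in range(num_filters + 2)]
    bank = []
    for m in range(1, num_filters + 1):
        lo, mid, hi = edges[m - 1], edges[m], edges[m + 1]
        row = []
        for b in range(num_bins):
            f = nyquist * b / (num_bins - 1)      # frequency of bin b in Hz
            if lo < f <= mid:
                row.append((f - lo) / (mid - lo))  # rising slope
            elif mid < f < hi:
                row.append((hi - f) / (hi - mid))  # falling slope
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def dct_ii(values, num_coeffs):
    """Unnormalised DCT-II of the log filter-bank outputs."""
    m = len(values)
    return [sum(values[i] * math.cos(math.pi * c * (i + 0.5) / m)
                for i in range(m))
            for c in range(num_coeffs)]


def phase_mfcc(p_n, sample_rate=8000, num_filters=13, num_coeffs=13,
               floor=1e-10):
    """Static phase MFCC coefficients of one frame (derivatives omitted)."""
    spectrum = dft_magnitude(p_n)
    bank = mel_filterbank(num_filters, len(spectrum), sample_rate)
    log_energies = [math.log(max(floor, sum(w * s for w, s in zip(row, spectrum))))
                    for row in bank]
    return dct_ii(log_energies, num_coeffs)
```

Applying the same chain to the normalized autocorrelation R_n[k] instead of P_n[k] would give the conventional MFCC counterpart used alongside the phase MFCC in Step 3.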

Step 3: template feature matching based on SNR;

The reference database contains j reference speech templates, each comprising a 3M-dimensional MFCC feature and a 3L-dimensional phase MFCC feature. The Euclidean distance between the test template and the i-th reference template for the 3M-dimensional MFCC feature vector is D_mi, and the Euclidean distance between the test template and the i-th reference template for the 3L-dimensional phase MFCC feature vector is P_li, i = 0, 1, ..., j-1.

It is known that the 3L-dimensional phase MFCC feature vector is more robust at low signal-to-noise ratios, whereas at high signal-to-noise ratios, and especially for clean speech, the 3M-dimensional MFCC feature vector is more robust.

On this basis, the present invention adopts a signal-to-noise-ratio-weighted method: different weights are used under different signal-to-noise ratio conditions to obtain the weighted distance C_i between templates for the two kinds of feature vectors.

C_{i} = (1 - w) D_{mi} + w P_{li}, \quad i = 0, 1, \ldots, j-1,

(formula 5)

Template matching is the process of searching the j reference templates for the template that attains min{C_i}, i = 0, 1, ..., j-1.
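The weighted search of formula 5 can be sketched as follows. The function names and the data layout are hypothetical. The sketch also assumes that each template is stored as a single fixed-length vector per feature type, which the text does not state explicitly.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def match_template(test_mfcc, test_phase, references, w):
    """Return (best_index, costs) with costs[i] = (1 - w) * D_mi + w * P_li.

    references is a list of (mfcc_vector, phase_mfcc_vector) pairs and
    w is the weight of the phase MFCC distance.
    """
    if not references:
        raise ValueError("reference database is empty")
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie in [0, 1]")
    costs = []
    for ref_mfcc, ref_phase in references:
        d_m = euclidean(test_mfcc, ref_mfcc)      # conventional MFCC distance
        p_l = euclidean(test_phase, ref_phase)    # phase MFCC distance
        costs.append((1.0 - w) * d_m + w * p_l)
    best = min(range(len(costs)), key=costs.__getitem__)
    return best, costs
```

With w = 0 the decision rests entirely on the conventional MFCC distance, and with w = 1 entirely on the phase MFCC distance. The same test utterance can therefore match different templates as the noise level changes.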

w is the weight of the phase MFCC template distance. Its value is determined by the signal-to-noise ratio SNR, which can be obtained as follows:

SNR = \log_{10}\left( \frac{\|Y\|^{2}}{\|N\|^{2}} \right) \cong \log_{10}\left( \frac{\|Y\|^{2}}{\overline{\|N\|^{2}}} \right),

(formula 6)

\|Y\|^{2} = \|X\|^{2} + \|N\|^{2} \cong \|X\|^{2} + \overline{\|N\|^{2}},

(formula 7)

where ||Y||² is the frame energy of the speech in the actual environment, ||N||² is the energy of the noise signal sampled in the actual environment, and \overline{\|N\|^{2}} is the estimate of that energy.
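A minimal sketch of the SNR estimate in formula 6 is given below. It assumes that the noise energy estimate is the mean frame energy over frames recorded without speech and that the noisy-speech energy is likewise averaged over frames; the text does not specify the averaging. The function names are hypothetical. The ratio is taken as a plain log10, as written in formula 6, not scaled to decibels.

```python
import math


def frame_energy(samples):
    """Sum of squared samples of one frame."""
    return sum(v * v for v in samples)


def estimate_snr(noisy_frames, noise_frames):
    """log10 of mean noisy-speech frame energy over mean noise frame energy."""
    if not noisy_frames or not noise_frames:
        raise ValueError("both frame lists must be non-empty")
    noise_energy = sum(frame_energy(f) for f in noise_frames) / len(noise_frames)
    speech_energy = sum(frame_energy(f) for f in noisy_frames) / len(noisy_frames)
    if noise_energy <= 0.0 or speech_energy <= 0.0:
        raise ValueError("energies must be positive")
    return math.log10(speech_energy / noise_energy)
```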

The value of w is determined by the signal-to-noise ratio SNR,

w = f(SNR),

(formula 8)

where f(SNR) expresses the relationship between the weight coefficient w and the signal-to-noise ratio SNR. f(SNR) takes values in (0, 1) and is negatively correlated with SNR; the relationship may be linear or nonlinear. It can be expressed in either of the following two ways:

Mode one:

w = f(SNR) = \exp\left( -\frac{SNR - \alpha}{\gamma} \right) \cdot u(SNR - \alpha) + u(\alpha - SNR),

(formula 9)

Mode two:

w = f(SNR) = 1 - \frac{1}{1 + \exp\left[ -\frac{SNR - \beta}{\theta} \right]},

(formula 10)

u(·) is the step function. α takes values in (1, 5) and is the SNR threshold: when SNR is less than α the weight coefficient w is 1, and when SNR is greater than α, w is negatively correlated with SNR and decays exponentially, converging to 0 as SNR grows. β takes values in (1, 10) and is the critical SNR at which the conventional MFCC and the phase MFCC have equal weight. γ and θ take values in (0.1, 1) and both control the rate of change: the larger the value, the slower the change.
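The two weighting modes can be sketched as below. The function names are hypothetical. Mode one is written directly from the behaviour described in the text (w equals 1 below the threshold α and decays exponentially above it), and mode two uses an algebraically equivalent form of formula 10 that avoids overflow for extreme SNR values. The default parameters are the ones used in Embodiments 1 and 2.

```python
import math


def weight_exponential(snr, alpha=3.0, gamma=0.5):
    """Mode one: w = 1 for SNR < alpha, exp(-(SNR - alpha) / gamma) otherwise."""
    if snr < alpha:
        return 1.0
    return math.exp(-(snr - alpha) / gamma)


def weight_sigmoid(snr, beta=3.0, theta=0.5):
    """Mode two: w = 1 - 1 / (1 + exp(-(SNR - beta) / theta))."""
    z = (snr - beta) / theta
    # 1 - 1/(1 + e^-z) equals e^-z / (1 + e^-z) and also 1 / (1 + e^z);
    # pick the form whose exponential cannot overflow.
    if z >= 0.0:
        e = math.exp(-z)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(z))
```

Both functions decrease as SNR grows, so the phase MFCC distance dominates in noisy conditions and the conventional MFCC distance dominates for clean speech.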

Another object of the present invention is achieved through the following technical solution: a device implementing the anti-noise speech recognition method based on signal-to-noise-ratio-weighted template feature matching, comprising a power supply module, a display module, a storage module, a DSP/ARM digital signal processing module, a microphone, an A/D converter and a USB interface. The storage module, the USB interface, the display module, the power supply module and one end of the A/D converter are all electrically connected to the DSP/ARM digital signal processing module, and the microphone is electrically connected to the other end of the A/D converter. The microphone is used to input the test speech, the A/D converter digitizes the test speech, the DSP/ARM chip extracts the features and performs template matching, the storage module stores the reference database, the display module shows the result, and the USB interface connects to a computer.

The A/D converter is an ADC0832 chip; the DSP/ARM digital signal processing module is a DSP/ARM7 chip.

The DSP/ARM7 chip is TI's TMS320C6711 or Samsung's ARM7 S3C44B0.

The present invention builds on the conventional computation of MFCC parameters from autocorrelation coefficients: it additionally replaces the autocorrelation coefficients with phase coefficients to obtain phase MFCC parameters and the corresponding feature vector, and proposes a template matching calculation method weighted by the signal-to-noise ratio.

Compared with the prior art, the present invention has the following advantages and effects:

(1) Wide range of application. The invention can be applied very broadly, touching almost every aspect of daily life.

(2) High accuracy. The invention exploits the fact that the phase MFCC is more robust at low signal-to-noise ratios while the conventional MFCC is more robust at high signal-to-noise ratios, especially for clean speech. By improving the feature extraction and the distance measure, it improves recognition accuracy, particularly at low signal-to-noise ratios.

(3) Low cost. A single ordinary DSP or ARM chip can perform all the computation.

(4) Ease of use. The device can be plugged into any equipment that has a USB interface and works plug-and-play.

(5) Strong adaptability. The device places no special requirements on the operating environment and works normally in most environments.

Brief description of the drawings

Fig. 1 is a block diagram of the modules of the device of the invention.

Fig. 2 is a flow chart of the preprocessing and feature extraction of the device of the invention.

Fig. 3 is a flow chart of the template matching module of the device of the invention.

Fig. 4 is a hardware structure diagram of the device of the invention.

Detailed description of the embodiments

The present invention is described in further detail below with reference to embodiments and the accompanying drawings, but the embodiments of the present invention are not limited thereto.

Embodiment 1

As shown in Fig. 1, the test speech first enters the preprocessing module and then the feature extraction module. The resulting feature vectors of the test speech, MFCC and PAC-MFCC (the phase MFCC), are input to the template matching module, where they are matched against the templates in the reference database by calculating the weighted distance (the detailed flow of the template matching module is shown in Fig. 3). The matching template with the smallest weighted distance is obtained, and the result is finally output to the display module.

The preprocessing flow and the feature extraction flow are shown in Fig. 2. Preprocessing performs pre-emphasis, digitization, framing and windowing. Feature extraction then extracts the features of each test speech frame: the autocorrelation coefficients and the phase coefficients are calculated, the FFT is applied, the result is passed through the Mel filter bank, and the logarithm and the discrete cosine transform (DCT) are then applied to obtain the conventional MFCC and the phase MFCC. In addition, the noise energy is estimated in the actual environment while no test speech is present, and the SNR of that environment is obtained.

The specific implementation steps of the speech recognition device are as follows:

Step 1: digitize the test speech at a sampling frequency of 8 kHz, then apply pre-emphasis. Each frame is 20 ms long, the frame shift is 10 ms, and the window is a Hamming window.

Step 2: analyze each speech frame: first perform periodic extension, then obtain the normalized autocorrelation coefficients and the phase coefficients according to (formula 1) to (formula 3).

Step 3: apply the FFT to the coefficients obtained to get the corresponding power spectra. Filter the two resulting spectra with a 13th-order Mel-scale filter bank, then apply the logarithm and the DCT to obtain the 13th-order static cepstral coefficients of the MFCC and the 13th-order static cepstral coefficients of the phase MFCC. Taking the first-order and second-order derivatives of both gives 39-dimensional MFCC parameters and 39-dimensional phase MFCC parameters, which serve as the feature vectors.

Step 4: while no test speech is present, collect the noise signal in the actual environment and obtain the noise energy. Then estimate the signal-to-noise ratio in the actual environment with test speech present, using (formula 6) and (formula 7).

Step 5: calculate the Euclidean distance D_m between the test template and the reference template for the 39-dimensional MFCC feature vector, and the Euclidean distance P_n between the test template and the reference template for the 39-dimensional phase MFCC feature vector.

Step 6: calculate the weight for the template distances of the two feature vectors according to (formula 8), and finally obtain the weighted distance C according to (formula 5).

The weight is calculated as follows:

w = f(SNR) = \exp\left( -\frac{SNR - \alpha}{\gamma} \right) \cdot u(SNR - \alpha) + u(\alpha - SNR),

with parameters α = 3 and γ = 0.5.
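As a worked illustration of these settings, the frame sizes and one weight value can be computed by hand. The SNR value of 4 is a hypothetical example, not a figure from the embodiment, and the weight follows the description that w decays exponentially once SNR exceeds α.

```latex
% Embodiment 1 settings: 8 kHz sampling, 20 ms frames, 10 ms frame shift
N = 8000 \times 0.020 = 160, \qquad K = 8000 \times 0.010 = 80
% 13 static coefficients plus first-order and second-order derivatives
3 \times 13 = 39
% Hypothetical SNR = 4 with \alpha = 3, \gamma = 0.5 (SNR > \alpha)
w = \exp\left( -\frac{4 - 3}{0.5} \right) = e^{-2} \approx 0.135
C = (1 - w) D_m + w P_n \approx 0.865\, D_m + 0.135\, P_n
```

At this SNR the conventional MFCC distance carries most of the weight. For any SNR below 3 the weight would be 1, and only the phase MFCC distance would count.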

As shown in Fig. 4, a device implementing the anti-noise speech recognition method based on signal-to-noise-ratio-weighted template feature matching comprises a power supply module, a display module, a storage module, a DSP/ARM digital signal processing module, a microphone, an A/D converter and a USB interface. The storage module, the USB interface, the display module, the power supply module and one end of the A/D converter are all electrically connected to the DSP/ARM digital signal processing module, and the microphone is electrically connected to the other end of the A/D converter. The microphone is used to input the test speech, the A/D converter digitizes the test speech, the DSP/ARM chip extracts the features and performs template matching, the storage module stores the reference database, the display module shows the result, and the USB interface connects to a computer. The A/D converter is an ADC0832 chip; the DSP/ARM digital signal processing module is a DSP/ARM7 chip. The DSP/ARM7 chip is TI's TMS320C6711 or Samsung's ARM7 S3C44B0.

Embodiment 2

This embodiment is the same as Embodiment 1 except for the following:

The weight is calculated as follows:

w = f(SNR) = 1 - \frac{1}{1 + \exp\left[ -\frac{SNR - \beta}{\theta} \right]},

with parameters β = 3 and θ = 0.5.
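A few sample values show how this weight behaves around the critical point β. The SNR values below are hypothetical and chosen only for illustration.

```latex
% \beta = 3, \theta = 0.5
SNR = 2: \quad w = 1 - \frac{1}{1 + e^{2}} \approx 0.881
SNR = 3: \quad w = 1 - \frac{1}{1 + e^{0}} = 0.5
SNR = 4: \quad w = 1 - \frac{1}{1 + e^{-2}} \approx 0.119
```

At SNR = β the two distances are weighted equally, which matches the role of β as the critical SNR.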

The embodiments described above are preferred embodiments of the present invention, but the embodiments of the present invention are not limited to them. Any other change, modification, substitution, combination or simplification made without departing from the spirit and principle of the present invention shall be regarded as an equivalent replacement and falls within the scope of protection of the present invention.

Claims

1. An anti-noise speech recognition method based on signal-to-noise-ratio-weighted template feature matching, characterized in that it comprises the following steps:

Step 1: preprocess the input speech signal and obtain the phase coefficient;

Step 2: calculate the feature of the input speech, namely the phase MFCC;

Step 3: perform SNR-based template feature matching.

2. The anti-noise speech recognition method based on signal-to-noise-ratio-weighted template feature matching according to claim 1, characterized in that said step 1 comprises the following steps:

Step A: divide the digitized speech signal s[n] into frames, apply a Hamming window to each frame, and obtain T frames:

\{ s_{0}[n], s_{1}[n], \ldots, s_{t}[n], \ldots, s_{T-1}[n] \},

where:

s_{t}[n] = \{ s[Kt], s[Kt+1], \ldots, s[Kt+N-1] \},

K is the frame shift, N is the frame length, and s_t[n] is the frame signal sequence at time t;

Step B: periodically extend the resulting frame signal to obtain the autocorrelation function:

R[k] = \sum_{n=0}^{N-1} \tilde{s}_{t}[n] \, \tilde{s}_{t}[n+k], \quad k = 0, 1, \ldots, N-1;

From the expression of the autocorrelation function, R[k] is the dot product of two N-dimensional vectors,

x_{0} = \{ \tilde{s}_{t}[0], \tilde{s}_{t}[1], \ldots, \tilde{s}_{t}[N-1] \},

x_{k} = \{ \tilde{s}_{t}[k], \ldots, \tilde{s}_{t}[N-1], \tilde{s}_{t}[0], \ldots, \tilde{s}_{t}[k-1] \},

R[k] = x_{0}^{T} x_{k} = \|x\|^{2} \cos(\theta_{k}),

where ||x||² = ||x_0||² = ||x_k||² is the frame energy and θ_k is the angle between the vectors x_0 and x_k in N-dimensional space;

Step C: apply the nonlinear arccosine transformation to the normalized autocorrelation coefficient to obtain the phase coefficient:

P[k] = \theta_{k} = \cos^{-1}\left( \frac{R[k]}{\|x\|^{2}} \right),

P[k] takes values between 0 and π; normalizing it to the range 0 to 1 gives the normalized phase autocorrelation function:

P_{n}[k] = \frac{P[k]}{\pi} = \frac{\cos^{-1}(R_{n}[k])}{\pi},

where P_n[k] serves to improve robustness at low signal-to-noise ratios.

3. The anti-noise speech recognition method based on signal-to-noise-ratio-weighted template feature matching according to claim 1, characterized in that said step 2 comprises the following steps:

Step I: apply the DFT to P_n[k] to obtain the phase power spectrum S_p[l]:

S_{p}[l] = \sum_{k=0}^{N-1} P_{n}[k] \exp\left( -j \frac{2\pi}{N} k l \right),

where S_p[l] is the phase power spectrum, and the Mel-frequency cepstral coefficients derived from it are called the phase MFCC, that is: the spectrum is filtered by a Mel-frequency-scale filter bank and the logarithm is then taken;

Step II: once the information in each frequency band has been separated out, transform the frequency-domain features to the time domain with the discrete cosine transform to obtain the phase MFCC parameters; said phase MFCC parameters consist of the L-order static cepstral coefficients together with their first-order and second-order derivatives, 3L dimensions in total.

4. The anti-noise speech recognition method based on signal-to-noise-ratio-weighted template feature matching according to claim 1, characterized in that said step 3 comprises the following steps:

Step (1): the reference database contains j reference speech templates, each comprising a 3M-dimensional MFCC feature vector and a 3L-dimensional phase MFCC feature vector; the Euclidean distance between the test template and the i-th reference template for the 3M-dimensional MFCC feature vector is D_mi, and the Euclidean distance between the test template and the i-th reference template for the 3L-dimensional phase MFCC feature vector is P_li, i = 0, 1, ..., j-1;

Step (2): under different signal-to-noise ratio conditions, use different weights to obtain the weighted distance C_i between templates for the two kinds of feature vectors:

C_{i} = (1 - w) D_{mi} + w P_{li}, \quad i = 0, 1, \ldots, j-1,

where w is the weight of the phase MFCC template distance; template matching refers to searching the j reference templates for the template that attains min{C_i}, i = 0, 1, ..., j-1;

The signal-to-noise ratio SNR is obtained from the following formulas:

SNR = \log_{10}\left( \frac{\|Y\|^{2}}{\|N\|^{2}} \right) \cong \log_{10}\left( \frac{\|Y\|^{2}}{\overline{\|N\|^{2}}} \right),

\|Y\|^{2} = \|X\|^{2} + \|N\|^{2} \cong \|X\|^{2} + \overline{\|N\|^{2}},

where ||Y||² is the frame energy of the speech in the actual environment, ||N||² is the energy of the noise signal sampled in the actual environment, and \overline{\|N\|^{2}} is the estimate of that energy;

The value of w is determined by the signal-to-noise ratio SNR:

w = f(SNR),

where f(SNR) expresses the relationship between the weight coefficient w and the signal-to-noise ratio SNR, f(SNR) takes values in (0, 1), and f(SNR) is negatively correlated with SNR, either linearly or nonlinearly.

5. The anti-noise speech recognition method based on signal-to-noise-ratio-weighted template feature matching according to claim 4, characterized in that said relationship f(SNR) is expressed as follows:

w = f(SNR) = \exp\left( -\frac{SNR - \alpha}{\gamma} \right) \cdot u(SNR - \alpha) + u(\alpha - SNR),

where u(·) is the step function, α takes values in (1, 5) and is the SNR threshold; when SNR is less than α the weight coefficient w is 1, and when SNR is greater than α, w is negatively correlated with SNR and decays exponentially, converging to 0 as SNR grows.

6. The anti-noise speech recognition method based on signal-to-noise-ratio-weighted template feature matching according to claim 4, characterized in that said relationship f(SNR) is expressed as follows:

w = f(SNR) = 1 - \frac{1}{1 + \exp\left[ -\frac{SNR - \beta}{\theta} \right]},

where β takes values in (1, 10) and is the critical SNR at which the conventional MFCC and the phase MFCC have equal weight; γ and θ take values in (0.1, 1) and both control the rate of change: the larger the value of γ or θ, the slower the change.

7. A device implementing the anti-noise speech recognition method based on signal-to-noise-ratio-weighted template feature matching according to claim 1, characterized in that it comprises a power supply module, a display module, a storage module, a DSP/ARM digital signal processing module, a microphone, an A/D converter and a USB interface; the storage module, the USB interface, the display module, the power supply module and one end of the A/D converter are all electrically connected to the DSP/ARM digital signal processing module, and the microphone is electrically connected to the other end of the A/D converter; the microphone is used to input the test speech, the A/D converter digitizes the test speech, the DSP/ARM chip extracts the features and performs template matching, the storage module stores the reference database, the display module shows the result, and the USB interface connects to a computer.

8. The device according to claim 7, characterized in that said A/D converter is an ADC0832 chip and said DSP/ARM digital signal processing module is a DSP/ARM7 chip.

9. The device according to claim 8, characterized in that said DSP/ARM7 chip is TI's TMS320C6711 or Samsung's ARM7 S3C44B0.