CN100573663C - Mute detection method based on speech characteristic judgment - Google Patents

Mute detection method based on speech characteristic judgment

Info

Publication number
CN100573663C
CN100573663C CNB2006100396964A CN200610039696A
Authority
CN
China
Prior art keywords
silence
zero-crossing rate
frame
audio data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CNB2006100396964A
Other languages
Chinese (zh)
Other versions
CN1835073A (en)
Inventor
都思丹
薛卫
周余
孔令红
叶迎宪
赵康涟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University
Original Assignee
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University
Priority to CNB2006100396964A
Publication of CN1835073A
Application granted
Publication of CN100573663C
Expired - Fee Related (current legal status)
Anticipated expiration

Landscapes

  • Telephonic Communication Services (AREA)

Abstract

The invention discloses a mute (silence) detection method based on speech feature recognition. First, the multi-threshold zero-crossing rate of a frame of audio data is extracted, and the weighted multi-threshold zero-crossing rate is used to pre-judge silence, identifying obviously silent frames. Then the composite feature of the frame is extracted, comprising the zero-crossing rate, the short-time energy, and Mel-scale cepstral coefficients based on a multi-resolution spectrum. A two-class support vector machine classifies the composite audio feature into one of two classes: normal speech or silence. The invention improves the silence detection success rate and can also recognize certain specific human sounds. It is widely applicable to network voice conversation, in particular voice chat and video conferencing, and has broad market prospects.

Description

Mute detection method based on speech characteristic judgment
1. Technical Field
The present invention relates to audio processing methods, and in particular to a mute detection method based on speech characteristic judgment for use in network voice conversation.
2. Background Art
When a person speaks, the sound can be divided into silence and speech, and on average silence accounts for about 60% of the time. When several people converse, essentially only one person speaks at any moment while the others are silent. Silence and the noise introduced by the voice capture device (including breathing noise) are transmitted over the network just like speech data, lowering voice quality. Using silence suppression, the silent portions can be eliminated, saving more than 50% of the transmission bandwidth and reducing network congestion.
Existing mute detection methods extract feature values from the audio signal and compare them with preset thresholds to judge silence. The parameters used by traditional methods include the short-time zero-crossing rate, short-time energy and autocorrelation coefficients. However, speech signals and some ambient noise signals are non-stationary, so the recognition rate of such systems is poor; moreover, because the thresholds are fixed, these systems cannot adapt well to different noises, and their recognition rate is therefore not high.
In addition, with the spread of network voice conversation, most applications run on the PC platform. For convenience, the parties usually wear headsets, which places the microphone very close to the nose and mouth, so the airflow produced by normal breathing enters the microphone and generates an audio stream. Although this breathing sound is weak, it is still a human sound, and some commonly used mute detection methods (for example G.729B and G.723.1A) identify part of the breathing noise as normal speech, further lowering the recognition rate of the detection system.
3. Summary of the Invention
The object of the present invention is to provide a mute detection method based on speech feature recognition that improves the silence detection success rate and can also recognize certain specific human sounds.
The object of the invention is achieved through the following technical solution:
A mute detection method based on speech characteristic judgment, characterized in that it comprises the following steps:
(1) Extract the multi-threshold zero-crossing rate of a frame of audio data and sum it with preferred weights. The multi-threshold zero-crossing rate detection method sets three thresholds of different heights T_1, T_2, T_3 with T_1 < T_2 < T_3, and for each frame computes the zero-crossing rates Z_1, Z_2, Z_3 corresponding to T_1, T_2, T_3 with formula (1):

Z_i = \sum_m \{ |\operatorname{sgn}[x(m) - T_i] - \operatorname{sgn}[x(m-1) - T_i]| + |\operatorname{sgn}[x(m) + T_i] - \operatorname{sgn}[x(m-1) + T_i]| \}\, w(n - m) \qquad (1)

where w(n - m) is the analysis window. The total zero-crossing rate Z is given by:

Z = W_1 Z_1 + W_2 Z_2 + W_3 Z_3

where W_1, W_2, W_3 are the zero-crossing rate weights, and Z_0 is defined as the total zero-crossing rate cut-off value.
(2) Pre-judge silence with the weighted sum of the multi-threshold zero-crossing rates: if the total zero-crossing rate Z of a frame of audio data is less than the preset threshold Z_0, the frame is judged to be silent; otherwise the frame is passed to step (3) for processing.
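As an illustration of steps (1) and (2), the short Python sketch below computes the per-threshold zero-crossing counts of formula (1) (with an implicit rectangular window), combines them with the weights W_1, W_2, W_3, and compares the total against the cut-off Z_0. The numeric thresholds, weights and cut-off used as defaults are placeholders for illustration only; the patent determines the actual values by the training procedure described in the embodiment.

```python
import numpy as np

def multi_threshold_zcr(frame, thresholds=(0.01, 0.02, 0.04), weights=(0.5, 0.3, 0.2)):
    """Weighted multi-threshold zero-crossing rate of one audio frame:
    per-threshold counts as in formula (1), combined as Z = W1*Z1 + W2*Z2 + W3*Z3."""
    frame = np.asarray(frame, dtype=float)
    z_total = 0.0
    for t, w in zip(thresholds, weights):
        upper = np.abs(np.diff(np.sign(frame - t)))  # crossings of the level +T_i
        lower = np.abs(np.diff(np.sign(frame + t)))  # crossings of the level -T_i
        z_total += w * np.sum(upper + lower)
    return z_total

def prejudge_silence(frame, z_cutoff=4.0, **zcr_kwargs):
    """Step (2): a frame whose total zero-crossing rate is below the cut-off Z_0
    is declared silent without further processing."""
    return multi_threshold_zcr(frame, **zcr_kwargs) < z_cutoff
```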
(3) Extract the composite feature of a frame of audio data. The composite feature comprises the zero-crossing rate, the short-time energy, and Mel-scale cepstral coefficients based on a multi-resolution spectrum. Computing the Mel-scale cepstral coefficients from the multi-resolution spectrum involves wavelet decomposition and reconstruction, a Fourier transform, and a Mel-scale cepstrum extraction module. The Mel-scale cepstral coefficients c_{MFCC} are computed as follows:

c_{MFCC}(i) = \sqrt{\frac{2}{L}} \sum_{l=1}^{L} \log m(l) \, \cos\!\left[ \left(l - \frac{1}{2}\right) \frac{i\pi}{L} \right] \qquad (2)

where

m(l) = \sum_{k=o(l)}^{h(l)} W_l(k)\, |X_n(k)|, \quad l = 1, 2, \ldots, L \qquad (3)

W_l(k) = \begin{cases} \dfrac{k - o(l)}{c(l) - o(l)}, & o(l) \le k \le c(l) \\ \dfrac{h(l) - k}{h(l) - c(l)}, & c(l) \le k \le h(l) \end{cases} \qquad (4)

In the formulas, o(l), c(l) and h(l) are respectively the lower-limit, centre and upper-limit frequencies of the l-th triangular filter, and |X_n(k)| is the magnitude of the multi-resolution spectrum.
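A minimal sketch of formulas (2)-(4) follows, assuming the multi-resolution spectrum is available as an array of magnitudes |X_n(k)| and that each triangular filter is given by its lower, centre and upper FFT-bin indices (o, c, h); the Mel-scale placement of the filters is not reproduced here.

```python
import numpy as np

def mel_cepstrum(spectrum, filter_edges, num_coeffs=12):
    """Mel-scale cepstral coefficients of one magnitude spectrum (formulas (2)-(4)).

    spectrum     -- |X_n(k)|, the multi-resolution magnitude spectrum of the frame
    filter_edges -- sequence of (o, c, h) bin indices of the L triangular filters
    """
    L = len(filter_edges)
    m = np.zeros(L)
    for idx, (o, c, h) in enumerate(filter_edges):
        k = np.arange(o, h + 1)
        rising = (k - o) / max(c - o, 1)     # ramp up from o(l) to c(l), formula (4)
        falling = (h - k) / max(h - c, 1)    # ramp down from c(l) to h(l)
        w = np.where(k <= c, rising, falling)
        m[idx] = np.sum(w * spectrum[o:h + 1])       # filter-bank output, formula (3)
    i = np.arange(1, num_coeffs + 1)[:, None]
    l = np.arange(1, L + 1)[None, :]
    basis = np.cos((l - 0.5) * i * np.pi / L)        # cosine basis of formula (2)
    return np.sqrt(2.0 / L) * (basis @ np.log(m + 1e-12))
```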
(4) Classify the composite audio feature with a two-class support vector machine, obtaining two classes of result: normal speech and silence. Normal speech is compressed and sent to the receiver; for silence, adaptive noise is added only in some frames before compression and transmission to the receiver.
By extracting several speech parameters, the present invention detects speech in stages and pre-judges silence effectively. Audio data that cannot be identified in step (2) is detected by the subsequent steps: in step (3), to obtain the overall spectral feature of the signal, the frame of audio data is first subjected to wavelet decomposition, reconstruction and Fourier transform to form a multi-resolution spectrum, and the Mel-scale cepstrum of this spectrum is extracted as the final audio feature. In step (4), the composite feature of the audio data is classified with a support vector machine to obtain the final decision. Compared with the prior art, the present invention uses a support vector machine for audio feature classification, which rests on a stricter theoretical foundation than traditional classification methods; applied in fields such as text classification and image recognition, it has achieved better classification results than traditional machine learning methods, with high classification accuracy and good robustness.
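One way to realize the two-class classification of step (4) is sketched below with scikit-learn's SVC, whose RBF kernel and SMO-based solver match the kernel and training method named in the embodiment; the 14-dimensional feature layout, the label convention, and the mapping from σ² = 0.3 to the gamma parameter are assumptions made only for this illustration.

```python
import numpy as np
from sklearn.svm import SVC

def train_silence_classifier(X_train, y_train, sigma_sq=0.3):
    """Two-class SVM over composite features
    [zero-crossing rate, short-time energy, 12 Mel-scale cepstral coefficients].
    y_train: 1 = normal speech, 0 = silence (breathing noise labelled as silence).
    RBF kernel K(x, y) = exp(-||x - y||^2 / (2 * sigma^2))."""
    clf = SVC(kernel="rbf", gamma=1.0 / (2.0 * sigma_sq))
    clf.fit(X_train, y_train)
    return clf

def classify_frame(clf, feature_vector):
    """Step (4): returns 1 for normal speech, 0 for silence."""
    return int(clf.predict(np.asarray(feature_vector).reshape(1, -1))[0])
```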
4. Brief Description of the Drawings
Fig. 1 is a schematic flow chart of the method of the invention;
Fig. 2 is a schematic diagram of the extraction of the audio composite feature in the invention;
Fig. 3 is a wavelet decomposition tree structure diagram in the invention.
5. Detailed Description of the Embodiments
The present invention is described in detail below in conjunction with the accompanying drawings.
A mute detection method based on speech characteristic judgment according to the present invention is shown in Fig. 1. In the concrete detection process, a sampling frequency of 8 kHz is used, and the signal is processed in frames of 80 samples (10 milliseconds each). The method comprises the following steps:
(1) Extract the multi-threshold zero-crossing rate of a frame of audio data and sum it with preferred weights. Step (1) uses the total zero-crossing rate cut-off value Z_0 and the optimal weight vector (W_1, W_2, W_3), whose values must be set before silence detection begins. To determine them, at least 2000 frames of audio data from different environments are collected, half of them silence and half speech. Taking the silence misjudgment rate produced by multi-threshold zero-crossing rate detection as the objective function, every weight vector and threshold in their value ranges is traversed, and the weight vector and threshold that yield the minimum misjudgment rate are taken as the optimal weight vector and threshold Z_0.
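The exhaustive search described above could look like the sketch below, which reuses the multi_threshold_zcr helper from the earlier sketch; the grids of candidate weights and cut-off values, and the frame labels, are assumed inputs rather than values given by the patent.

```python
import itertools
import numpy as np

def tune_weights_and_cutoff(frames, labels, weight_grid, cutoff_grid,
                            thresholds=(0.01, 0.02, 0.04)):
    """Traverse candidate weight vectors (W1, W2, W3) and cut-off values Z_0,
    keeping the combination with the lowest misjudgment rate on labelled frames.
    labels: True for silent frames, False for speech frames.
    Uses multi_threshold_zcr() from the sketch following step (2)."""
    labels = np.asarray(labels, dtype=bool)
    best_weights, best_cutoff, best_error = None, None, 1.0
    for weights in itertools.product(weight_grid, repeat=3):
        z = np.array([multi_threshold_zcr(f, thresholds, weights) for f in frames])
        for z0 in cutoff_grid:
            error = np.mean((z < z0) != labels)   # silence misjudgment rate
            if error < best_error:
                best_weights, best_cutoff, best_error = weights, z0, error
    return best_weights, best_cutoff, best_error
```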
(2) Pre-judge silence with the weighted sum of the multi-threshold zero-crossing rates: if the total zero-crossing rate Z of a frame of audio data is less than the preset threshold Z_0, the frame is judged to be silent; otherwise the frame is passed to step (3) for processing.
(3) Extract the composite feature of a frame of audio data; the composite feature comprises the zero-crossing rate, the short-time energy, and Mel-scale cepstral coefficients (MFCC) based on a multi-resolution spectrum. The extraction of the MFCC based on the multi-resolution spectrum is shown in Fig. 2. A Daubechies-4 wavelet packet transform decomposes the windowed time-domain speech signal into the coefficients of 6 subbands, and the coefficients of each subband are reconstructed to the size of the first-level wavelet decomposition, as shown in Fig. 3. The coefficients of each subband are then normalized, an FFT is applied to the coefficients, and the subband spectra are summed to form the multi-resolution spectrum, which is finally delivered to the MFCC extraction module. The MFCC feature dimension is L = 12; the inner-product (kernel) function of the support vector machine is a radial basis function (σ² = 0.3); the support vector machine can be trained with the SMO method, although the invention is not restricted to this.
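A rough sketch of the multi-resolution spectrum construction using PyWavelets is given below. It follows the description (Daubechies-4 wavelet packet decomposition, per-subband reconstruction and normalization, FFT, summation), but the exact six-subband tree of Fig. 3 is not spelled out in the text, so a uniform-depth packet tree and max-amplitude normalization are assumptions made purely for illustration.

```python
import numpy as np
import pywt

def multiresolution_spectrum(frame, wavelet="db4", level=3):
    """Wavelet-packet based multi-resolution spectrum of one frame:
    decompose, reconstruct each subband separately, normalize it, take its FFT
    magnitude, and sum the subband spectra."""
    frame = np.asarray(frame, dtype=float)
    wp = pywt.WaveletPacket(data=frame, wavelet=wavelet, mode="symmetric",
                            maxlevel=level)
    spectrum = None
    for node in wp.get_level(level, order="freq"):
        # reconstruct the time-domain contribution of this subband alone
        single = pywt.WaveletPacket(data=None, wavelet=wavelet, mode="symmetric")
        single[node.path] = node.data
        rec = single.reconstruct(update=False)[:len(frame)]
        rec = rec / (np.max(np.abs(rec)) + 1e-12)        # per-subband normalization
        mag = np.abs(np.fft.rfft(rec))                   # FFT of the subband signal
        spectrum = mag if spectrum is None else spectrum + mag
    return spectrum
```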
(4) Classify the composite audio feature with the two-class support vector machine, obtaining two classes of result: one class is normal speech, the other is silence (including breathing noise). For normal speech, the system can compress the frame with a speech compression method such as G.729 or G.723 and send it to the network receiver.
In the present invention, for the frames judged silent in step (2) or step (4): in actual use, if nothing at all is transmitted during silence, the listener feels uncomfortable, so some noise must be added artificially so that the listener feels the communication has not been interrupted. The added noise must keep the noise power consistent between sender and receiver, but noise does not have to be transmitted for every silent frame; transmitting only the first frame of a continuous silent period is sufficient. How the noise is transmitted is not restricted by the present invention.
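The comfort-noise idea in the preceding paragraph — add noise whose power matches the sender-side background so the listener does not hear a dead line — might be realized as in the sketch below. How the noise parameters are actually transmitted is left open by the patent, so generating receiver-side white noise from a power estimate is only one assumed possibility.

```python
import numpy as np

def comfort_noise(reference_silent_frame, length=80):
    """Generate one comfort-noise frame (80 samples = 10 ms at 8 kHz) whose
    power matches that of a reference silent frame from the sender side."""
    ref = np.asarray(reference_silent_frame, dtype=float)
    power = np.mean(ref ** 2)
    return np.sqrt(power) * np.random.randn(length)
```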

Claims (3)

1. A mute detection method based on speech characteristic judgment, characterized in that it comprises the following steps:
(1) extracting the multi-threshold zero-crossing rates of a frame of audio data and summing them with weights to obtain the total zero-crossing rate Z;
(2) pre-judging silence with the weighted sum of the multi-threshold zero-crossing rates: if the total zero-crossing rate Z of the frame of audio data is less than the preset threshold Z_0, the frame is judged to be silent; otherwise the frame is passed to step (3) for processing;
(3) extracting the composite feature of the frame of audio data, the composite feature comprising the zero-crossing rate, the short-time energy, and Mel-scale cepstral coefficients based on a multi-resolution spectrum;
(4) classifying the composite audio feature with a two-class support vector machine to obtain two classes of result, normal speech and silence; normal speech is compressed and sent to the receiver, and for silence, adaptive noise is added only in some frames before compression and transmission to the receiver.
2. The mute detection method based on speech characteristic judgment according to claim 1, characterized in that in step (1), three multi-threshold zero-crossing rates of the audio data are extracted and summed with weights.
3. The mute detection method based on speech characteristic judgment according to claim 1, characterized in that in step (4), the silence includes breathing noise.
CNB2006100396964A 2006-04-20 2006-04-20 Mute detection method based on speech characteristic judgment Expired - Fee Related CN100573663C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB2006100396964A CN100573663C (en) 2006-04-20 2006-04-20 Mute detection method based on speech characteristic judgment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNB2006100396964A CN100573663C (en) 2006-04-20 2006-04-20 Mute detection method based on speech characteristic judgment

Publications (2)

Publication Number Publication Date
CN1835073A CN1835073A (en) 2006-09-20
CN100573663C true CN100573663C (en) 2009-12-23

Family

ID=37002788

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB2006100396964A Expired - Fee Related CN100573663C (en) 2006-04-20 2006-04-20 Mute detection method based on speech characteristic to jude

Country Status (1)

Country Link
CN (1) CN100573663C (en)

Families Citing this family (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE602007005833D1 (en) * 2006-11-16 2010-05-20 Ibm LANGUAGE ACTIVITY DETECTION SYSTEM AND METHOD
CN101393744B (en) * 2007-09-19 2011-09-14 华为技术有限公司 Method for regulating threshold of sound activation and device
CN101764882A (en) * 2009-12-31 2010-06-30 深圳市戴文科技有限公司 PTT conversation device and method for realizing PTT conversation
CN101895870A (en) * 2010-03-23 2010-11-24 中兴通讯股份有限公司 Silence recognition device for mobile phone, mobile phone de-noising method and system thereof in silence mode
CN101895642B (en) * 2010-06-30 2015-07-22 中兴通讯股份有限公司 Method and device for detecting telephone channel faults
EP2405634B1 (en) * 2010-07-09 2014-09-03 Google, Inc. Method of indicating presence of transient noise in a call and apparatus thereof
CN102332269A (en) * 2011-06-03 2012-01-25 陈威 Method for reducing breathing noises in breathing mask
CN103456301B (en) * 2012-05-28 2019-02-12 中兴通讯股份有限公司 A kind of scene recognition method and device and mobile terminal based on ambient sound
CN104112446B (en) * 2013-04-19 2018-03-09 华为技术有限公司 Breathing detection method and device
CN103325388B (en) * 2013-05-24 2016-05-25 广州海格通信集团股份有限公司 Based on the mute detection method of least energy wavelet frame
US9653094B2 (en) 2015-04-24 2017-05-16 Cyber Resonance Corporation Methods and systems for performing signal analysis to identify content types
CN105976831A (en) * 2016-05-13 2016-09-28 中国人民解放军国防科学技术大学 Lost child detection method based on cry recognition
CN108242241B (en) * 2016-12-23 2021-10-26 中国农业大学 Pure voice rapid screening method and device thereof
CN109859744B (en) * 2017-11-29 2021-01-19 宁波方太厨具有限公司 Voice endpoint detection method applied to range hood
CN108447505B (en) * 2018-05-25 2019-11-05 百度在线网络技术(北京)有限公司 Audio signal zero-crossing rate processing method, device and speech recognition apparatus
CN110910905B (en) * 2018-09-18 2023-05-02 京东科技控股股份有限公司 Mute point detection method and device, storage medium and electronic equipment
CN110310668A (en) * 2019-05-21 2019-10-08 深圳壹账通智能科技有限公司 Mute detection method, system, equipment and computer readable storage medium
CN113225592B (en) * 2020-01-21 2022-08-09 华为技术有限公司 Screen projection method and device based on Wi-Fi P2P

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1204766A (en) * 1997-03-25 1999-01-13 皇家菲利浦电子有限公司 Method and device for detecting voice activity
CN1266312A (en) * 1998-07-31 2000-09-13 摩托罗拉公司 Method and apparatus for providing speaking telephone operation in portable communication equipment
CN1290094A (en) * 2000-11-03 2001-04-04 国家数字交换系统工程技术研究中心 Multi-channel 64Kbps squelch compression method for packet switch network
CN1398126A (en) * 2001-07-18 2003-02-19 华为技术有限公司 Method for implementing multi-language coding-decoding in universal mobile communication system
CN1622193A (en) * 2004-12-24 2005-06-01 北京中星微电子有限公司 Voice signal detection method

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10311874B2 (en) 2017-09-01 2019-06-04 4Q Catalyst, LLC Methods and systems for voice-based programming of a voice-controlled device

Also Published As

Publication number Publication date
CN1835073A (en) 2006-09-20

Similar Documents

Publication Publication Date Title
CN100573663C (en) Mute detection method based on speech characteristic judgment
Hermansky et al. TRAPS-classifiers of temporal patterns.
US7684982B2 (en) Noise reduction and audio-visual speech activity detection
KR100636317B1 (en) Distributed Speech Recognition System and method
Kingsbury et al. Recognizing reverberant speech with RASTA-PLP
US6804643B1 (en) Speech recognition
CN111508498B (en) Conversational speech recognition method, conversational speech recognition system, electronic device, and storage medium
CN109788400A (en) A kind of neural network chauvent's criterion method, system and storage medium for digital deaf-aid
CN110120227A (en) A kind of depth stacks the speech separating method of residual error network
KR20080064557A (en) Apparatus and method for improving speech intelligibility
EP1250699A2 (en) Speech recognition
Sharma et al. Study of robust feature extraction techniques for speech recognition system
Khan et al. Speaker separation using visually-derived binary masks
Couvreur et al. Automatic noise recognition in urban environments based on artificial neural networks and hidden markov models
CN110197657B (en) Dynamic sound feature extraction method based on cosine similarity
Singh et al. Novel feature extraction algorithm using DWT and temporal statistical techniques for word dependent speaker’s recognition
Paliwal On the use of filter-bank energies as features for robust speech recognition
CN111341351A (en) Voice activity detection method and device based on self-attention mechanism and storage medium
CN114023352B (en) Voice enhancement method and device based on energy spectrum depth modulation
Li et al. An auditory system-based feature for robust speech recognition
CN112992131A (en) Method for extracting ping-pong command of target voice in complex scene
CN114189781A (en) Noise reduction method and system for double-microphone neural network noise reduction earphone
Malewadi et al. Development of Speech recognition technique for Marathi numerals using MFCC & LFZI algorithm
Malik et al. Wavelet transform based automatic speaker recognition
Pasad et al. Voice activity detection for children's read speech recognition in noisy conditions

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20091223

Termination date: 20160420
