CN102097095A

CN102097095A - Speech endpoint detecting method and device

Info

Publication number: CN102097095A
Application number: CN2010106095030A
Authority: CN
Inventors: 苏伟博
Original assignee: Tianjin Yaan Technology Electronic Co Ltd
Current assignee: Tianjin Yaan Technology Electronic Co Ltd
Priority date: 2010-12-28
Filing date: 2010-12-28
Publication date: 2011-06-15

Abstract

The invention belongs to the field of video monitoring, and provides a speech endpoint detecting method and device. The method comprises the following steps: sampling data of an input speech signal, and preprocessing the sampled speech signal; adding a Hamming window to the preprocessed speech signal for framing and recording as Rn (n is more than 0 and less than or equal to N), wherein N is the total number of frames; calculating a frequency spectrum information entropy of the n-th speech signal; and determining the frame as a speech frame if the frequency spectrum information entropy of the n-th speech signal is more than a set threshold value, and otherwise, determining the frame as a non-speech frame. The method applies the frequency spectrum entropy as a characteristic for distinguishing a speech frame from a non-speech frame, can effectively distinguish speech frames from non-speech frames, and has a good detection effect for low signal to noise ratio environments, so the defects that the traditional frequency spectrum entropy-based algorithm only considers the frequency spectrum information of the current frame, the noise frequency spectrum information entropy greatly fluctuates in a non-stationary noise environment, and the difficulty of threshold value selection is increased can be overcome.

Description

Speech endpoint detection method and device

Technical field

The invention belongs to the field of video monitoring, and in particular relates to a speech endpoint detection method and device.

Background art

At present, in real-time video monitoring, a microphone can pick up abnormal sounds in the monitored scene; by steering the camera's optical axis toward the location of the abnormal sound, abnormal events can be monitored in real time. Because an omnidirectional pickup captures sound from all directions, it effectively overcomes a drawback of traditional video surveillance: an abnormal event that occurs in a blind area of the camera's field of view cannot be captured quickly. When a microphone is used to pick up abnormal sounds in the monitored scene, the first and most critical step is speech endpoint detection.

Traditional endpoint detection methods, such as the short-time energy and zero-crossing rate algorithms and the improved algorithms based on entropy, on the energy-zero-crossing product, or on entropy combined with energy, perform well under stationary noise or at a high signal-to-noise ratio (SNR). Under low SNR or non-stationary conditions, the short-time energy of speech is easily confused with that of noise. The zero-crossing rate readily distinguishes unvoiced sound from noise, but has difficulty distinguishing voiced sound from noise. The short-time energy-zero-crossing product method improves the robustness of endpoint detection to some extent, but this feature is less robust to noise than the short-time information entropy. The spectral entropy is robust to noise to a certain degree; however, when the SNR decreases, the shape of the spectral entropy curve stays the same while its value decreases. Moreover, traditional spectral-entropy methods consider only the spectral information of the current frame, and in non-stationary noise the spectral information entropy of the noise fluctuates over a wide range, which makes threshold selection difficult.

Summary of the invention

The object of the present invention is to provide a speech endpoint detection method that effectively distinguishes speech frames from non-speech frames and also detects well in low-SNR environments.

An embodiment of the invention is realized as a speech endpoint detection method comprising:

sampling the input speech signal and preprocessing the sampled speech signal;

applying a Hamming window to the preprocessed speech signal to divide it into frames, denoted R_n (0 < n ≤ N), where N is the total number of frames;

calculating the spectral information entropy of the n-th frame of the speech signal;

if the spectral information entropy of the n-th frame is greater than a set threshold, judging the frame to be a speech frame, and otherwise judging it to be a non-speech frame.

Another object of the present invention is to provide a speech endpoint detection device, characterized in that the detection device comprises:

a speech signal sampling and processing unit, configured to sample the input speech signal and preprocess the sampled speech signal;

a speech signal framing unit, configured to apply a Hamming window to the preprocessed speech signal to divide it into frames, denoted R_n (0 < n ≤ N), where N is the total number of frames;

a spectral information entropy calculation unit, configured to calculate the spectral information entropy of the n-th frame of the speech signal;

a speech frame determination unit, configured to judge the frame to be a speech frame if the spectral information entropy of the n-th frame is greater than a set threshold, and otherwise to judge it to be a non-speech frame.

The advantages and positive effects of the present invention are:

The invention uses the spectral entropy as the feature that distinguishes speech from non-speech. It effectively distinguishes speech frames from non-speech frames and also detects well in low-SNR environments. It overcomes a problem of traditional spectral-entropy algorithms, which consider only the spectral information of the current frame: in non-stationary noise the spectral information entropy of the noise fluctuates strongly, which makes threshold selection more difficult.

Description of drawings

Fig. 1 is a flowchart of the speech endpoint detection method provided by an embodiment of the invention;

Fig. 2 is a flowchart of the first embodiment of the invention;

Fig. 3 is a structural block diagram of the speech endpoint detection device provided by an embodiment of the invention.

Embodiment

To make the purpose, technical solution and advantages of the invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only explain the invention and are not intended to limit it.

The embodiment of the invention proposes a speech endpoint detection method for low-SNR conditions in the monitoring field. The method uses the subband spectral entropy as the feature that distinguishes speech frames from non-speech frames. Each frame of the speech signal is first decomposed by wavelet transform to obtain subband signals in different frequency ranges. An FFT is then applied to these subband signals and the spectral entropy of each subband is calculated. The subband spectral entropies of several frames before and after the current frame are smoothed by a set of order statistics filters, and the spectral entropy of each frame is calculated. Speech frames and non-speech frames are judged by comparing this value with a set threshold. To improve the precision of the algorithm, the threshold is modified adaptively.

Fig. 1 shows the flowchart of the speech endpoint detection method provided by the embodiment of the invention. The method comprises:

In step S101, the input speech signal is sampled, and the sampled speech signal is preprocessed;

In step S102, a Hamming window is applied to the preprocessed speech signal to divide it into frames, denoted R_n (0 < n ≤ N), where N is the total number of frames;

In step S103, the spectral information entropy of the n-th frame of the speech signal is calculated;

In step S104, if the spectral information entropy of the n-th frame is greater than a set threshold, the frame is judged to be a speech frame; otherwise it is judged to be a non-speech frame.

In step S105, if n > N the algorithm ends; otherwise the method returns to the second step.
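As a rough illustration of steps S103 to S105, the per-frame decision can be sketched as follows. The function name `classify_frames` and the example entropy values are hypothetical, and the per-frame entropies are assumed to have been computed already.

```python
def classify_frames(entropies, threshold):
    """Label each frame: True = speech frame, False = non-speech frame."""
    labels = []
    for h in entropies:               # frames n = 1..N, visited in order
        labels.append(h > threshold)  # S104: entropy above the threshold means speech
    return labels                     # S105: stop once every frame has been visited


# Made-up entropy values, for illustration only
labels = classify_frames([0.2, 0.9, 1.1, 0.3], threshold=0.5)
```

With a fixed threshold this is only the skeleton of the method; the first embodiment refines both the entropy computation and the threshold.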

As the first embodiment of the invention, shown in Fig. 2, a speech endpoint detection method specifically comprises the following steps:

In step S201, the input speech signal is sampled. Because the speech signal is mainly concentrated below 8 kHz, the embodiment of the invention uses a sampling frequency of 11.025 kHz for the speech signal.

In step S202, the sampled speech signal undergoes several preprocessing operations. Pre-emphasis boosts the high-frequency part and flattens the signal spectrum, which makes spectral analysis easier. The low-level influence is reduced because the speech signal collected by the pickup takes negative values: the median value is subtracted from the signal so that the center axis of the speech waveform lies near zero. Finally, the time-domain amplitude of the speech is normalized.
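The three preprocessing operations of step S202 might look like the minimal sketch below. The first-order pre-emphasis filter and its coefficient `alpha = 0.97` are common choices assumed here, not values given in the text, and the function name `preprocess` is hypothetical.

```python
import statistics


def preprocess(samples, alpha=0.97):
    """Pre-emphasis, median removal and amplitude normalization of one signal."""
    # Pre-emphasis: y[i] = x[i] - alpha * x[i-1] boosts the high-frequency part.
    # alpha = 0.97 is an assumed value, not taken from the text.
    emphasized = [samples[0]] + [
        samples[i] - alpha * samples[i - 1] for i in range(1, len(samples))
    ]
    # Subtract the median so that the waveform is centered near zero.
    mid = statistics.median(emphasized)
    centered = [v - mid for v in emphasized]
    # Normalize the time-domain amplitude to the range [-1, 1].
    peak = max(abs(v) for v in centered)
    if peak == 0:
        return centered
    return [v / peak for v in centered]


# A short all-negative signal, as the text says the pickup delivers
out = preprocess([-3.0, -1.0, -4.0, -2.0, -5.0])
```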

In step S203, a Hamming window is applied to the preprocessed speech signal to divide it into frames, denoted R_n (0 < n ≤ N), where N is the total number of frames. The frame length is typically 20 to 30 ms and the frame shift is typically 10 to 20 ms. The Hamming window is expressed as:

W(n) = \begin{cases} 0.54 - 0.46 \cos [2 \pi n / (N - 1)], & 0 \leq n \leq N - 1 \\ 0, & \text{otherwise} \end{cases}
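Framing with the Hamming window can be sketched as follows. The 25 ms frame length and 10 ms frame shift are example values picked from the ranges stated in the text, and the function names are hypothetical. In the window formula the length argument plays the role of N, the number of samples in one window.

```python
import math


def hamming(length):
    """Hamming window W(n) = 0.54 - 0.46*cos(2*pi*n/(N-1)) for 0 <= n <= N-1."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1))
            for n in range(length)]


def split_frames(signal, sample_rate=11025, frame_ms=25, shift_ms=10):
    """Cut the signal into overlapping frames and apply the window to each."""
    frame_len = int(sample_rate * frame_ms / 1000)   # 275 samples at 25 ms
    shift = int(sample_rate * shift_ms / 1000)       # 110 samples at 10 ms
    window = hamming(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, shift):
        chunk = signal[start:start + frame_len]
        frames.append([s * w for s, w in zip(chunk, window)])
    return frames


frames = split_frames([1.0] * 11025)   # one second of a constant signal
```

Trailing samples that do not fill a complete frame are dropped in this sketch; padding the last frame would be an equally reasonable choice.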

In step S204, the n-th frame R_n of the speech signal is decomposed over five levels using a db3 wavelet basis function, which yields six subband signals in different frequency ranges, indexed by m (0 < m ≤ 6) with sample index k (0 < k ≤ q(m)), where q(m) is the length of the m-th subband signal. The subband m = 6 is the low-frequency subband signal of the wavelet decomposition; among the high-frequency subbands, the frequency decreases as m increases.
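To make the subband layout concrete, the sketch below computes the nominal frequency range of each subband for a five-level dyadic decomposition at 11.025 kHz. It assumes ideal half-band filters (real db3 filters overlap at the band edges) and assumes that m = 1 to 5 are the detail subbands of levels 1 to 5, which matches the statement that frequency decreases as m increases. The function name `nominal_bands` is hypothetical.

```python
def nominal_bands(sample_rate=11025.0, levels=5):
    """Return (low_hz, high_hz) for subbands m = 1..levels+1, ideal filters assumed."""
    bands = []
    high = sample_rate / 2            # Nyquist frequency of the input frame
    for _ in range(levels):
        low = high / 2
        bands.append((low, high))     # detail subband of this level
        high = low                    # the next level splits the lower half again
    bands.append((0.0, high))         # m = levels + 1: the low-frequency subband
    return bands


bands = nominal_bands()
```

Under these assumptions the first subband covers 2756.25 to 5512.5 Hz and the sixth covers 0 to about 172.27 Hz.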

In step S205, an FFT is applied to each subband signal to obtain the corresponding power spectrum. The number of FFT points differs between subbands according to the number of samples in each subband signal: the first to fourth subband signals use 512, 256, 128 and 64 points respectively, and the fifth and sixth subband signals use 32 points.
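A dependency-free sketch of the power spectrum computation is given below. It uses a direct DFT in place of a fast FFT routine (the result is the same, only slower), and it assumes that a subband shorter than the transform size is zero-padded, which the text does not state. The names `FFT_SIZES` and `power_spectrum` are hypothetical.

```python
import cmath

# Transform size per subband index m, as listed in step S205
FFT_SIZES = {1: 512, 2: 256, 3: 128, 4: 64, 5: 32, 6: 32}


def power_spectrum(subband, n_points):
    """Zero-pad (or truncate) to n_points, take the DFT, return |X(k)|^2."""
    padded = (list(subband) + [0.0] * n_points)[:n_points]
    spectrum = []
    for k in range(n_points):
        acc = sum(x * cmath.exp(-2j * cmath.pi * k * i / n_points)
                  for i, x in enumerate(padded))
        spectrum.append(abs(acc) ** 2)
    return spectrum


# A unit impulse has a flat power spectrum, which makes a convenient check
ps = power_spectrum([1.0, 0.0, 0.0, 0.0], FFT_SIZES[6])
```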

In step S206, the energy of each subband signal is calculated first, using the formula:

E_{m}^{n} = \sum_{k = 1}^{q(m)} X_{m}^{n}(k)^{2}, \quad k = 1, 2, \ldots, q(m)

where q(m) is the length of the m-th subband signal, X_{m}^{n}(k) is the k-th point of the m-th subband of the n-th frame, and E_{m}^{n} is the energy of the m-th subband of the n-th frame.

Next, the probability of each point of each subband signal is calculated, using the formula:

p_{k}^{m} = \frac{X_{m}^{n}(k) + Q}{E_{m}^{n} + Q}, \quad k = 1, 2, \ldots, q(m)

where p_{k}^{m} is the probability of the k-th point of the m-th subband signal, and Q is a relatively large positive number.

Finally, the spectral entropy of each subband signal is calculated, using the formula:

{Es}_{m}^{n} = \sum_{k = 1}^{q(m)} p_{k}^{m} \log_{2}(p_{k}^{m}), \quad k = 1, 2, \ldots, q(m)

where {Es}_{m}^{n} is the subband spectral entropy of the m-th subband signal of the n-th frame.
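The three formulas of step S206 can be followed literally in a few lines. In this sketch the input `x` stands for the points X_m^n(k) of one subband of one frame and is assumed to be non-negative, so that every probability is positive. The value `q_const = 1000.0` for Q is an arbitrary assumption, since the text only calls Q a large positive number, and the function name is hypothetical.

```python
import math


def subband_entropy(x, q_const=1000.0):
    """Follow the three formulas literally for one subband of one frame."""
    energy = sum(v ** 2 for v in x)                          # E_m^n
    probs = [(v + q_const) / (energy + q_const) for v in x]  # p_k^m
    entropy = sum(p * math.log2(p) for p in probs)           # Es_m^n (no minus sign here)
    return energy, probs, entropy


energy, probs, entropy = subband_entropy([1.0, 2.0, 3.0])
```

Note that the sum carries no leading minus sign at this stage; the sign is applied later, when the frame entropy H_n is formed.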

In step S207, the subband spectral entropies of the L frames before and the L frames after the n-th frame are taken, and for each subband the spectral entropies of these (2L+1) frames (frame offset l, -L ≤ l ≤ L) are sorted in ascending order. This gives {Es}_{(k),m}^{n}, the k-th value in the sorted sequence of the m-th subband spectral entropy over these (2L+1) frames.

The spectral entropy of the m-th subband of the n-th frame after smoothing by the filter is then given by:

{Eh}_{m}^{n} = (1 - λ) {Es}_{(k),m}^{n} + λ \cdot {Es}_{(k+1),m}^{n}, \quad 0 < k \leq 2L + 1, \; 0 < m \leq 6

where 0 < λ < 1. λ is called the sampling quantile of the order statistics filter and follows a Gaussian distribution. In the present embodiment, λ is 0.3 and L is 8.
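The order statistics filter of step S207 can be sketched for a single subband as below. The text does not say which rank k is used, so `k = 9` (the median of 17 values) is purely an assumed choice. Shortening the window at the start and end of the signal is also an assumption, and the function name `osf_smooth` is hypothetical.

```python
def osf_smooth(values, n, L=8, k=9, lam=0.3):
    """Order-statistics-filter output for frame index n (0-based) of one subband.

    values: subband spectral entropies of consecutive frames.
    k: 1-based rank in the sorted window; k = 9 is an assumed choice.
    """
    lo = max(0, n - L)
    hi = min(len(values), n + L + 1)
    window = sorted(values[lo:hi])   # ascending order
    k = min(k, len(window) - 1)      # keep rank k+1 inside a shortened window
    # Interpolate between the k-th and (k+1)-th order statistics.
    return (1 - lam) * window[k - 1] + lam * window[k]


entropies = [float(v) for v in range(20, 0, -1)]   # 20.0, 19.0, ..., 1.0
smoothed = osf_smooth(entropies, n=10)
```

In this example the window around frame 10 holds the values 2.0 to 18.0, so the result interpolates between the 9th and 10th sorted values, 10.0 and 11.0.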

In step S208, the spectral entropy H_{n} of the n-th frame is calculated, using the formula:

H_{n} = - \frac{1}{M} \cdot \sum_{m = 1}^{M} {Eh}_{m}^{n}

where M is 6.
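The frame entropy of step S208 is a negated average, which a short sketch makes explicit. The six input values are made up for illustration, and the function name is hypothetical.

```python
def frame_entropy(smoothed_subband_entropies):
    """H_n = -(1/M) * sum of the M smoothed subband entropies of frame n."""
    m_count = len(smoothed_subband_entropies)   # M = 6 in the embodiment
    return -sum(smoothed_subband_entropies) / m_count


# Made-up smoothed subband entropies Eh_m^n for m = 1..6
h_n = frame_entropy([-0.6, -0.5, -0.4, -0.3, -0.2, -0.1])
```

Because the subband entropies are sums of p·log2(p) without a minus sign, the negation here turns negative subband values into a positive frame entropy.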

In step S209, the initial threshold T is the mean of the spectral entropies of the first 10 frames multiplied by a correction factor a. When the spectral entropy of the n-th frame of the speech signal is greater than T, the current frame is judged to be a speech frame; otherwise it is judged to be a non-speech frame. In the present embodiment, a is 1.30.

In step S210, when the speech signal is detected to pass from a speech frame to a non-speech frame, the mean of the spectral entropies of 5 frames of the speech signal is taken again and multiplied by a correction factor b to serve as the threshold, which realizes the adaptive modification of the threshold. In the present embodiment, b is 1.06.
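Steps S209 and S210 can be sketched together as below. The text does not say which 5 frames are averaged when the threshold is updated; this sketch assumes they are the 5 most recent frames ending at the current one. It also presumes that the first 10 frames contain no speech, which the initial threshold seems to imply but the text does not state. The function name is hypothetical.

```python
def detect_with_adaptive_threshold(h, a=1.30, b=1.06, init_frames=10, adapt_frames=5):
    """Return one True/False speech label per frame entropy in h."""
    threshold = a * sum(h[:init_frames]) / init_frames   # S209: initial threshold T
    labels = []
    for n, value in enumerate(h):
        is_speech = value > threshold
        # S210: on a speech -> non-speech transition, recompute the threshold.
        # Assumption: the 5 frames averaged are the most recent ones, ending at n.
        if labels and labels[-1] and not is_speech:
            recent = h[max(0, n - adapt_frames + 1):n + 1]
            threshold = b * sum(recent) / len(recent)
        labels.append(is_speech)
    return labels


# Made-up frame entropies: 10 quiet frames, 5 louder frames, 5 quiet frames
h = [1.0] * 10 + [2.0] * 5 + [1.0] * 5
labels = detect_with_adaptive_threshold(h)
```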

In step S211, if n > N the algorithm ends; otherwise the method returns to step S204.

Fig. 3 shows the structure of the speech endpoint detection device provided by the embodiment of the invention. For ease of explanation, only the parts related to the invention are shown.

The detection device comprises:

A speech signal sampling and processing unit 31, configured to sample the input speech signal and preprocess the sampled speech signal;

A speech signal framing unit 32, configured to apply a Hamming window to the preprocessed speech signal to divide it into frames, denoted R_n (0 < n ≤ N), where N is the total number of frames;

A spectral information entropy calculation unit 33, configured to calculate the spectral information entropy of the n-th frame of the speech signal;

A speech frame determination unit 34, configured to judge the frame to be a speech frame if the spectral information entropy of the n-th frame is greater than a set threshold, and otherwise to judge it to be a non-speech frame.

In a preferred version of the embodiment of the invention, the speech signal sampling and processing unit 31 comprises:

A speech signal pre-emphasis module 311, configured to apply pre-emphasis to the sampled speech signal, which boosts the high-frequency part and flattens the signal spectrum to make spectral analysis easier;

A low-level influence reduction module 312, configured to reduce the low-level influence on the sampled speech signal by subtracting the median value from the signal so that the center axis of the speech waveform lies near zero;

A time-domain amplitude normalization module 313, configured to normalize the time-domain amplitude of the sampled speech signal.

In a preferred version of the embodiment of the invention, the Hamming window is expressed as:

W(n) = \begin{cases} 0.54 - 0.46 \cos [2 \pi n / (N - 1)], & 0 \leq n \leq N - 1 \\ 0, & \text{otherwise} \end{cases}

In a preferred version of the embodiment of the invention, the spectral information entropy calculation unit 33 comprises:

A speech signal decomposition module 331, configured to decompose the n-th frame R_n of the speech signal over five levels using a db3 wavelet basis function to obtain subband signals in different frequency ranges;

An FFT module 332, configured to apply an FFT to each subband signal to obtain the corresponding power spectrum;

A subband signal calculation module 333, configured to calculate the energy of each subband signal, the probability of each point of each subband signal, and the spectral entropy of each subband signal;

A spectral entropy smoothing module 334, configured to smooth the subband spectral entropies of several frames before and after the current frame with a set of order statistics filters;

A spectral entropy calculation module 335, configured to calculate the spectral entropy of each frame.

In a preferred version of the embodiment of the invention, the detection device further comprises:

A threshold setting unit 35, configured to set the threshold and to modify it adaptively. The initial value of the threshold is obtained by multiplying the mean of the spectral entropies of the first 10 frames of subband signals by a correction factor. When the signal passes from a speech frame to a non-speech frame, the threshold is adaptively modified by again taking the mean of the spectral entropies of several frames of the speech signal and multiplying it by a coefficient.

The beneficial effects of the present invention are:

1. Wavelet transform is perhaps the most convenient subband filtering method for obtaining subband signals in different frequency ranges, and the choice of wavelet basis function is very flexible.

2. The existing literature shows that, in low-SNR environments, algorithms based on the speech spectral entropy outperform methods based on energy. However, traditional spectral-entropy algorithms consider only the spectral information of the current frame, and in non-stationary noise the spectral information entropy of the noise fluctuates strongly, which makes threshold selection more difficult. The invention smooths the subband spectral entropies of several frames before and after the current frame with a set of order statistics filters, which overcomes this shortcoming of traditional spectral-entropy algorithms.

3. The invention adaptively modifies the chosen threshold, which increases the precision of endpoint detection.

The above are only preferred embodiments of the invention and are not intended to limit it. Any modification, equivalent replacement or improvement made within the spirit and principles of the invention shall fall within its scope of protection.

Claims

1. A speech endpoint detection method, in which the input speech signal is first sampled and the sampled speech signal is preprocessed, and a Hamming window is then applied to the preprocessed speech signal to divide it into frames, denoted R_n (0 < n ≤ N), where N is the total number of frames, characterized in that the detection method further comprises:

calculating the spectral information entropy of the n-th frame of the speech signal;

2. The detection method as claimed in claim 1, characterized in that preprocessing the sampled speech signal comprises:

applying pre-emphasis to the sampled speech signal, which boosts the high-frequency part and flattens the signal spectrum to make spectral analysis easier;

reducing the low-level influence on the sampled speech signal by subtracting the median value from the signal so that the center axis of the speech waveform lies near zero;

normalizing the time-domain amplitude of the sampled speech signal.

3. The detection method as claimed in claim 1, characterized in that calculating the spectral information entropy of the n-th frame of the speech signal comprises the following steps:

decomposing the n-th frame R_n of the speech signal over five levels using a wavelet basis function to obtain subband signals in different frequency ranges;

applying an FFT to each subband signal to obtain the corresponding power spectrum;

calculating the energy of each subband signal, the probability of each point of each subband signal, and the spectral entropy of each subband signal;

smoothing the subband spectral entropies of several frames before and after the current frame with a set of order statistics filters;

calculating the spectral entropy of each frame.

4. The detection method as claimed in claim 1, characterized in that the initial value of the threshold is obtained by multiplying the mean of the spectral entropies of the first 10 frames of subband signals by a correction factor.

5. The detection method as claimed in claim 1, characterized in that, when the signal passes from a speech frame to a non-speech frame, the threshold is adaptively modified by again taking the mean of the spectral entropies of several frames of the speech signal and multiplying it by a coefficient.

6. A speech endpoint detection device, characterized in that the detection device comprises:

7. The detection device as claimed in claim 6, characterized in that the speech signal sampling and processing unit comprises:

a speech signal pre-emphasis module, configured to apply pre-emphasis to the sampled speech signal, which boosts the high-frequency part and flattens the signal spectrum to make spectral analysis easier;

a low-level influence reduction module, configured to reduce the low-level influence on the sampled speech signal by subtracting the median value from the signal so that the center axis of the speech waveform lies near zero;

a time-domain amplitude normalization module, configured to normalize the time-domain amplitude of the sampled speech signal.

8. The detection device as claimed in claim 6, characterized in that the spectral information entropy calculation unit comprises:

a speech signal decomposition module, configured to decompose the n-th frame R_n of the speech signal over five levels using a db3 wavelet basis function to obtain subband signals in different frequency ranges;

an FFT module, configured to apply an FFT to each subband signal to obtain the corresponding power spectrum;

a subband signal calculation module, configured to calculate the energy of each subband signal, the probability of each point of each subband signal, and the spectral entropy of each subband signal;

a spectral entropy smoothing module, configured to smooth the subband spectral entropies of several frames before and after the current frame with a set of order statistics filters;

a spectral entropy calculation module, configured to calculate the spectral entropy of each frame.

9. The detection device as claimed in claim 6, characterized in that the detection device further comprises:

a threshold setting unit, configured so that, when the signal passes from a speech frame to a non-speech frame, the threshold is adaptively modified by again taking the mean of the spectral entropies of several frames of the speech signal and multiplying it by a coefficient.