CN108538310A

CN108538310A - A speech endpoint detection method based on long-term signal power spectrum variation

Info

Publication number: CN108538310A
Application number: CN201810266002.3A
Authority: CN
Inventors: 张涛; 刘阳; 任相赢
Original assignee: Tianjin University
Current assignee: Tianjin University
Priority date: 2018-03-28
Filing date: 2018-03-28
Publication date: 2018-09-14
Anticipated expiration: 2038-03-28
Also published as: CN108538310B

Abstract

A speech endpoint detection method based on long-term signal power spectrum variation, comprising: framing and windowing the input signal; computing the power spectrum of the framed and windowed signal; computing the long-term signal power spectrum variation value; making a threshold decision using the long-term signal power spectrum variation value; updating the threshold adaptively using the threshold decision results of the past 80 frames; and making a voting decision, in which the current target frame is the m-th frame and the long-term signal power spectrum variation value L_x(m) is jointly determined by the current frame and the R-1 frames preceding it, so that the current target frame takes part in R threshold decisions in total, whose results are labeled D_m, D_{m+1}, …, D_{m+R-1}; if more than 80% of these R threshold decisions indicate that speech frames are present, the current target frame is judged to be a speech frame, and otherwise a noise frame. The above process is repeated until the input signal ends. The invention can significantly improve detection accuracy in babble and machine gun noise environments.

Description

A speech endpoint detection method based on long-term signal power spectrum variation

Technical field

The present invention relates to a speech endpoint detection method, and more particularly to a speech endpoint detection method based on long-term signal power spectrum variation.

Background art

Speech endpoint detection refers to distinguishing speech segments from non-speech segments in a noisy environment. It is a key technology in speech signal processing fields such as speech coding, speech enhancement, and speech recognition.

At present, speech endpoint detection methods fall into two main classes: feature-based methods [1] and methods based on machine learning and pattern recognition. Feature-based methods have been widely studied and applied because they are simple and fast.

1. Speech endpoint detection based on short-time speech features

Early features used for speech endpoint detection mainly include short-time energy, average zero-crossing rate, spectral entropy, and cepstral distance. These methods perform well in environments with a high signal-to-noise ratio (SNR), but their detection performance drops sharply when the SNR is low. To improve the noise immunity and robustness of the algorithms, researchers have proposed a series of new methods, such as speech endpoint detection based on noise suppression and speech endpoint detection that combines Fisher linear discriminant analysis with Mel-frequency cepstral coefficients.

2. Speech endpoint detection based on long-term speech characteristics

Most of the above methods rely on short-time features of speech and do not fully consider its long-term variation. To make better use of the long-term characteristics of speech, Ghosh et al. proposed a detection method based on the long-term signal variability (LTSV) feature. This method adapts well to noise and can still distinguish speech segments from non-speech segments effectively at a very low SNR (-10 dB). Ma Y. et al. proposed speech endpoint detection based on the long-term spectral flatness measure (LSFM) feature, which distinguishes speech from noise by estimating the long-term spectral flatness of speech in different frequency bands; it improves accuracy under non-stationary noises such as babble and machine gun noise, as well as robustness across different noise environments. Although both methods are fairly robust across noise types, their detection performance at low SNR still leaves room for improvement, particularly for babble and machine gun noise, the two non-stationary noises on which they perform worst.

Summary of the invention

The technical problem to be solved by the invention is to provide a speech endpoint detection method based on long-term signal power spectrum variation that makes endpoint detection algorithms based on long-term speech features more robust across different noise environments and improves detection performance in noise environments such as babble and machine gun noise.

The technical solution adopted by the invention is a speech endpoint detection method based on long-term signal power spectrum variation, comprising the following steps:

1) framing and windowing the input signal;

2) computing the power spectrum of the framed and windowed signal;

3) computing the long-term signal power spectrum variation value;

4) making a threshold decision using the long-term signal power spectrum variation value;

5) updating the threshold, that is, adaptively updating the threshold using the threshold decision results of the past 80 frames;

6) voting decision: the current target frame is the m-th frame; the long-term signal power spectrum variation value L_x(m) is jointly determined by the current frame and the R-1 frames preceding it, so the current target frame takes part in R threshold decisions in total, whose results are labeled D_m, D_{m+1}, …, D_{m+R-1}; if more than 80% of these R threshold decisions indicate that speech frames are present, the current target frame is judged to be a speech frame, and otherwise a noise frame;

7) repeating steps 1) to 6) until the input signal ends.

In step 2), the classical periodogram method is applied to each frame of the input signal x(n): the short-time discrete Fourier transform of the signal is computed to obtain the power spectrum of the frame at frequency w_k. The power spectrum of the i-th frame at frequency w_k is expressed as follows:

In the formula, N_W denotes the frame length, N_SH denotes the frame shift, and h(l) denotes a window function of length N_W.
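As an illustration of step 2), the following is a minimal Python sketch of framing, windowing, and a periodogram. It is not taken from the patent: the Hamming window, the 1/N_W scaling, the direct DFT loop, and the function names `hamming` and `frame_power_spectra` are assumptions chosen for readability.

```python
import cmath
import math


def hamming(n_w):
    """Hamming window of length n_w (an assumed choice for h(l))."""
    if n_w == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * l / (n_w - 1))
            for l in range(n_w)]


def frame_power_spectra(x, n_w, n_sh):
    """Split x into frames of length n_w with shift n_sh, window each
    frame, and return one power spectrum (n_w bins) per frame."""
    h = hamming(n_w)
    spectra = []
    for start in range(0, len(x) - n_w + 1, n_sh):
        frame = [h[l] * x[start + l] for l in range(n_w)]
        spectrum = []
        for k in range(n_w):
            # Direct DFT at w_k = 2*pi*k/n_w; an FFT would be used in practice.
            acc = sum(frame[l] * cmath.exp(-2j * math.pi * k * l / n_w)
                      for l in range(n_w))
            # The 1/n_w scaling is an assumption, not stated in the text.
            spectrum.append(abs(acc) ** 2 / n_w)
        spectra.append(spectrum)
    return spectra
```

For a real input the spectrum is symmetric, so only the first n_w/2 + 1 bins carry independent information.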

The specific calculation in step 3) is as follows:

Wherein L_x(m) denotes the long-term signal power spectrum variation value of the m-th frame, N_FFT denotes the number of Fourier transform points, and the per-frequency term denotes the degree of power spectrum variation of the past R frames at the k-th frequency bin. It is obtained by averaging, over every pair of frames among the past R frames, the power spectrum variation at the k-th frequency bin; the corresponding formula is as follows:

Wherein S_x(j, w_k) and S_x(i, w_k) denote the power spectra of the j-th frame and the i-th frame at the k-th frequency bin, respectively.
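The formulas for step 3) are not reproduced in this text, so the sketch below is only one possible reading of the verbal definition: at each frequency bin, average a pairwise change measure over every pair of the R most recent frames, then average over bins. Using the absolute difference as the change measure, averaging over bins, and the name `long_term_variation` are all assumptions.

```python
from itertools import combinations


def long_term_variation(spectra, m, r):
    """Hypothetical L_x(m). spectra[i][k] is the power of frame i at bin k.
    Uses frames m-r+1 .. m, so it needs m >= r-1 and r >= 2."""
    window = spectra[m - r + 1:m + 1]
    pairs = list(combinations(range(r), 2))
    n_bins = len(window[0])
    total = 0.0
    for k in range(n_bins):
        # Mean pairwise change at bin k (absolute difference is an assumption).
        v_k = sum(abs(window[j][k] - window[i][k]) for i, j in pairs) / len(pairs)
        total += v_k
    # Average across frequency bins.
    return total / n_bins
```

Under this reading, a stationary stretch gives a value near zero, while a stretch whose spectrum changes from frame to frame gives a larger value.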

In step 4), the long-term signal power spectrum variation value L_x(m) is used to decide whether the current R frames contain speech frames. If L_x(m) exceeds the set threshold, speech frames are present and the flag D_m is set to 1; otherwise no speech frame is present and D_m is set to 0.

In step 5), two buffers B_N(m) and B_{S+N}(m) are designed, which store the long-term signal power spectrum variation values of the frames judged to be noise frames and speech frames, respectively, within the past 80 frames. The threshold is updated adaptively by the formula:

T(m) = α·min(B_{S+N}(m)) + (1-α)·max(B_N(m))

where α is a weight parameter.

The first 50 frames are taken as initial background noise, and the initial threshold is set from this background noise:

T_init = μ_N + p·σ_N

where μ_N and σ_N denote the mean and standard deviation, respectively, of the long-term signal power spectrum variation values of the 50 background noise frames, and p is a weighting coefficient.
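The threshold initialization and update can be sketched as follows. This is an illustrative sketch rather than the patent's implementation: the class name `AdaptiveThreshold`, the use of the population standard deviation, and keeping the previous threshold while either buffer is empty are assumptions. The defaults α = 0.3 and p = 3 are the values reported as best in the embodiment.

```python
from collections import deque
from statistics import mean, pstdev


def initial_threshold(noise_values, p=3.0):
    """T_init = mu_N + p * sigma_N over the initial background-noise frames."""
    return mean(noise_values) + p * pstdev(noise_values)


class AdaptiveThreshold:
    def __init__(self, t_init, alpha=0.3, history=80):
        self.threshold = t_init
        self.alpha = alpha
        # One (value, decision) entry per frame; the oldest entry drops out
        # automatically once `history` frames are stored.
        self.recent = deque(maxlen=history)

    def update(self, l_value):
        """Decide D_m for l_value, record it, and refresh the threshold."""
        d_m = 1 if l_value > self.threshold else 0
        self.recent.append((l_value, d_m))
        b_noise = [v for v, d in self.recent if d == 0]   # B_N(m)
        b_speech = [v for v, d in self.recent if d == 1]  # B_{S+N}(m)
        # Assumption: keep the old threshold if either buffer is empty.
        if b_noise and b_speech:
            self.threshold = (self.alpha * min(b_speech)
                              + (1 - self.alpha) * max(b_noise))
        return d_m
```

With α below 0.5, the updated threshold sits closer to the largest recent noise value than to the smallest recent speech value.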

The speech endpoint detection method based on long-term signal power spectrum variation of the present invention can significantly improve detection accuracy in babble and machine gun noise environments. By updating the threshold adaptively, it overcomes the poor environmental adaptability of a traditional fixed threshold. In experimental tests, the overall accuracy of the invention is better than that of the LTSV and LSFM speech endpoint detection methods. In the machine gun noise environment, the accuracy of the invention is substantially better than that of the LTSV and LSFM methods, with average detection accuracy improved by more than 10%.

Brief description of the drawings

Fig. 1 is a flow chart of the speech endpoint detection method based on long-term signal power spectrum variation of the present invention;

Fig. 2 is a schematic diagram of the voting decision in the present invention;

Fig. 3 shows the VAD results under different noise environments.

Detailed description

The speech endpoint detection method based on long-term signal power spectrum variation of the present invention is described in detail below with reference to an embodiment and the drawings.

The speech endpoint detection method based on long-term signal power spectrum variation of the present invention comprises the following steps:

1) framing and windowing the input signal: a speech signal is a typical non-stationary signal, but the vocal organs move very slowly compared with the speed of sound-wave vibration, so a speech signal is generally regarded as stationary over a period of 10 ms to 30 ms; the signal under test is therefore truncated and processed frame by frame;

2) computing the power spectrum of the framed and windowed signal: specifically, the classical periodogram method is applied to each frame of the input signal x(n), computing the short-time discrete Fourier transform of the signal to obtain the power spectrum of the frame at frequency w_k; the power spectrum of the i-th frame at frequency w_k is expressed as follows:

3) computing the long-term signal power spectrum variation value: this parameter is jointly determined by the power spectra of the current frame of the input signal x(n) and the R-1 frames preceding it, and reflects the non-stationarity of the signal's power spectrum over the past R frames; the specific calculation is as follows:

4) making a threshold decision using the long-term signal power spectrum variation value: L_x(m) is used to decide whether the current R frames contain speech frames; if L_x(m) exceeds the set threshold, speech frames are present and the flag D_m is set to 1; otherwise no speech frame is present and D_m is set to 0;

5) updating the threshold, that is, adaptively updating the threshold using the threshold decision results of the past 80 frames: specifically, two buffers B_N(m) and B_{S+N}(m) are designed, which store the long-term signal power spectrum variation values of the frames judged to be noise frames and speech frames, respectively, within the past 80 frames. The threshold is updated adaptively by the formula:

T(m) = α·min(B_{S+N}(m)) + (1-α)·max(B_N(m))

where α is a weight parameter; in simulation experiments, α = 0.3 gave the best results. The first 50 frames are taken as initial background noise, and the initial threshold is set from this background noise:

T_init = μ_N + p·σ_N

where μ_N and σ_N denote the mean and standard deviation, respectively, of the long-term signal power spectrum variation values of the 50 background noise frames, and p is a weighting coefficient; in simulation experiments, p = 3 gave the best results;

6) voting decision: because a long-term feature of the signal is used, the endpoint decision needs to take the information of neighboring frames into account. The voting decision is illustrated in Fig. 2. The current target frame is the m-th frame; the long-term signal power spectrum variation value L_x(m) is jointly determined by the current frame and the R-1 frames preceding it, so the current target frame takes part in R threshold decisions in total, whose results are labeled D_m, D_{m+1}, …, D_{m+R-1}; if more than 80% of these R threshold decisions indicate that speech frames are present, the current target frame is judged to be a speech frame, and otherwise a noise frame;

7) repeating steps 1) to 6) until the input signal ends.
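The voting rule in step 6) can be sketched in a few lines of Python. The function name `vote` and the list-based storage of the D values are illustrative assumptions; the strict "more than 80%" comparison follows the text.

```python
def vote(decisions, m, r, ratio=0.8):
    """Return 1 (speech) if more than `ratio` of D_m .. D_{m+r-1} equal 1,
    otherwise 0 (noise). decisions[i] holds the threshold decision D_i."""
    votes = decisions[m:m + r]
    if len(votes) < r:
        raise ValueError("need r decisions starting at frame m")
    return 1 if sum(votes) > ratio * r else 0
```

Because the vote for frame m uses decisions up to D_{m+R-1}, the label for frame m only becomes available once R-1 later frames have been processed.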

A specific example is given below.

Following the flow chart in Fig. 1, the speech endpoint detection method based on long-term signal power spectrum variation is analyzed on an example. The speech signals come from 20 speakers in the TIMIT speech corpus, 10 male and 10 female, with 10 sentences per speaker; the endpoints of each sentence are labeled manually (0 for noise segments, 1 for speech segments). Because TIMIT sentences are short (about 3.5 seconds) and consist mostly of speech, 1 second of silence is added before each sentence in the test, so that the feature parameters of the noise can be gathered and the decision threshold initialized. The noises are taken from the NOISEX-92 noise database; four are selected here: white, pink, babble, and machine gun. Algorithm performance is tested at SNRs of -5, 0, 5, and 10 dB, with detection accuracy as the performance metric. Detection accuracy is defined as:

where the number of erroneous frames comprises speech frames misjudged as noise frames and noise frames misjudged as speech frames.
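The accuracy formula itself is not reproduced in this text. A common frame-level definition that is consistent with the description of erroneous frames is the share of correctly labeled frames; the sketch below assumes that definition, and the function name `detection_accuracy` is hypothetical.

```python
def detection_accuracy(predicted, reference):
    """Percentage of frames whose 0/1 label matches the manual annotation.
    Assumed definition: (total frames - erroneous frames) / total frames."""
    if len(predicted) != len(reference) or not reference:
        raise ValueError("label sequences must be non-empty and equally long")
    # Erroneous frames: speech judged as noise plus noise judged as speech.
    errors = sum(1 for p, r in zip(predicted, reference) if p != r)
    return 100.0 * (len(reference) - errors) / len(reference)
```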

The example proceeds as follows:

1. Read the speech signal and perform framing and windowing, with 512 samples per frame, a 512-point Hamming window, and a frame shift of 256 samples.

2. Apply a 512-point Fourier transform to each windowed frame and compute the power spectrum S_x(i, ω_k) of each frame.

3. From the signal power spectrum S_x(i, ω_k), compute the long-term signal power spectrum variation value L_x(m) for each frame, and initialize the threshold T_init using the background noise at the start of the signal.

4. Make a threshold decision using L_x(m) to decide whether the current R frames contain speech frames. If L_x(m) exceeds the set threshold, speech frames are present and D_m is set to 1; otherwise no speech frame is present and D_m is set to 0.

5. Adaptively update the decision threshold using the threshold decision results of the past 80 frames.

6. Use the D_m values to make a voting decision for the current target frame. As shown in Fig. 2, among the R threshold decisions that involve the target frame, if more than 80% indicate that speech frames are present, the target frame is judged to be a speech frame, and otherwise a noise frame.
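A quick arithmetic check of the framing parameters above, assuming a 16 kHz sampling rate (TIMIT recordings are sampled at 16 kHz; the text does not state the rate used in the test). The helper name `frames_in` is hypothetical.

```python
def frames_in(n_samples, n_w=512, n_sh=256):
    """Number of complete frames of length n_w and shift n_sh in n_samples."""
    if n_samples < n_w:
        return 0
    return (n_samples - n_w) // n_sh + 1


FS = 16000                          # assumed sampling rate in Hz
silence_frames = frames_in(1 * FS)  # frames inside the 1 s of leading silence
frame_ms = 1000 * 512 / FS          # duration of one 512-sample frame in ms

print(silence_frames)  # 61
print(frame_ms)        # 32.0
```

Under this assumption, the 1 second of leading silence yields 61 complete frames, enough to cover the 50 frames used to initialize the threshold, and each frame spans 32 ms.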

Two utterances were selected at random from the TIMIT speech corpus; the VAD results under 0 dB noise are shown in Fig. 3. Panels a1, b1, c1, and d1 show the speech waveforms after adding white, pink, babble, and machine gun noise at 0 dB, respectively, and panels a2, b2, c2, and d2 show the corresponding VAD results.

Under noise environments with different SNRs, the speech endpoint detection accuracy of the LTSV-based method, the LSFM-based method, and the method based on the long-term signal power spectrum variation value was measured, as shown in Table 1. The table shows that under white, pink, and babble noise the three methods perform similarly, with the method based on the long-term signal power spectrum variation value slightly more accurate than the other two. Under machine gun noise, however, the accuracy of the method based on the long-term signal power spectrum variation value is clearly better than that of the other two methods.

Table 1. Result statistics

Claims

1. A speech endpoint detection method based on long-term signal power spectrum variation, characterized by comprising the following steps:

1) framing and windowing the input signal;

2) computing the power spectrum of the framed and windowed signal;

3) computing the long-term signal power spectrum variation value;

4) making a threshold decision using the long-term signal power spectrum variation value;

5) updating the threshold, that is, adaptively updating the threshold using the threshold decision results of the past 80 frames;

6) voting decision: the current target frame is the m-th frame; the long-term signal power spectrum variation value L_x(m) is jointly determined by the current frame and the R-1 frames preceding it, so the current target frame takes part in R threshold decisions in total, whose results are labeled D_m, D_{m+1}, …, D_{m+R-1}; if more than 80% of these R threshold decisions indicate that speech frames are present, the current target frame is judged to be a speech frame, and otherwise a noise frame;

7) repeating steps 1) to 6) until the input signal ends.

2. The speech endpoint detection method based on long-term signal power spectrum variation according to claim 1, characterized in that in step 2) the classical periodogram method is applied to each frame of the input signal x(n), computing the short-time discrete Fourier transform of the signal to obtain the power spectrum of the frame at frequency w_k, the power spectrum of the i-th frame at frequency w_k being expressed as follows:

In the formula, N_W denotes the frame length, N_SH denotes the frame shift, and h(l) denotes a window function of length N_W.

3. The speech endpoint detection method based on long-term signal power spectrum variation according to claim 1, characterized in that the specific calculation in step 3) is as follows:

Wherein L_x(m) denotes the long-term signal power spectrum variation value of the m-th frame, N_FFT denotes the number of Fourier transform points, and the per-frequency term denotes the degree of power spectrum variation of the past R frames at the k-th frequency bin, obtained by averaging, over every pair of frames among the past R frames, the power spectrum variation at the k-th frequency bin; the corresponding formula is as follows:

4. The speech endpoint detection method based on long-term signal power spectrum variation according to claim 1, characterized in that in step 4) the long-term signal power spectrum variation value L_x(m) is used to decide whether the current R frames contain speech frames; if L_x(m) exceeds the set threshold, speech frames are present and the flag D_m is set to 1; otherwise no speech frame is present and D_m is set to 0.

5. The speech endpoint detection method based on long-term signal power spectrum variation according to claim 1, characterized in that in step 5) two buffers B_N(m) and B_{S+N}(m) are designed, which store the long-term signal power spectrum variation values of the frames judged to be noise frames and speech frames, respectively, within the past 80 frames, and the threshold is updated adaptively by the formula:

T(m) = α·min(B_{S+N}(m)) + (1-α)·max(B_N(m))

where α is a weight parameter.

T_init = μ_N + p·σ_N

where μ_N and σ_N denote the mean and standard deviation, respectively, of the long-term signal power spectrum variation values of the 50 background noise frames, and p is a weighting coefficient.