CN103489454B - Voice endpoint detection method based on waveform morphological feature clustering - Google Patents

Voice endpoint detection method based on waveform morphological feature clustering

Info

Publication number
CN103489454B
CN103489454B (application CN201310432146.9A; publication of application CN103489454A)
Authority
CN
China
Prior art keywords
sound
subsegment
signal
cluster
speech signal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201310432146.9A
Other languages
Chinese (zh)
Other versions
CN103489454A (en)
Inventor
杨莹春
赵启明
吴朝晖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN201310432146.9A priority Critical patent/CN103489454B/en
Publication of CN103489454A publication Critical patent/CN103489454A/en
Application granted granted Critical
Publication of CN103489454B publication Critical patent/CN103489454B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Abstract

The invention discloses a voice endpoint detection method based on waveform morphological feature clustering, comprising the steps of: S01, obtaining a clean speech signal from the original speech signal; S02, obtaining the envelope signal of the clean speech signal and dividing the envelope signal into sound subsegments; S03, clustering the sound subsegments according to the waveform morphological feature of each subsegment and removing the non-speech subsegments; S04, processing all sound subsegments retained in step S03 to obtain the voice endpoints. Using a comparatively simple unsupervised clustering method and a single feature, the invention obtains good results quickly and accurately.

Description

Voice endpoint detection method based on waveform morphological feature clustering
Technical field
The present invention relates to the field of voice endpoint detection, and in particular to a voice endpoint detection method based on waveform morphological feature clustering.
Background technology
Voiceprint recognition technology has now reached a fairly high level, and voice endpoint detection is a necessary step in speech analysis, speech synthesis and speaker recognition. Endpoint detection has achieved reasonable results in voice response and speech recognition systems, and many endpoint detection techniques now exist. The main features used include short-time energy, zero-crossing rate, information entropy, subband energy, pitch, time-domain parameters, frequency-domain parameters and cepstral parameters; the modeling methods are equally varied, chiefly double thresholds, neural networks, wavelet models and hidden Markov models. However, because of problems such as background noise in real environments, the results still fall short of expectations.
Patent document CN102148030A discloses an endpoint detection method for speech recognition, which comprises: collecting background noise and the noisy speech signal; analyzing the characteristics of the background noise and the noisy speech signal; extracting the parameters of a linear prediction model of the background noise, i.e. its LPC (linear predictive coding) coefficients, as a background-noise linear prediction template; and determining the endpoints of the noisy speech signal by comparing the linear prediction coefficients of each frame of noisy speech with the parameters of the background-noise template, the comparison being processed into an eigenvalue. When the change of this eigenvalue exceeds a set value, it is taken as the mark of a detected voice endpoint; the background-noise linear prediction model serving as the noise template can also be revised according to changes in the background noise. The method performs endpoint detection of human speech well in environments with background noise.
The defect of that method is that multiple characteristic parameters must be obtained and the computation is complex. How to perform reasonably accurate voice endpoint detection with a single characteristic parameter in the presence of background noise is therefore a problem to be solved. At the same time, when the background noise is low, faster endpoint detection is desired.
Summary of the invention
In order to perform good voice endpoint detection in the presence of background noise using a single waveform morphological feature, the invention provides a voice endpoint detection method based on waveform morphological feature clustering, comprising the steps of:
S01, obtaining a clean speech signal from the original speech signal;
S02, obtaining the envelope signal of the clean speech signal and dividing the envelope signal into sound subsegments;
S03, clustering the sound subsegments according to the waveform morphological feature of each subsegment, removing the non-speech subsegments from the clustering result, and retaining the remainder;
S04, processing all sound subsegments retained in step S03 to obtain the voice endpoints.
Here, non-speech subsegments are the silent portions of the speech signal outside the endpoint information together with the background-noise portions. The core techniques of the invention are obtaining clean speech with a speech-enhancement technique and a threshold, obtaining the envelope signal of the speech by filtering, and clustering the sound subsegments with an unsupervised clustering method; in the presence of background noise the invention filters out the noise well and obtains the voice endpoints from the original speech signal.
The step of obtaining the clean speech signal in step S01 is: perform speech enhancement on the original speech signal to obtain a contrast signal, and compute the signal-to-noise ratio (SNR) from the contrast signal and the original speech signal; if the SNR is greater than a set threshold, take the original speech signal as the clean speech signal; if the SNR is less than the set threshold, take the contrast signal as the clean speech signal.
When background noise is present, the subsequent clustering of sound subsegments degrades and the endpoint detection performance drops significantly, so it is necessary to pre-check whether the current speech signal is a clean speech signal.
The speech enhancement applied to the original speech signal is one of the following methods: maximum a posteriori estimation, Kalman filtering, comb filtering, Wiener filtering, spectral subtraction, minimum mean-square error estimation of the short-time spectral amplitude, adaptive filtering, hidden Markov model methods, wavelet transform, neural networks, auditory masking and fractal theory. All of these are prior art, and a person skilled in the art can use any of them to enhance the original speech signal.
The envelope signal of the clean speech signal in step S02 is obtained by IIR filtering, the Hilbert transform or the analytic wavelet transform; a person skilled in the art can obtain the envelope signal of the clean speech signal with any of these methods.
In step S02 the maxima and minima of the envelope signal are found, and the envelope signal is divided into sound subsegments at the minima: the span between two adjacent minima positions is one sound subsegment, and the maximum position within it is the crest of that subsegment. Finding the maxima and minima of the envelope thus yields each sound subsegment and its crest amplitude, the waveform morphological feature of the subsegment.
In step S03 the crest amplitude of each sound subsegment is selected as the waveform morphological feature; the sound subsegments are sorted by crest amplitude in descending order and then clustered, the last part of the clustering result is removed as non-speech subsegments, and the remainder is retained. Waveform morphological features include the crest amplitude, the mean amplitude of the band, the band area and the crest factor; preferably, the crest amplitude is used as the feature for clustering. Because the subsegments are sorted by descending crest amplitude before clustering, the parts of the clustering result are likewise ordered by descending crest amplitude, so the last part is removed as non-speech subsegments and the preceding parts are retained.
The clustering algorithm for the sound subsegments is one of: hierarchical clustering, the K-means algorithm, the K-modes algorithm, fuzzy clustering, graph-theoretic algorithms, grid- and density-based clustering algorithms, and the ACODF algorithm. A person skilled in the art can cluster the sound subsegments with any of these methods.
In step S04 the processing is: sort all sound subsegments retained in step S03 into time order, and connect adjacent subsegments whose time interval is below a threshold, obtaining the voice endpoints. Because clustering the subsegments by waveform morphological feature changes their time order within the parts of the result, after clustering all retained subsegments are re-sorted into chronological order and then connected according to the threshold on the interval between adjacent subsegments; each resulting independent sound segment gives the voice endpoints.
In step S04 the threshold for the interval between adjacent subsegments ranges from 0.08 s to 0.3 s. The threshold is determined from an ordinary person's speaking rate: taking 200 words per minute as the upper limit, a single word lasts 0.3 s, which sets the upper bound of the threshold.
Using a comparatively simple unsupervised clustering method and a single feature, the invention obtains good results quickly and accurately.
Accompanying drawing explanation
Fig. 1 is the flow chart of the endpoint detection of one embodiment of the invention;
Fig. 2 shows the detailed steps of the sound-subsegment clustering of this embodiment;
Fig. 3 is the original speech signal of this embodiment;
Fig. 4 shows the endpoint information displayed on the envelope signal of this embodiment; only the data within the endpoints are shown;
Fig. 5 shows the endpoint information displayed on the original speech signal of this embodiment; only the data within the endpoints are shown.
Embodiment
The endpoint detection of the invention based on waveform morphological feature clustering, using five-way clustering, is described in detail below with reference to the drawings and an embodiment.
The experimental data of this example are the telephone data in the male train and test portions of the NIST Speaker Recognition Evaluation (SRE) 2004, 2006 and 2008. The 2004 train telephone set contains 248 utterances and the 2004 test telephone set 1606 utterances; the 2006 train telephone set contains 354 utterances; and the 2008 train telephone set contains 648 utterances. NIST provides correct endpoint annotations for all 2004 and 2006 speech data, which can therefore be used to measure the error rate of the invention. Below, male_train_telephone denotes the telephone data in the male train portion and male_test_telephone the telephone data in the male test portion. The speech data format is 8000 Hz sampling rate, 16-bit quantization, single-channel WAV. The experimental environment is MATLAB 2012.
Fig. 1 shows an embodiment of the voice endpoint detection method based on waveform morphological feature clustering; the steps are as follows:
Step S01, obtain the clean speech signal.
When background noise is present, the subsequent clustering of sound subsegments degrades and the endpoint detection performance drops significantly, so it is necessary to pre-check whether the current speech signal is clean. The original speech signal is enhanced to produce the contrast signal; the enhancement is one of the following methods: maximum a posteriori estimation, Kalman filtering, comb filtering, Wiener filtering, spectral subtraction, minimum mean-square error estimation of the short-time spectral amplitude, adaptive filtering, hidden Markov model methods, wavelet transform, neural networks, auditory masking and fractal theory. This embodiment enhances the original speech signal with Wiener filtering: when the original speech signal is fed into the Wiener filter, the clean speech signal free of background noise is rendered as accurately as possible, so Wiener filtering is preferred for the enhancement. The SNR of the sound is then computed: if the SNR is at or above a preferred threshold, the current signal is considered clean; otherwise the filtered data are taken as the clean speech signal. The SNR threshold was determined by computing and observing the SNR of audio files over a large number of call scenarios: an SNR above 9.2 indicates a noise-free (clean) signal, and below 9.2 a noisy one.
1-1. Compute the SNR of each utterance in the NIST 2004, 2006 and 2008 male_train/test_telephone data; the concrete computation is as follows:
1. Apply Wiener filtering to the initial original speech signal x_i (i = 1, 2, ..., M) to obtain the filtered (contrast) signal sy_i (i = 1, 2, ..., m), where M is the length of the original speech signal, m is the length of the Wiener-filtered signal, and the subscript i denotes the i-th sample.
The transfer function of the Wiener filter is

G(k) = \left( \frac{E\{|S(k)|^2\}}{E\{|S(k)|^2\} + \beta\, E\{|W(k)|^2\}} \right)^{\alpha}

where S(k) is the Fourier transform of the clean speech signal, W(k) is the Fourier transform of the additive noise signal, k denotes the k-th frequency bin, and E{·} denotes mathematical expectation. When Wiener filtering is applied to the noisy speech signal, a better filtering effect is obtained by adjusting the values of α and β. The Wiener filter parameters are determined by the minimum mean-square error criterion.
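As an illustration only, the following MATLAB sketch applies a Wiener-type gain of the above form to a single 256-sample frame. The assumptions are not specified by the patent: the first 0.25 s of the recording x are taken to be noise-only, E{|S(k)|^2} is crudely estimated by power subtraction, and framing/overlap-add are omitted.

% Hedged single-frame illustration of the parametric Wiener gain G(k).
% Assumptions: x is the original signal (column vector) and its first
% 0.25 s contain only noise; one 256-sample frame is processed.
fs    = 8000;                           % sampling rate of the test data
nLead = round(0.25 * fs);
W2    = abs(fft(x(1:256))).^2;          % noise power spectrum, E{|W(k)|^2}
X     = fft(x(nLead+1 : nLead+256));    % spectrum of one noisy frame
S2    = max(abs(X).^2 - W2, 0);         % crude estimate of E{|S(k)|^2}
alpha = 1; beta = 1;                    % tuning parameters of the gain
G     = (S2 ./ (S2 + beta*W2 + eps)).^alpha;
syFrame = real(ifft(G .* X));           % enhanced frame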
2. Truncate the initial original speech signal x_i (i = 1, 2, ..., M) to the length of sy_i (i = 1, 2, ..., m): take the amplitudes of its first m samples as the original-signal amplitudes sx_i (i = 1, 2, ..., m); in the following steps sx serves as the original speech signal,

sx_i = x_i, \qquad i = 1, 2, \ldots, m
3. Compute the SNR of the speech data:

\mathrm{SNR} = 10 \log_{10}\!\left( \sum_{i=1}^{m} sx_i^2 \Big/ \sum_{i=1}^{m} (sx_i - sy_i)^2 \right)
1-2. When the SNR is above 9.2 the signal is assured to be clean speech, but an SNR below 9.2 does not necessarily mean the speech is noisy.

sz_i = \begin{cases} sy_i, & \mathrm{SNR} < 9.2 \\ sx_i, & \mathrm{SNR} \ge 9.2 \end{cases} \qquad i = 1, 2, \ldots, m

where sz_i (i = 1, 2, ..., m) is the clean speech signal obtained.
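The whole pre-check of step S01 can be summarized by the following minimal MATLAB sketch; wienerEnhance is a hypothetical placeholder for the chosen speech-enhancement front end (for example a framed version of the gain above), not a built-in function:

% Minimal sketch of step S01, assuming an enhancement routine is given.
sy = wienerEnhance(x);                  % contrast signal, length m
m  = length(sy);
sx = x(1:m);                            % truncate original signal to length m
snrDb = 10 * log10(sum(sx.^2) / sum((sx - sy).^2));
if snrDb >= 9.2
    sz = sx;                            % signal already clean: keep original
else
    sz = sy;                            % noisy: use the enhanced signal
end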
Step S02, obtain the sound subsegments.
The envelope signal is computed from the clean speech signal obtained; it may be obtained by IIR filtering, the Hilbert transform or the analytic wavelet transform. IIR filtering is fast, and the envelope it extracts reflects the overall trend of the original speech signal accurately, so this embodiment uses IIR filtering: first take the absolute value of the speech-signal amplitudes sz_i (i = 1, 2, ..., m) to obtain the rectified signal, then construct an IIR filter and filter the rectified signal to obtain the envelope signal. The maxima and minima of the envelope signal are then found and the envelope is divided into sound subsegments: the span between two adjacent minima positions is one sound subsegment, and the maximum position within it is the crest of that subsegment.
2-1. Compute the envelope signal of the clean speech signal obtained in step 1-1.
1. Take the absolute value of the clean speech signal:

sw_i = \begin{cases} sz_i, & sz_i \ge 0 \\ -sz_i, & sz_i < 0 \end{cases} \qquad i = 1, 2, \ldots, m

where sz_i (i = 1, 2, ..., m) is the clean speech signal from step 1-1 and sw_i (i = 1, 2, ..., m) is the data obtained by taking its absolute value.
2. Use the filter-design function butter to construct the parameters of the filter function filter: from the filter order n and cutoff frequency W_n, compute the Butterworth numerator coefficients b_i (i = 1, 2, ..., n+1) and denominator coefficients a_i (i = 1, 2, ..., n+1). Preferably n = 3 and W_n = 10/f_n with f_n = f_s/2, where f_n is the Nyquist frequency, i.e. half the data sampling frequency f_s;
3. Use the filter function filter to filter the rectified signal sw_i (i = 1, 2, ..., m) and obtain the envelope signal so_i (i = 1, 2, ..., m), given by the difference equation

a_1\, so_j = b_1\, sw_j + b_2\, sw_{j-1} + \cdots + b_{nb+1}\, sw_{j-nb} - a_2\, so_{j-1} - \cdots - a_{na+1}\, so_{j-na}, \qquad j = 1, 2, \ldots, m

where na and nb both equal the filter order n, {a_1, a_2, ..., a_{na+1}} are the coefficients of the output so_i (i = 1, 2, ..., m), and {b_1, b_2, ..., b_{nb+1}} are the coefficients of the input; if a_1 is not 1, the function filter normalizes it to 1.
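Steps 2-1(1)-(3) map directly onto MATLAB's butter and filter functions; a minimal sketch with the preferred parameters (n = 3, cutoff 10 Hz at f_s = 8000 Hz) follows:

% Minimal sketch of the envelope extraction of step 2-1.
fs = 8000;                       % sampling frequency of the test data
sw = abs(sz);                    % step 1: rectify the clean speech signal
[b, a] = butter(3, 10/(fs/2));   % step 2: 3rd-order Butterworth, Wn = 10/fn
so = filter(b, a, sw);           % step 3: IIR filtering gives the envelope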
2-2. Obtain the sound subsegments and the amplitudes at their crests.
1. Take the sign of the first difference of the voice envelope signal so_i (i = 1, 2, ..., m):

f1_i = \operatorname{sign}(so_{i+1} - so_i), \qquad i = 1, 2, \ldots, m-1

2. Take the first difference of f1_i (i = 1, 2, ..., m-1):

f2_j = f1_{j+1} - f1_j, \qquad j = 1, 2, \ldots, m-2

3. Pad f2_j (j = 1, 2, ..., m-2) with a zero at each end to obtain f3_k (k = 1, 2, ..., m):

f3_k = \begin{cases} 0, & k = 1 \\ f2_{k-1}, & k = 2, 3, \ldots, m-1 \\ 0, & k = m \end{cases}

4. The positions where f3_k (k = 1, 2, ..., m) equals -2 are the positions of the maxima of the envelope signal, and the positions where it equals 2 are the positions of the minima.

5. Two adjacent minima give the start and end positions of a sound subsegment, and the maximum between them is the crest of that subsegment.
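A minimal MATLAB sketch of steps 2-2(1)-(5) follows; note that the ±2 test works because the first difference is reduced to its sign. The variable names crestAmp and segLen are introduced here for illustration:

% Minimal sketch of the subsegment/crest extraction of step 2-2.
so = so(:);                            % work with a column vector
f1 = sign(diff(so));                   % step 1: +1 rising, -1 falling
f3 = [0; diff(f1); 0];                 % steps 2-3: second difference, zero-padded
maxPos = find(f3 == -2);               % step 4: envelope maxima
minPos = find(f3 ==  2);               %          envelope minima
nSeg = numel(minPos) - 1;              % step 5: subsegments between minima
crestAmp = zeros(nSeg, 1);
segLen   = zeros(nSeg, 1);
for k = 1:nSeg
    seg = minPos(k):minPos(k+1);       % one sound subsegment
    crestAmp(k) = max(so(seg));        % crest amplitude of the subsegment
    segLen(k)   = numel(seg);          % subsegment length in samples
end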
Step S03, cluster the sound subsegments by crest amplitude, remove the non-speech subsegments from the clustering result, and retain the remainder.
As shown in Fig. 2, the concrete steps are as follows:
1. Cluster the sound subsegments by waveform morphological feature. Preferably, the crest amplitude is selected as the waveform morphological feature of each subsegment; the subsegments are first sorted by descending crest amplitude, giving the samples to be clustered.
2. The unsupervised clustering of the samples may use hierarchical clustering, the K-means algorithm, the K-modes algorithm, fuzzy clustering, graph-theoretic algorithms, grid- and density-based clustering algorithms, the ACODF algorithm, etc. The K-means algorithm is simple and fast, which speeds up processing when applied to voice endpoint detection, so K-means is preferred here. The samples are clustered into five classes (five-way clustering); in the order of descending crest amplitude the five classes are class1, class2, class3, class4 and class5, and num_class1, num_class2, num_class3 and num_class4 are the numbers of sound subsegments in class1 through class4 respectively. Their sum is total_num = num_class1 + num_class2 + num_class3 + num_class4.
The first four classes of the five-way clustering, i.e. the first total_num samples, are retained; the total length of these sound subsegments gives the result of the first five-way clustering, time_K-means_five_interval_1:

time\_K\text{-}means\_five\_interval\_1 = \sum_{i=1}^{num\_class1} L_{interval}(i) + \sum_{i=1}^{num\_class2} L_{interval}(i) + \sum_{i=1}^{num\_class3} L_{interval}(i) + \sum_{i=1}^{num\_class4} L_{interval}(i)

where L_interval(i) denotes the length of each sound subsegment in the first four classes of the five-way clustering result.
The purpose of the five-way clustering is to reject non-speech subsegments. Combined with the observed behavior of subsegment crest amplitudes (the class with the lowest crest amplitudes is very likely non-speech), the higher-amplitude part of the classification result should be retained. Because the subsegments were sorted by descending crest amplitude before clustering, the parts of the clustering result are likewise in descending order of crest amplitude, with the lowest-amplitude part last. Experiments show that retaining the first four classes of the clustering works best. Likewise, four-way and three-way clustering retain only the first three and the first two classes respectively.
If time_K-means_five_interval_1 is less than a certain proportion of the total length of all sound subsegments (in this embodiment, preferably 60%), the result of the first five-way clustering is taken as the final clustering result. Unsupervised clustering is somewhat unstable; in our experiments about 95% of the utterances obtained their final clustering result from the first five-way clustering.
3. If time_K-means_five_interval_1 exceeds this proportion of the total subsegment length (preferably 60% in this embodiment), five-way clustering is performed again in the same manner, giving a retained total length time_K-means_five_interval_2; if time_K-means_five_interval_2 is below the 60% proportion, this second five-way clustering is taken as the clustering result. Otherwise the smaller of the two five-way results, denoted time_K-means_five_interval, is taken as the five-way clustering result and processing continues with step 4.
4. In the same manner, cluster the sound subsegments into four classes and retain the first three, of total length time_K-means_four_interval. If (time_K-means_five_interval - time_K-means_four_interval) < time_K-means_five_interval × 0.15, the five-way clustering result of step 3 is used. Otherwise cluster the subsegments into three classes, the first two of which have total subsegment length time_K-means_three_interval; if (time_K-means_four_interval - time_K-means_three_interval) < time_K-means_five_interval × 0.2, the four-way clustering is taken as the clustering result and its first three classes are retained; otherwise the three-way clustering is taken as the clustering result and its first two classes are retained.
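Assuming the crestAmp and segLen vectors from the sketch in step 2-2, the first five-way clustering and its retention rule might look like this in MATLAB (kmeans is in the Statistics Toolbox; the cluster labels are re-ranked by mean amplitude because kmeans assigns them arbitrarily):

% Minimal sketch of the first five-way K-means clustering of step S03.
[amp, order] = sort(crestAmp, 'descend');       % sort by crest amplitude
idx = kmeans(amp, 5);                           % five-way clustering
classMean = accumarray(idx, amp, [5 1], @mean); % mean amplitude per class
[~, byAmp] = sort(classMean, 'descend');        % classes, largest mean first
keep = ismember(idx, byAmp(1:4));               % retain the first four classes
keptSeg = order(keep);                          % retained subsegment indices
time_five_1 = sum(segLen(keptSeg));             % retained total length
if time_five_1 < 0.6 * sum(segLen)
    % accept this five-way clustering as the final clustering result
end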
Step S04, process the sound subsegments retained in step S03 to obtain the voice endpoints.
After the subsegments are sorted by crest amplitude and clustered, their time order changes; therefore, after clustering, all subsegments in the clustering result are re-sorted into chronological order, and the sounds are then connected according to the time interval between adjacent subsegments. The threshold is determined by an ordinary person's speaking rate: taking 200 words per minute as the upper limit, a single word lasts 0.3 s, which bounds the threshold from above. For continuity of the speech, this embodiment sets the threshold to 0.1 s; connecting adjacent subsegments whose interval is below 0.1 s yields the final endpoint information.
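Given start and end sample indices segStart and segEnd of the retained subsegments (an assumed representation; these are straightforward to carry along from step 2-2), step S04 reduces to a single pass in time order. A minimal MATLAB sketch:

% Minimal sketch of step S04: restore time order, then connect
% subsegments whose gap is below the 0.1 s threshold.
fs = 8000; gap = 0.1 * fs;                      % threshold in samples
[segStart, ord] = sort(segStart); segEnd = segEnd(ord);
endpoints = [segStart(1), segEnd(1)];           % first segment
for k = 2:numel(segStart)
    if segStart(k) - endpoints(end, 2) < gap
        endpoints(end, 2) = segEnd(k);          % bridge the small gap
    else
        endpoints(end+1, :) = [segStart(k), segEnd(k)]; %#ok<AGROW>
    end
end
% each row of endpoints is one detected [start, end] voice segment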
The running time of the method depends strongly on the length of the original speech data; the data in the two data sets tested here are each about 3 minutes long. With the waveform crest amplitude as the feature, the measured running times and results are given in Table 1, where the MINE column corresponds to the voice endpoint detection results of the method of the invention. Other features tested were the band amplitude mean, the band area, the crest factor, and the combination of these five features; preferably, the waveform crest amplitude is chosen as the waveform morphological feature. For comparison, the VQVAD method proposed by Tomi Kinnunen and Padmanabhan Rajan in the paper "A Practical, Self-Adaptive Voice Activity Detector for Speaker Verification with Noisy Telephone and Microphone Data", and the energy-based detection of the open-source platform ALIZE, were also run.
The computing platform of the experiment is a PC with a Core i3-2130 3.3 GHz processor and 8 GB of DDR3 memory. Of the three steps, speech enhancement occupies more than 90% of the processing time; when the speech whose endpoints are sought can be determined to be clean, this step can be skipped, in which case the processing time per utterance is within 0.5 s.
As can be seen from the table, the voice endpoint detection method of the invention has a faster processing speed and a reduced error rate.
Using a comparatively simple unsupervised clustering method and a single feature, the invention obtains good results quickly and accurately.

Claims (7)

1. A voice endpoint detection method based on waveform morphological feature clustering, characterized in that it comprises the steps of:
S01, obtaining a clean speech signal from the original speech signal;
S02, obtaining the envelope signal of the clean speech signal, finding the maxima and minima of the envelope signal, and dividing the envelope signal into sound subsegments at the minima, the span between two adjacent minima positions being one sound subsegment and the maximum position within it being the crest of the subsegment;
S03, clustering the sound subsegments according to the waveform morphological feature of each subsegment, removing the non-speech subsegments from the clustering result, and retaining the remainder;
S04, sorting all sound subsegments retained in step S03 into time order and connecting adjacent subsegments whose time interval is below a threshold, obtaining the voice endpoints.
2. The voice endpoint detection method based on waveform morphological feature clustering according to claim 1, characterized in that the step of obtaining the clean speech signal in step S01 is: performing speech enhancement on the original speech signal to obtain a contrast signal, and computing the signal-to-noise ratio (SNR) from the contrast signal and the original speech signal; if the SNR is greater than a set threshold, taking the original speech signal as the clean speech signal; if the SNR is less than the set threshold, taking the contrast signal as the clean speech signal.
3. The voice endpoint detection method based on waveform morphological feature clustering according to claim 2, characterized in that the speech enhancement applied to the original speech signal is one of the following methods: maximum a posteriori estimation, Kalman filtering, comb filtering, Wiener filtering, spectral subtraction, minimum mean-square error estimation of the short-time spectral amplitude, adaptive filtering, hidden Markov model methods, wavelet transform, neural networks, auditory masking and fractal theory.
4. The voice endpoint detection method based on waveform morphological feature clustering according to claim 1, characterized in that the envelope signal of the clean speech signal in step S02 is obtained by IIR filtering, the Hilbert transform or the analytic wavelet transform.
5. The voice endpoint detection method based on waveform morphological feature clustering according to claim 1, characterized in that in step S03 the crest amplitude of each sound subsegment is selected as the waveform morphological feature; the sound subsegments are sorted by descending crest amplitude and then clustered, the last part of the clustering result is removed as non-speech subsegments, and the remainder is retained.
6. The voice endpoint detection method based on waveform morphological feature clustering according to claim 1, characterized in that in step S03 the clustering algorithm for the sound subsegments is one of the following: hierarchical clustering, the K-means algorithm, the K-modes algorithm, fuzzy clustering, graph-theoretic algorithms, grid- and density-based clustering algorithms, and the ACODF algorithm.
7. The voice endpoint detection method based on waveform morphological feature clustering according to claim 1, characterized in that in step S04 the threshold for the interval between adjacent subsegments ranges from 0.08 s to 0.3 s.
CN201310432146.9A 2013-09-22 2013-09-22 Voice endpoint detection method based on waveform morphological feature clustering Active CN103489454B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310432146.9A CN103489454B (en) 2013-09-22 2013-09-22 Voice endpoint detection method based on waveform morphological feature clustering

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310432146.9A CN103489454B (en) 2013-09-22 2013-09-22 Voice endpoint detection method based on waveform morphological feature clustering

Publications (2)

Publication Number Publication Date
CN103489454A CN103489454A (en) 2014-01-01
CN103489454B true CN103489454B (en) 2016-01-20

Family

ID=49829633

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310432146.9A Active CN103489454B (en) 2013-09-22 2013-09-22 Voice endpoint detection method based on waveform morphological feature clustering

Country Status (1)

Country Link
CN (1) CN103489454B (en)

Families Citing this family (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105091208B (en) * 2014-05-23 2017-11-14 美的集团股份有限公司 Air conditioner wind speed control method and system
US20160111107A1 (en) * 2014-10-21 2016-04-21 Mitsubishi Electric Research Laboratories, Inc. Method for Enhancing Noisy Speech using Features from an Automatic Speech Recognition System
CN104867493B (en) * 2015-04-10 2018-08-03 武汉工程大学 Multifractal Dimension end-point detecting method based on wavelet transformation
CN106971725B (en) * 2016-01-14 2021-06-15 芋头科技(杭州)有限公司 Voiceprint recognition method and system with priority
CN105825871B (en) * 2016-03-16 2019-07-30 大连理工大学 A kind of end-point detecting method without leading mute section of voice
CN107561376A (en) * 2016-06-30 2018-01-09 中兴通讯股份有限公司 A kind of method and device of power supply noise measurement
CN106205624B (en) * 2016-07-15 2019-10-15 河海大学 A kind of method for recognizing sound-groove based on DBSCAN algorithm
CN106611598B (en) * 2016-12-28 2019-08-02 上海智臻智能网络科技股份有限公司 A kind of VAD dynamic parameter adjustment method and device
CN107045870B (en) * 2017-05-23 2020-06-26 南京理工大学 Speech signal endpoint detection method based on characteristic value coding
CN107393558B (en) * 2017-07-14 2020-09-11 深圳永顺智信息科技有限公司 Voice activity detection method and device
CN107799126B (en) * 2017-10-16 2020-10-16 苏州狗尾草智能科技有限公司 Voice endpoint detection method and device based on supervised machine learning
CN108172219B (en) * 2017-11-14 2021-02-26 珠海格力电器股份有限公司 Method and device for recognizing voice
CN108198547B (en) * 2018-01-18 2020-10-23 深圳市北科瑞声科技股份有限公司 Voice endpoint detection method and device, computer equipment and storage medium
CN108257607B (en) * 2018-01-24 2021-05-18 成都创信特电子技术有限公司 Multi-channel voice signal processing method
CN108281154B (en) * 2018-01-24 2021-05-18 成都创信特电子技术有限公司 Noise reduction method for voice signal
CN108133711B (en) * 2018-01-24 2021-05-18 成都创信特电子技术有限公司 Digital signal monitoring device with noise reduction module
CN108962283B (en) * 2018-01-29 2020-11-06 北京猎户星空科技有限公司 Method and device for determining question end mute time and electronic equipment
CN108492347B (en) * 2018-04-11 2022-02-15 广东数相智能科技有限公司 Image generation method, device and computer readable storage medium
CN109410920B (en) * 2018-10-15 2020-08-18 百度在线网络技术(北京)有限公司 Method and device for acquiring information
CN111199741A (en) * 2018-11-20 2020-05-26 阿里巴巴集团控股有限公司 Voiceprint identification method, voiceprint verification method, voiceprint identification device, computing device and medium
CN112001431B (en) * 2020-08-11 2022-06-28 天津大学 Efficient image classification method based on comb convolution
CN112802489A (en) * 2021-04-09 2021-05-14 广州健抿科技有限公司 Automatic call voice adjusting system and method
CN113192507B (en) * 2021-05-13 2022-04-29 北京泽桥传媒科技股份有限公司 Information retrieval method and system based on voice recognition

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1758331A (en) * 2005-10-31 2006-04-12 浙江大学 Quick audio-frequency separating method based on tonic frequency
CN102148030A (en) * 2011-03-23 2011-08-10 同济大学 Endpoint detecting method for voice recognition
CN102800322A (en) * 2011-05-27 2012-11-28 中国科学院声学研究所 Method for estimating noise power spectrum and voice activity
CN102971789A (en) * 2010-12-24 2013-03-13 华为技术有限公司 A method and an apparatus for performing a voice activity detection

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5991718A (en) * 1998-02-27 1999-11-23 At&T Corp. System and method for noise threshold adaptation for voice activity detection in nonstationary noise environments

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1758331A (en) * 2005-10-31 2006-04-12 浙江大学 Quick audio-frequency separating method based on tonic frequency
CN102971789A (en) * 2010-12-24 2013-03-13 华为技术有限公司 A method and an apparatus for performing a voice activity detection
CN102148030A (en) * 2011-03-23 2011-08-10 同济大学 Endpoint detecting method for voice recognition
CN102800322A (en) * 2011-05-27 2012-11-28 中国科学院声学研究所 Method for estimating noise power spectrum and voice activity

Also Published As

Publication number Publication date
CN103489454A (en) 2014-01-01

Similar Documents

Publication Publication Date Title
CN103489454B (en) Voice endpoint detection method based on waveform morphological feature clustering
Pandey et al. Densely connected neural network with dilated convolutions for real-time speech enhancement in the time domain
Xiao et al. Normalization of the speech modulation spectra for robust speech recognition
CN111292762A (en) Single-channel voice separation method based on deep learning
CN104078039A (en) Voice recognition system of domestic service robot on basis of hidden Markov model
Baby et al. Coupled dictionaries for exemplar-based speech enhancement and automatic speech recognition
CN106328123B (en) Method for recognizing middle ear voice in normal voice stream under condition of small database
Dua et al. Performance evaluation of Hindi speech recognition system using optimized filterbanks
Paliwal et al. Usefulness of phase in speech processing
Roy et al. DeepLPC: A deep learning approach to augmented Kalman filter-based single-channel speech enhancement
Hao et al. Time-domain neural network approach for speech bandwidth extension
He et al. Stress detection using speech spectrograms and sigma-pi neuron units
Hasan et al. Preprocessing of continuous bengali speech for feature extraction
Zouhir et al. Feature Extraction Method for Improving Speech Recognition in Noisy Environments.
Hagen Robust speech recognition based on multi-stream processing
Chu et al. A noise-robust FFT-based auditory spectrum with application in audio classification
Krishnan et al. Features of wavelet packet decomposition and discrete wavelet transform for malayalam speech recognition
Adam et al. Wavelet cesptral coefficients for isolated speech recognition
Chavan et al. Speech recognition in noisy environment, issues and challenges: A review
Gref et al. Improving robust speech recognition for German oral history interviews using multi-condition training
Gerazov et al. Kernel power flow orientation coefficients for noise-robust speech recognition
Ravindran et al. Improving the noise-robustness of mel-frequency cepstral coefficients for speech processing
Prasanna Kumar et al. Single-channel speech separation using empirical mode decomposition and multi pitch information with estimation of number of speakers
MY An improved feature extraction method for Malay vowel recognition based on spectrum delta
Pour et al. Gammatonegram based speaker identification

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant