CN102760444B - Support vector machine based classification method of base-band time-domain voice-frequency signal - Google Patents


Info

Publication number
CN102760444B
CN102760444B (application CN201210125085.7A)
Authority
CN
China
Prior art keywords
signal
band time
sigma
subsequence
zero
Prior art date
Legal status
Expired - Fee Related
Application number
CN201210125085.7A
Other languages
Chinese (zh)
Other versions
CN102760444A (en)
Inventor
刘一民
李元新
孟华东
Current Assignee
Tsinghua University
Original Assignee
Tsinghua University
Priority date
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN201210125085.7A priority Critical patent/CN102760444B/en
Publication of CN102760444A publication Critical patent/CN102760444A/en
Application granted granted Critical
Publication of CN102760444B publication Critical patent/CN102760444B/en
Status: Expired - Fee Related

Landscapes

  • Compression, Expansion, Code Conversion, And Decoders (AREA)
  • Complex Calculations (AREA)

Abstract

The invention relates to a support vector machine (SVM) based classification method for baseband time-domain audio signals, comprising the following steps: first, segment the baseband time-domain audio signal sequence to obtain initial subsequences; subtract from each initial subsequence its own mean to obtain zero-mean subsequences; apply a window to each zero-mean subsequence, take the Fourier transform of each windowed result to obtain its spectrum magnitude, and compute the standard deviation of each spectrum magnitude, averaging them to obtain one feature; concatenate the zero-mean subsequences in order into one long sequence; compute the normalized autocorrelation matrix of the long sequence and perform a singular value decomposition on it to obtain the boundary point between the signal and noise subspaces; from this compute the other feature, a signal-to-noise ratio (SNR) parameter; finally, feed the input vector formed from the two features into a trained SVM classifier to identify the class of the baseband time-domain audio signal and distinguish speech signals from noise signals.

Description

Classification method for baseband time-domain audio signals based on a support vector machine
Technical field
The invention belongs to the field of signal processing technology, and specifically relates to a support vector machine (SVM) based classification method for baseband time-domain audio signals.
Background technology
The present invention is applied in radio detection systems. The processed signal is the demodulated baseband time-domain audio signal, which may be a speech signal polluted by noise to varying degrees, or a pure noise signal; the noise is predominantly white noise mixed with a small amount of colored noise. Using the principles of the SVM, a classifier is built that performs simple and effective discrimination of the signal type.
The following articles and patent documents substantially cover the main background technology in this field. To trace the evolution of the technology, they are arranged in chronological order, and the main contribution of each is introduced in turn.
1.S.Gokhun?Tanyer,Hamza?ozer,“Voice?Activity?Detection?in?Nonstationary?Gaussian?Noise”,Proceedings?of?ICSP,1620-1623,1998.
Voice activity detection (VAD) refers to the process of separating speech from noise. The article proposes the energy-threshold method, the zero-crossing-rate method, least-squares period estimation, and an adaptive energy threshold. The energy-threshold and zero-crossing-rate methods are mostly applicable when the signal-to-noise ratio (SNR) is relatively high; when the SNR is low their false-alarm rate is very high, and least-squares period estimation can fail when non-stationary noise contains periodic components. The article also proposes strategies that integrate several of these methods for speech signal detection.
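The energy-threshold and zero-crossing-rate detectors surveyed in the article can be sketched as follows. This is a minimal illustration only; the frame length and both threshold values are our assumptions, not values from the cited paper:

```python
import numpy as np

def frame_energy_and_zcr(frame):
    """Per-frame energy and zero-crossing rate, the two classic VAD features."""
    frame = frame.astype(float)
    energy = np.sum(frame ** 2) / len(frame)
    # Zero-crossing rate: fraction of adjacent-sample sign changes.
    zcr = np.mean(np.abs(np.diff(np.signbit(frame).astype(int))))
    return energy, zcr

def simple_vad(signal, frame_len=160, energy_thr=0.01, zcr_thr=0.25):
    """Flag a frame as speech when its energy is high and its ZCR is low
    (the combination strategy the article discusses, in simplified form)."""
    flags = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        e, z = frame_energy_and_zcr(signal[start:start + frame_len])
        flags.append(e > energy_thr and z < zcr_thr)
    return flags
```

As the article notes, fixed thresholds like these break down at low SNR, which motivates the adaptive-threshold and feature-combination approaches that follow.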
2.C.J.C.Burges,“A?Tutorial?on?Support?Vector?Machines?for?Pattern?Recognition”,Data?Mining?and?Knowledge?Discovery,vol.2,no.2,pp.121-167,1998.
This article derives the basic principles and conclusions of the SVM in detail. The SVM method grew out of the optimal separating hyperplane in the linearly separable case; its basic idea can be summarized as transforming the input space into a higher-dimensional space through a nonlinear transformation, then finding the optimal separating hyperplane in this new space. "Maximum margin" and "projecting data into a higher-dimensional space" are its key concepts, and in the ordinary sense the SVM forms a two-class pattern classifier. However, the article is mostly devoted to deriving and proving the basic principles of the SVM, and gives no hints or guidance on applying it to speech signal detection.
3.S.Gokhun?Tanyer,Hamza?ozer,“Voice?Activity?Detection?in?Nonstationary?Noise”,IEEE?Trans.Speech?Audio?Process.,vol.8,no.4,pp.478-481,Jul.2001
This paper proposes a voice-activity endpoint detection method with an adaptive energy threshold and gives an implementation strategy, in which a geometric method is applied to estimate the signal SNR, reducing the dependence on prior information about the noise. However, this SNR estimation method is affected by the signal's cumulative distribution, cannot fully learn the noise characteristics, makes parameter selection and adjustment rather difficult, and yields a biased SNR estimate when the noise is non-stationary.
4.Quanwei?Cai,Ping?Wei,Xianci?Xiao,“A?Digital?Modulation?Recognition?Method”,Proceedings?of?ICASSP,2004,pp?863–866
This paper proposes a principle and method for signal SNR estimation based on the singular value decomposition (SVD). The method is simple, but its performance is not investigated and no method for choosing the computation parameters is given.
5.Cheol-Sun?Park,Won?Jang,Sun-Phil?Nah.and?Dae?Young?Kim,“Automatic?Modulation?Recognition?using?Support?Vector?Machine?in?Software?Radio?Applications”,in?Proc.9th?IEEE?ICACT,Feb.2007,pp.9-12
This paper proposes an SVM-based method for recognizing the signal modulation mode. It uses as features the maximum power spectral density of the normalized centered instantaneous amplitude γ_max, the standard deviation of the absolute value of the centered nonlinear component of the instantaneous phase over the strong-signal segments σ_ap, the standard deviation of the centered nonlinear component of the instantaneous phase over the strong-signal segments σ_dp, the standard deviation of the absolute value of the normalized centered instantaneous amplitude σ_aa, and the standard deviation of the absolute value of the normalized instantaneous frequency over the strong-signal segments σ_af. Feeding these features to the classifier yields accurate classification results even at low signal SNR.
Summary of the invention
To overcome the above deficiencies of the prior art, the object of the present invention is to provide an SVM-based classification method for baseband time-domain audio signals: the signal is processed, features are extracted as the classifier input, and a discrimination result for the signal type is obtained, thereby separating speech signals from noise signals.
To achieve these goals, the technical solution used in the present invention is:
The SVM-based classification method for baseband time-domain audio signals comprises the following steps:
Step 1: Divide the baseband time-domain audio signal sequence s = {s(1), s(2), …, s(N)} of total length N into K segments of length L each, obtaining the initial subsequences s_i = {s_i(1), s_i(2), …, s_i(L)}, i = 1, 2, …, K, where s_i(m) = s((i−1)L + m), m = 1, 2, …, L. Then subtract from each initial subsequence its own mean to obtain the zero-mean subsequences x_i = {x_i(1), x_i(2), …, x_i(L)}, where x_i(m) = s_i(m) − (1/L) Σ_{j=1}^{L} s_i(j);
Step 2: Apply a window to each zero-mean subsequence, obtaining x_i′, where x_i′(m) = x_i(m) w(m) and w is a Hanning window;
Step 3: Take the Fourier transform of each windowed result, obtaining the spectrum magnitude sequences f_i = |FFT(x_i′)| = {f_i(1), f_i(2), …, f_i(M)}, i = 1, 2, …, K, where M is the length of the spectrum magnitude sequence;
Step 4: Compute the standard deviation of each spectrum magnitude, d = {d(1), d(2), …, d(K)}, where d(i) = sqrt( (1/(M−1)) Σ_{j=1}^{M} ( f_i(j) − (1/M) Σ_{l=1}^{M} f_i(l) )² ). Then take the mean of all the standard deviations to obtain one feature of the baseband time-domain audio signal sequence, the spectrum magnitude standard deviation D = (1/K) Σ_{i=1}^{K} d(i);
Step 5: Concatenate the zero-mean subsequences x_1, x_2, …, x_K in order into one long sequence x = {x_1, x_2, …, x_K} = {x(1), x(2), …, x(N)}, then compute its normalized autocorrelation matrix R, the Q × Q symmetric Toeplitz matrix with entries R(i, j) = r_{|i−j|}, where r_i = (1/Σ_{j=1}^{N} x(j)²) Σ_{l=1}^{N−i} x(l) x(l+i) and Q is the dimension of the autocorrelation matrix, with value range [50, 90];
Step 6: Perform a singular value decomposition of the autocorrelation matrix R, obtaining R = V Λ V^H, where Λ = diag(λ_1, λ_2, …, λ_Q)_{Q×Q} = diag(γ_1 + σ², …, γ_p + σ², σ², …, σ²)_{Q×Q} with γ_1 ≥ γ_2 ≥ … ≥ γ_p, thereby obtaining the subspace boundary point p;
Step 7: Compute the other feature of the baseband time-domain audio signal sequence, designated the signal-to-noise ratio parameter SNR̂, according to σ̂² = (1/(Q−p)) Σ_{i=p+1}^{Q} λ_i and SNR̂ = 10 log( (Σ_{i=1}^{p} λ_i − p·σ̂²) / (Q·σ̂²) );
Step 8: Form an input vector from the two features of the baseband time-domain audio signal sequence, the spectrum magnitude standard deviation D and the signal-to-noise ratio parameter SNR̂, and send it into the trained SVM classifier, thereby identifying the class of the baseband time-domain audio signal and distinguishing speech signals from noise signals.
The above subspace boundary point p can be obtained as follows: compute the average E_λ = (1/(T+1)) Σ_{i=Q−T}^{Q} λ_i of the last T+1 eigenvalues λ_{Q−T}, λ_{Q−T+1}, …, λ_Q, where T is obtained by rounding down from the autocorrelation matrix dimension Q; then take p as the index of the largest eigenvalue among those greater than 1.5E_λ, i.e. p = {i | λ_i > 1.5E_λ, λ_{i+1} < 1.5E_λ}.
In the above division of the baseband time-domain audio signal sequence s = {s(1), s(2), …, s(N)} of total length N into K segments, the time span of each segment should be no more than 20 ms.
Compared with the prior art, the present invention obtains prior information about the signals to be classified through training, and by choosing suitable input features it obtains classification results quickly and effectively. To capture the difference between speech and noise, the signal SNR parameter and the spectrum magnitude standard deviation are selected as the classifier's input features; they are easy to compute and realize good discrimination and classification of the signals. The invention effectively detects and identifies speech and noise signals: the two chosen input features are simple to calculate yet clearly reflect the difference between the two signal types, and a high classification accuracy is maintained even at low SNR. The invention is suitable for real-time signal processing, is easy to implement, and performs well in radio applications.
Brief description of the drawings
Fig. 1 is the flow chart of the present invention.
Fig. 2 is the probability density distribution of the input feature when it is the signal-to-noise ratio parameter.
Fig. 3 is the probability density distribution of the input feature when it is the spectrum magnitude standard deviation.
Fig. 4 is a schematic diagram of the SVM classifier's working result.
Embodiment
Below, the present invention is described in further detail in conjunction with the drawings and embodiments.
The present invention designs a classifier based on the SVM principle: features are extracted from the processed baseband time-domain audio signal sequence and fed as input to the trained classifier, thereby identifying the type of the audio signal and correctly classifying speech and noise signals.
As shown in Figure 1, performing step is as follows:
Step 1: Since the sequence to be processed is a demodulated baseband time-domain audio signal sequence, the signal should first be pre-processed so that features fully reflecting its characteristics can be extracted.
Divide the baseband time-domain audio signal sequence s = {s(1), s(2), …, s(N)} of total length N evenly into K segments of length L each; the time span of each segment should be no more than 20 ms.
This yields the initial subsequences s_i = {s_i(1), s_i(2), …, s_i(L)}, i = 1, 2, …, K, where s_i(m) = s((i−1)L + m), m = 1, 2, …, L. Then subtract from each initial subsequence its own mean to remove the DC component, obtaining the zero-mean subsequences x_i = {x_i(1), x_i(2), …, x_i(L)}, where x_i(m) = s_i(m) − (1/L) Σ_{j=1}^{L} s_i(j).
Step 2: To reduce the influence of side lobes on the result when the subsequences are processed in the frequency domain, a Hanning window is chosen for windowing each zero-mean subsequence. The windowed results are x_i′, where x_i′(m) = x_i(m) w(m) and w is the Hanning window sequence.
Step 3: Take the Fourier transform of each windowed result, obtaining the spectrum magnitude sequences f_i = |FFT(x_i′)| = {f_i(1), f_i(2), …, f_i(M)}, where the FFT size should be a power of two, 2^a, greater than 2 to 4 times the subsequence length, and M is the length of the spectrum magnitude sequence.
Step 4: Using the unbiased estimate of the standard deviation, d(i) = sqrt( (1/(M−1)) Σ_{j=1}^{M} ( f_i(j) − (1/M) Σ_{l=1}^{M} f_i(l) )² ), compute the standard deviation d = {d(1), d(2), …, d(K)} of the spectrum magnitude of each subsequence, then take the mean of all the standard deviations to obtain one feature of the time-domain audio signal sequence, the spectrum magnitude standard deviation D = (1/K) Σ_{i=1}^{K} d(i).
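Steps 1 through 4 can be sketched in Python as follows: segmentation, per-segment mean removal, Hanning windowing, FFT magnitude, and the averaged spectral standard deviation D. The function name and the exact power-of-two FFT-size choice are our assumptions; the patent only requires the FFT size to be a power of two larger than 2 to 4 times the subsequence length:

```python
import numpy as np

def spectral_std_feature(s, K, nfft=None):
    """Sketch of Steps 1-4: returns the spectrum-magnitude std feature D."""
    L = len(s) // K
    segs = np.reshape(np.asarray(s, dtype=float)[:K * L], (K, L))
    x = segs - segs.mean(axis=1, keepdims=True)    # zero-mean subsequences
    w = np.hanning(L)                              # Hanning window
    if nfft is None:
        # FFT size: a power of two at least 2x the subsequence length
        nfft = 1 << int(np.ceil(np.log2(2 * L)))
    f = np.abs(np.fft.fft(x * w, n=nfft, axis=1))  # spectrum magnitudes f_i
    d = f.std(axis=1, ddof=1)                      # unbiased std per segment
    return d.mean()                                # feature D
```

A pure tone concentrates its spectrum in a few bins, giving a large spread of the magnitude values, so its D is larger than that of a weak noise signal of the same length.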
Figure 2 shows the probability density of the signal-to-noise ratio parameter, with the parameter's value range on the horizontal axis and probability density on the vertical axis; Figure 3 shows the probability density of the spectrum magnitude standard deviation in the same way. As can be seen from the figures, the features of the noise signal are relatively concentrated, so a single feature can reflect the difference between speech and noise to some extent but cannot fully separate the two signal classes. Correct classification therefore requires combining both features as the classifier input, so the following steps are carried out.
Step 5: The audio signal sequence is then processed to obtain the other feature. First concatenate the zero-mean subsequences x_1, x_2, …, x_K in order into one long signal sequence x = {x_1, x_2, …, x_K} = {x(1), x(2), …, x(N)}, then compute its normalized autocorrelation matrix R, the Q × Q symmetric Toeplitz matrix with entries R(i, j) = r_{|i−j|}, where r_i = (1/Σ_{j=1}^{N} x(j)²) Σ_{l=1}^{N−i} x(l) x(l+i) and Q is the dimension of the autocorrelation matrix, with value range [50, 90]; in the present invention Q = 70.
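The normalized autocorrelation matrix of Step 5 can be sketched as follows (numpy only; the default Q = 70 follows the value used in this embodiment, and the helper name is ours):

```python
import numpy as np

def normalized_autocorr_matrix(x, Q=70):
    """Step 5 sketch: r_i = (sum_l x(l) x(l+i)) / (sum_j x(j)^2),
    assembled into the Q x Q symmetric Toeplitz matrix R[i, j] = r_|i-j|."""
    x = np.asarray(x, dtype=float)
    power = np.sum(x ** 2)                       # normalization term
    r = np.array([np.dot(x[:len(x) - i], x[i:]) for i in range(Q)]) / power
    idx = np.abs(np.arange(Q)[:, None] - np.arange(Q)[None, :])
    return r[idx]                                # Toeplitz by index trick
```

By construction R(1, 1) = r_0 = 1 and R is symmetric, which is what makes the eigen-analysis of the next step well posed.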
Step 6: Perform an SVD of the autocorrelation matrix R, obtaining R = V Λ V^H. Assuming the speech signal and the noise are mutually independent, R = R_x + R_n = V(Λ_x + Λ_n)V^H = V Λ V^H, where R_x and R_n are the autocorrelation matrices of the speech signal and the noise signal respectively. From the SVD,
Λ_x = diag(γ_1, γ_2, …, γ_p, 0, …, 0)_{Q×Q}, with γ_1 ≥ γ_2 ≥ … ≥ γ_p,
Λ_n = diag(σ², σ², …, σ²)_{Q×Q},
Λ = diag(λ_1, λ_2, …, λ_Q)_{Q×Q} = diag(γ_1 + σ², …, γ_p + σ², σ², …, σ²)_{Q×Q}.
Compute the average E_λ = (1/(T+1)) Σ_{i=Q−T}^{Q} λ_i of the last T+1 eigenvalues λ_{Q−T}, λ_{Q−T+1}, …, λ_Q, where T is obtained by rounding down from the autocorrelation matrix dimension Q; then find the boundary point p as the index of the largest eigenvalue among those greater than 1.5E_λ, i.e. p = {i | λ_i > 1.5E_λ, λ_{i+1} < 1.5E_λ}.
Step 7: Compute the other feature of the baseband time-domain audio signal sequence, the signal-to-noise ratio parameter SNR̂, according to σ̂² = (1/(Q−p)) Σ_{i=p+1}^{Q} λ_i and SNR̂ = 10 log( (Σ_{i=1}^{p} λ_i − p·σ̂²) / (Q·σ̂²) ). This parameter reflects the signal's signal-to-noise condition to a certain extent.
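Steps 6 and 7 can be sketched as follows. Since R is symmetric, its SVD coincides with its eigendecomposition, so `eigvalsh` is used. The `tail_frac` parameter (the fraction of Q whose floor gives T) is our assumption, since the patent specifies only that T is obtained by rounding down from Q; the 1.5 threshold factor follows the text:

```python
import numpy as np

def snr_feature(R, ratio=1.5, tail_frac=1 / 3):
    """Steps 6-7 sketch: boundary point p and SNR estimate from the
    eigenvalues of the normalized autocorrelation matrix R."""
    Q = R.shape[0]
    lam = np.sort(np.linalg.eigvalsh(R))[::-1]   # eigenvalues, descending
    T = int(np.floor(Q * tail_frac))             # assumed rounding-down rule
    E = lam[Q - T - 1:].mean()                   # average of last T+1 values
    above = np.nonzero(lam > ratio * E)[0]
    p = above[-1] + 1 if len(above) else 1       # number of signal eigenvalues
    sigma2 = lam[p:].mean()                      # noise-power estimate
    snr = 10 * np.log10((lam[:p].sum() - p * sigma2) / (Q * sigma2))
    return snr, p
```

With two dominant eigenvalues of 10 and 8 over a noise floor of 1 in a 60-dimensional matrix, the routine finds p = 2 and SNR̂ = 10 log10(16/60) dB, matching the formulas above.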
Step 8: Form an input vector from the two features of the baseband time-domain audio signal sequence, the spectrum magnitude standard deviation D and the signal-to-noise ratio parameter SNR̂, and send it into the trained SVM classifier to obtain the classification result for the baseband time-domain audio signal, distinguishing speech signals from noise signals.
The classifier's working result for this step is shown in Figure 4, where "+" marks the speech-signal features and "*" marks the noise-signal features. The two classes of features are correctly separated in the feature space, confirming that the SVM-based baseband time-domain audio signal classifier effectively discriminates the signal types and classifies them correctly.
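Step 8 presupposes a trained SVM. As a self-contained sketch (the patent does not specify the kernel or training algorithm, so a minimal linear SVM trained by subgradient descent on the hinge loss stands in here; in practice the training rows would be (D, SNR̂) pairs computed by Steps 1-7 from labelled speech and noise recordings):

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=500):
    """Minimal linear SVM: subgradient descent on the regularized hinge loss.
    X is (n, d); labels y are in {-1, +1}."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        viol = margins < 1                       # margin-violating samples
        grad_w = lam * w - (y[viol, None] * X[viol]).sum(axis=0) / n
        grad_b = -y[viol].sum() / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def svm_predict(w, b, X):
    """Sign of the decision function: positive for speech, negative for noise."""
    return np.sign(X @ w + b)
```

Given labelled feature pairs forming two clusters in the (D, SNR̂) plane, the trained hyperplane separates them, reproducing the behaviour illustrated in Figure 4.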

Claims (2)

1. A support vector machine based classification method for baseband time-domain audio signals, characterized by comprising the following steps:
Step 1: dividing the baseband time-domain audio signal sequence s = {s(1), s(2), …, s(N)} of total length N into K segments of length L each, obtaining initial subsequences s_i = {s_i(1), s_i(2), …, s_i(L)}, i = 1, 2, …, K, where s_i(m) = s((i−1)L + m), m = 1, 2, …, L; then subtracting from each initial subsequence its own mean to obtain zero-mean subsequences x_i = {x_i(1), x_i(2), …, x_i(L)}, where x_i(m) = s_i(m) − (1/L) Σ_{j=1}^{L} s_i(j);
Step 2: windowing each zero-mean subsequence, obtaining x_i′, where x_i′(m) = x_i(m) w(m) and w is a Hanning window;
Step 3: taking the Fourier transform of each windowed result, obtaining the spectrum magnitude sequences f_i = |FFT(x_i′)| = {f_i(1), f_i(2), …, f_i(M)}, i = 1, 2, …, K, where M is the length of the spectrum magnitude sequence;
Step 4: computing the standard deviation of each spectrum magnitude, d = {d(1), d(2), …, d(K)}, where d(i) = sqrt( (1/(M−1)) Σ_{j=1}^{M} ( f_i(j) − (1/M) Σ_{l=1}^{M} f_i(l) )² ); then taking the mean of all the standard deviations to obtain one feature of the baseband time-domain audio signal sequence, the spectrum magnitude standard deviation D = (1/K) Σ_{i=1}^{K} d(i);
Step 5: concatenating the zero-mean subsequences x_1, x_2, …, x_K in order into one long sequence x = {x_1, x_2, …, x_K} = {x(1), x(2), …, x(N)}, then computing its normalized autocorrelation matrix R, the Q × Q symmetric Toeplitz matrix with entries R(i, j) = r_{|i−j|}, where r_i = (1/Σ_{j=1}^{N} x(j)²) Σ_{l=1}^{N−i} x(l) x(l+i) and Q is the dimension of the autocorrelation matrix, with value range [50, 90];
Step 6: performing a singular value decomposition of the autocorrelation matrix R, obtaining R = V Λ V^H, where Λ = diag(λ_1, λ_2, …, λ_Q)_{Q×Q} = diag(γ_1 + σ², …, γ_p + σ², σ², …, σ²)_{Q×Q} with γ_1 ≥ γ_2 ≥ … ≥ γ_p; computing the average E_λ = (1/(T+1)) Σ_{i=Q−T}^{Q} λ_i of the last T+1 eigenvalues λ_{Q−T}, λ_{Q−T+1}, …, λ_Q, where T is obtained by rounding down from the autocorrelation matrix dimension Q; then taking p as the index of the largest eigenvalue among those greater than 1.5E_λ, i.e. p = {i | λ_i > 1.5E_λ, λ_{i+1} < 1.5E_λ}, thereby obtaining the subspace boundary point p;
Step 7: computing the other feature of the baseband time-domain audio signal sequence, designated the signal-to-noise ratio parameter SNR̂, according to σ̂² = (1/(Q−p)) Σ_{i=p+1}^{Q} λ_i and SNR̂ = 10 log( (Σ_{i=1}^{p} λ_i − p·σ̂²) / (Q·σ̂²) );
Step 8: forming an input vector from the two features of the baseband time-domain audio signal sequence, the spectrum magnitude standard deviation D and the signal-to-noise ratio parameter SNR̂, and sending it into the trained SVM classifier, thereby identifying the class of the baseband time-domain audio signal and distinguishing speech signals from noise signals.
2. The signal classification method according to claim 1, characterized in that, in said step 1, the sequence is divided into K segments and the time span of each segment is no more than 20 ms.
CN201210125085.7A 2012-04-25 2012-04-25 Support vector machine based classification method of base-band time-domain voice-frequency signal Expired - Fee Related CN102760444B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210125085.7A CN102760444B (en) 2012-04-25 2012-04-25 Support vector machine based classification method of base-band time-domain voice-frequency signal

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210125085.7A CN102760444B (en) 2012-04-25 2012-04-25 Support vector machine based classification method of base-band time-domain voice-frequency signal

Publications (2)

Publication Number Publication Date
CN102760444A CN102760444A (en) 2012-10-31
CN102760444B true CN102760444B (en) 2014-06-11

Family

ID=47054885

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210125085.7A Expired - Fee Related CN102760444B (en) 2012-04-25 2012-04-25 Support vector machine based classification method of base-band time-domain voice-frequency signal

Country Status (1)

Country Link
CN (1) CN102760444B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104240697A (en) * 2013-06-24 2014-12-24 浙江大华技术股份有限公司 Audio data feature extraction method and device
CN104732970B (en) * 2013-12-20 2018-12-04 中国科学院声学研究所 A kind of ship-radiated noise recognition methods based on comprehensive characteristics
CN104751856B (en) * 2013-12-31 2017-12-22 中国移动通信集团公司 A kind of speech sentences recognition methods and device
CN104409073A (en) * 2014-11-04 2015-03-11 贵阳供电局 Substation equipment sound and voice identification method
CN105743756B (en) * 2016-01-20 2019-03-12 中科威发半导体(苏州)有限公司 Frame detection method based on adaboost algorithm in WiFi system
CN105976822B (en) * 2016-07-12 2019-12-03 西北工业大学 Audio signal extracting method and device based on parametrization supergain beamforming device
CN106789764B (en) * 2016-11-18 2019-07-16 杭州电子科技大学 Joint Weighted Threshold denoises and the transform domain quadratic estimate method of balanced judgement
CN107682109B (en) * 2017-10-11 2019-07-30 北京航空航天大学 A kind of interference signal classifying identification method suitable for UAV Communication system
CN108877783B (en) * 2018-07-05 2021-08-31 腾讯音乐娱乐科技(深圳)有限公司 Method and apparatus for determining audio type of audio data
CN109448389B (en) * 2018-11-23 2021-09-10 西安联丰迅声信息科技有限责任公司 Intelligent detection method for automobile whistling
CN112466322B (en) * 2020-11-27 2023-06-20 华侨大学 Noise signal feature extraction method for electromechanical equipment
CN113759208A (en) * 2021-06-02 2021-12-07 青岛鼎信通讯股份有限公司 Abnormal waveform identification method based on fault indicator
CN117150224B (en) * 2023-10-30 2024-01-26 宜兴启明星物联技术有限公司 User behavior data storage analysis method based on Internet of things

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7505902B2 (en) * 2004-07-28 2009-03-17 University Of Maryland Discrimination of components of audio signals based on multiscale spectro-temporal modulations
CN101529929A (en) * 2006-09-05 2009-09-09 Gn瑞声达A/S A hearing aid with histogram based sound environment classification
JP2011034342A (en) * 2009-07-31 2011-02-17 Fujifilm Corp Image processing apparatus and method, data processing device and method, and program

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101136199B (en) * 2006-08-30 2011-09-07 纽昂斯通讯公司 Voice data processing method and equipment
JP4488091B2 (en) * 2008-06-24 2010-06-23 ソニー株式会社 Electronic device, video content editing method and program


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Automatic Modulation Recognition of Digital Signals using Wavelet Features and SVM;Jun-Ho Choi et al;《International Conference on Advanced Communication Technology, 2008. ICACT 2008. 10th》;20080220;387-390 *
Automatic Modulation Recognition using Support Vector Machine in Software Radio Applications;Cheol-Sun Park et al;《The 9th International Conference on Advanced Communication Technology》;20070214;9-12 *

Also Published As

Publication number Publication date
CN102760444A (en) 2012-10-31


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20140611

Termination date: 20210425