WO2008067719A1 - Sound activity detecting method and sound activity detecting device - Google Patents

Sound activity detecting method and sound activity detecting device Download PDF

Info

Publication number
WO2008067719A1
WO2008067719A1 · PCT/CN2007/003364
Authority
WO
WIPO (PCT)
Prior art keywords
current signal
frame
noise
parameter
signal frame
Prior art date
Application number
PCT/CN2007/003364
Other languages
French (fr)
Chinese (zh)
Inventor
Qin Yan
Haojiang Deng
Jun Wang
Xuewen Zeng
Jun Zhang
Libin Zhang
Original Assignee
Huawei Technologies Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd. filed Critical Huawei Technologies Co., Ltd.
Publication of WO2008067719A1 publication Critical patent/WO2008067719A1/en

Links

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78: Detection of presence or absence of voice signals

Definitions

  • the present invention relates to speech signal processing techniques, and more particularly to a voice activity detection method and a voice activity detector. Background technique
  • VAD Voice Activity Detection
  • When applied in speech recognition technology, it is commonly referred to as Speech Endpoint Detection, and when used in speech enhancement technology, it is commonly referred to as Speech Pause Detection.
  • Voice activity detection technology is primarily developed for speech signals input into the encoder.
  • In speech coding technology, the audio signals input into the encoder are divided into two types, background noise and active speech, which are then encoded at different rates: the background noise at a lower rate and the active speech at a higher rate. This reduces the average bit rate of communication and has promoted the development of variable-rate speech coding technology.
  • The signals input to the encoder are diversifying; they are no longer limited to speech but also include music and various noises. Therefore, before encoding the input signal, it is necessary to differentiate the different input signals so that different code rates, and even encoders with different core coding algorithms, can be used to encode them.
  • The first prior art related to the present invention is the Adaptive Multi-Rate Wideband (AMR-WB+) codec, a multi-rate wideband coding standard developed by the 3rd Generation Partnership Project (3GPP) for the third-generation mobile communication system. It has two core coding algorithms: Algebraic Code Excited Linear Prediction (ACELP) and Transform Coded Excitation (TCX).
  • ACELP Algebraic Code Excited Linear Prediction
  • TCX Transform Coded Excited
  • the ACELP mode is suitable for speech signal coding.
  • The TCX mode is suitable for wideband signals containing music, so the choice between the two modes can be regarded as a choice between speech and music.
  • The mode selection methods for ACELP and TCX in the coding algorithm are of two kinds: open-loop and closed-loop.
  • Closed-loop selection is a traversal-search method based on the perceptually weighted SNR and is independent of the VAD module.
  • Open-loop selection builds on the VAD module of the AMR-WB+ encoding algorithm: short-term and long-term statistics of the feature parameters are added, improvements are made for non-speech features, and the classification of speech and music can be realized to a certain extent. When the ACELP mode has been selected consecutively fewer than three times, a small-scale traversal search is still performed, and since the feature parameters used in the classification are obtained by the coding algorithm, the method is very closely coupled with the AMR-WB+ coding algorithm.
  • The second prior art related to the present invention is the Selectable Mode Vocoder (SMV), a multi-rate voice coding standard developed by the 3rd Generation Partnership Project 2 (3GPP2) for the CDMA2000 system. It offers four encoding rates, 9.6, 4.8, 2.4, and 1.2 kbps (actual net rates of 8.55, 4.0, 2.0, and 0.8 kbps), giving mobile operators a flexible choice between system capacity and voice quality. The algorithm contains a music detection module.
  • SMV multi-rate mode voice coding standard
  • The module calculates the parameters required for music detection from some of the parameters already computed by the VAD module, executes after the VAD detection, and supplements the VAD module's output decision with the calculated music-detection parameters to output a music/non-music classification; it is therefore very closely tied to the coding algorithm.
  • The prior art detects music signals based on the VAD technology in existing speech coding standards and is therefore closely tied to the encoding algorithm; the coupling with the encoder itself is too large, the independence, generality, and maintainability are generally poor, and the cost of porting between codecs is high.
  • Existing VAD algorithms were developed for speech signals, so the input audio signals are divided into only two types, noise and speech (non-noise); even where detection of music signals is included, it is only an amendment and supplement to the VAD decision. Therefore, as codec application scenarios gradually transition from processing speech to processing multimedia audio (including music), and codecs themselves extend from narrowband to wideband, the simple output categories of existing VAD algorithms are clearly insufficient to describe the wide variety of audio signal characteristics. Summary of the invention
  • Embodiments of the present invention provide a voice activity detecting method and a voice activity detector that are capable of independently extracting feature parameters of a signal from an encoding algorithm and using the extracted feature parameters to determine a sound category to which the input signal frame belongs.
  • An embodiment of the present invention provides a sound activity detecting method, including: extracting feature parameters of a current signal frame when sound activity detection is required; and determining, according to the feature parameters and set parameter thresholds, the sound category to which the current signal frame belongs.
  • An embodiment of the present invention also provides a sound activity detector, including:
  • a feature parameter extraction module configured to extract feature parameters in a current signal frame when sound activity detection is required
  • a signal class determining module configured to determine, according to the feature parameter and the set parameter threshold, a sound category to which the current signal frame belongs.
  • When sound activity detection is required, the embodiment of the present invention extracts the feature parameters used in determining the sound category to which the input signal frame belongs; the detection therefore does not depend on a specific coding algorithm, is performed independently, and is easy to maintain and update.
  • Figure 1 is a structural view of a first embodiment of the present invention
  • FIG. 2 is a schematic diagram of the operation of the signal pre-processing module in the first embodiment of the present invention
  • FIG. 3 is a working principle diagram of the first signal class determining sub-module in the first embodiment provided by the present invention
  • FIG. 5 is a schematic diagram showing the operation of the second signal class determining sub-module in determining the uncertain signal in the first embodiment provided by the present invention. Detailed description
  • Embodiments of the present invention first extract characteristic parameters of various audio signals based on characteristics of the signal frames, and then perform a primary classification of the input narrowband or wideband audio digital signal frames according to those parameters, dividing the input signals into non-noise signal frames (i.e., useful signals, including speech and music) and noise or mute signal frames.
  • The signal frames judged to be non-noise are then further classified into voiced, unvoiced, and music signal frames.
  • The first embodiment of the present invention provides a general sound activity detector (GSAD); its structure, shown in FIG. 1, includes a signal preprocessing module, a feature parameter extraction module, and a signal class determination module.
  • the signal class determination module includes a first signal class determination sub-module and a second signal class determination sub-module.
  • the input signal frame enters the signal preprocessing module, and the input digital sound signal sequence is subjected to frequency pre-emphasis and fast Fourier transform (FFT) in the module to prepare for the next feature parameter extraction.
  • FFT fast Fourier transform
  • After the signal is processed by the signal preprocessing module, it is input to the feature parameter extraction module to obtain the feature parameters.
  • To reduce the complexity of the system, all characteristic parameters of the GSAD are extracted from the FFT spectrum.
  • In addition, noise parameters are extracted and updated to calculate the signal-to-noise ratio of the signal, which controls the update of some decision thresholds.
  • The first signal class determining sub-module performs a primary classification of the signal frames supplied by the signal pre-processing module according to the extracted feature parameters, dividing the input signal into non-noise signals (i.e., useful signals, including voice and music) and noise or mute signals. The second signal class determining sub-module then further divides the signals judged non-noise into voiced, unvoiced, and music signals. The final classification results are thus obtained through two levels of classification: noise, mute, voiced, unvoiced, and music.
  • the working principle of the signal preprocessing module is shown in Fig. 2.
  • the input signal is sequentially subjected to framing, pre-emphasis, windowing, FFT transformation and the like.
  • the input digital audio signal sequence is framed.
  • The processed frame length is 10 ms, and the frame shift is also 10 ms, i.e., there is no overlap between frames. If a processing system downstream of this embodiment, for example an encoder, uses a frame length that is a multiple of 10 ms, its input can be processed as a sequence of 10 ms sound frames.
  • Pre-emphasis: assuming the sound sample value at time n is x(n), the sample value x_p(n) obtained after pre-emphasis is given by Equation [1]:
  x_p(n) = x(n) - α · x(n-1)    Equation [1]
  where α (0.9 < α < 1.0) is the pre-emphasis factor.
  • Windowing: a Hamming window of length N is applied, where N takes different values for different sampling frequencies; when the sampling frequency is 8 kHz or 16 kHz, N is 80 or 160, respectively.
  • FFT transform: after Hamming windowing, a standard FFT is performed. At 8 kHz and 16 kHz sampling rates the FFT length is 256, and frames shorter than this are zero-padded; in other cases the length is changed as appropriate.
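The pre-processing chain just described (pre-emphasis per Equation [1], Hamming windowing, zero-padded FFT) can be sketched as follows. This is a minimal illustration, not the patented implementation; the function name, the α value of 0.97, and the use of the magnitude spectrum are assumptions.

```python
import numpy as np

def preprocess_frame(x, alpha=0.97, nfft=256):
    """Pre-emphasis, Hamming window and zero-padded FFT for one 10 ms frame.
    alpha is assumed to be 0.97 (the text only requires 0.9 < alpha < 1.0)."""
    x = np.asarray(x, dtype=float)
    # Pre-emphasis: xp(n) = x(n) - alpha * x(n-1)   (Equation [1])
    xp = np.empty_like(x)
    xp[0] = x[0]
    xp[1:] = x[1:] - alpha * x[:-1]
    # Hamming window of length N (80 at 8 kHz, 160 at 16 kHz)
    xw = xp * np.hamming(len(xp))
    # Zero-pad to the FFT length and keep the magnitude spectrum
    return np.abs(np.fft.rfft(xw, n=nfft))

frame = np.random.randn(160)      # one 10 ms frame at a 16 kHz sampling rate
spec = preprocess_frame(frame)
print(spec.shape)                 # (129,) bins, i.e. nfft // 2 + 1
```

All later feature parameters are computed from magnitude or power spectra of this form.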
  • the main function of the feature parameter extraction module is to extract the characteristic parameters of the input signal, mainly the spectral parameters.
  • the spectral parameters include: short-term feature parameters and their class length characteristics.
  • The short-term characteristic parameters include: spectral flux, 95% spectral rolloff, zero-crossing rate (zcr), intra-frame spectral variance, and the ratio of low-frequency sub-band energy to full-band energy;
  • The long-term features are the variance and moving average of each short-term feature parameter; the number of statistical frames is 10 in one embodiment of the present invention, i.e., a duration of 100 ms.
  • x(i) represents the i-th time-domain sample of a frame of the sound signal, where 0 ≤ i < M; T represents the number of frames; M represents the number of samples per frame; N represents the FFT length; U_pw(k) represents the amplitude of the current frame at frequency bin k after the FFT; var represents the variance of a characteristic parameter of the current signal frame.
  • var_flux: the variance of the spectral flux (var_flux) is calculated as shown in Equation [5]:
  var_flux(i) = (1/10) · Σ_{j=i-10..i} ( flux(j) - mean_flux(i) )²    Equation [5]
  where mean_flux(i) represents the mean value of the normalized spectral flux parameter from frame i-10 to frame i.
  • Rolloff represents the position of the frequency at which the energy accumulated from the low frequency to the high frequency accounts for 95% of the total energy.
  • rolloff_var: the variance of the 95% spectral rolloff (rolloff_var) is calculated as in Equation [7]:
  rolloff_var(i) = (1/10) · Σ_{j=i-10..i} ( rolloff(j) - mean_rolloff(i) )²    Equation [7]
  where mean_rolloff(i) represents the mean value of the 95% rolloff parameter from frame i-10 to frame i.
  • Rl_F1 represents the lower limit of the low-frequency sub-band; Rl_F2 represents the upper limit of the low-frequency sub-band.
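Under the definitions above, the short-term parameters (spectral flux, 95% rolloff, zero-crossing rate, intra-frame spectral variance, low-band energy ratio) and the 10-frame long-term statistics can be sketched as below. The low-band upper limit of 1000 Hz, the use of bin 0 as the lower limit, and all normalizations are assumptions; the patent does not fix Rl_F1/Rl_F2 here.

```python
import numpy as np

def short_term_features(spec, prev_spec, frame, sr=16000, low_band_hz=1000.0):
    """Illustrative short-term parameters; low_band_hz (the sub-band upper
    limit Rl_F2, with Rl_F1 taken as bin 0) is an assumed value."""
    eps = 1e-12
    p = spec / (spec.sum() + eps)                     # normalized spectra
    q = prev_spec / (prev_spec.sum() + eps)
    flux = float(np.sum((p - q) ** 2))                # spectral flux
    csum = np.cumsum(spec ** 2)
    rolloff = int(np.searchsorted(csum, 0.95 * csum[-1]))      # 95% rolloff bin
    zcr = float(np.mean(np.abs(np.diff(np.sign(frame))) > 0))  # zero-crossing rate
    var_in = float(np.var(spec))                      # intra-frame spectral variance
    low = int(low_band_hz * len(spec) / (sr / 2))
    low_ratio = float((spec[:low] ** 2).sum() / (csum[-1] + eps))
    return flux, rolloff, zcr, var_in, low_ratio

def long_term(history):
    """Variance and moving average of a short-term parameter over 10 frames."""
    h = np.asarray(history[-10:], dtype=float)
    return float(np.var(h)), float(np.mean(h))

rng = np.random.default_rng(0)
frame = rng.standard_normal(160)
spec = np.abs(np.fft.rfft(frame, 256))
prev_spec = np.abs(np.fft.rfft(rng.standard_normal(160), 256))
flux, rolloff, zcr, var_in, low_ratio = short_term_features(spec, prev_spec, frame)
```

The long-term variance and moving average would be maintained per feature over a sliding 100 ms window.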
  • The feature parameters are extracted by a separate module rather than during the encoding algorithm, so the feature parameter extraction module does not depend on any existing encoder. Moreover, since the feature parameter extraction does not depend on the bandwidth, the GSAD does not depend on the signal sampling rate, which greatly enhances the portability of the system.
  • The function of the first signal class decision sub-module is to classify the input digital sound signals into three categories: mute, noise signals, and non-noise signals (i.e., useful signals). This is mainly accomplished through noise parameter initialization, noise determination, and noise update. Before initializing the noise parameters, the duration requirement of the initialization process is adjusted according to the current environment (speech/music): it is shortened when the current environment is speech and extended when the current environment is music.
  • the working principle of the first signal class determination sub-module is shown in Figure 3:
  • The current signal frame is strictly determined according to its characteristic parameters and the noise parameter thresholds: the characteristic parameters of the current signal frame are compared with the noise parameter thresholds; if the comparison result falls in the noise category, the strict decision is that the current signal frame is a noise frame; otherwise, the strict decision is that the current frame is a non-noise frame (i.e., a useful signal).
  • For example, the characteristic parameter compared with the noise parameter threshold may be the variance of the spectral amplitude of the current signal frame (magvar).
  • If the variance of the spectral amplitude of the current signal frame is smaller than the noise parameter threshold, the strict decision result is that the current signal frame is a noise frame; otherwise, the strict decision result is that the current frame is a non-noise frame (i.e., a useful signal).
  • The SNR is used to adjust the thresholds of the various characteristic parameters for mute, noise, unvoiced, voiced, and music.
  • PosteriorSNR = (1/K) · Σ_{k=1..K} U_pw(k) / σ_n²(k)    Equation [11]
  where σ_n²(k) represents the variance of the noise in sub-band k and K is the number of sub-bands.
  • The purpose of adaptively adjusting and updating the feature parameter thresholds is to enable the decision process to reach the same decision result under different SNR conditions. For the same signal, under different signal-to-noise ratios (reflected by the Posterior SNR), the values of the same characteristic parameters differ; that is, the values of the characteristic parameters are affected by the signal-to-noise ratio. Therefore, to reach the same decision result under different signal-to-noise ratios, the decision threshold of each feature parameter is adaptively updated according to the signal-to-noise ratio of the current signal frame, the specific update depending on how the corresponding feature parameter is actually affected by the signal-to-noise ratio.
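A hedged sketch of the Posterior SNR of Equation [11] and the SNR-driven threshold adaptation: the exact formulas are garbled in the source, so the averaging form and the linear adaptation law below are assumptions.

```python
import numpy as np

def posterior_snr(subband_pw, noise_var):
    """Posterior SNR per the reconstruction of Equation [11]: mean ratio of
    the current sub-band power to the estimated noise variance per sub-band."""
    return float(np.mean(np.asarray(subband_pw) / (np.asarray(noise_var) + 1e-12)))

def adapt_threshold(base_thr, snr, slope=0.1, ref_snr=10.0):
    """Illustrative linear adaptation; the patent only states that each
    feature's threshold update depends on how the SNR affects that feature."""
    return base_thr * (1.0 + slope * (snr - ref_snr) / ref_snr)

snr = posterior_snr(np.full(16, 4.0), np.full(16, 2.0))   # roughly 2.0
thr = adapt_threshold(0.5, snr)                           # threshold lowered at low SNR
```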
  • Otherwise, mute determination continues according to the characteristic parameters of the current signal frame and the mute parameter threshold: the signal energy of the current signal frame is compared with a mute threshold. If it is less than the mute threshold, the current signal frame is determined to be mute and the mute flag is output. If it is greater than the mute threshold, the current signal frame is not mute but a noise frame, so the noise flag is output, the noise parameter estimates are initialized from the current noise frame and the previous noise frames, and the number of frames so far determined to be noise frames is recorded; when the recorded number reaches the number of frames required for initialization of the noise parameter estimates, the initialization process is flagged as complete.
  • Initializing the noise parameter estimates involves the mean μ_n and the variance σ_n² of the noise spectrum, calculated as shown in Equations [12] and [13]:
  μ_n = (1/T) · Σ_{t=1..T} U_pw(t)    Equation [12]
  σ_n² = (1/T) · Σ_{t=1..T} U_pw²(t) - μ_n²    Equation [13]
  where U_pw(t) in Equations [12] and [13] is the vector of sub-band powers of signal frame t.
  • After the initialization of the noise parameter estimates is complete, the spectral distance between the characteristic parameters of the current signal frame and the noise parameter estimates is calculated, and noise determination is performed according to this spectral distance: the calculated spectral distance is compared with the spectral distance threshold. If it is less than the set spectral distance threshold, mute determination proceeds according to the characteristic parameters of the current signal frame and the mute parameter threshold; that is, the signal energy of the current signal frame is compared with the mute threshold. If it is less than the mute threshold, the current signal frame is determined to be mute and the mute flag is output; if it is greater than the mute threshold, the current signal frame is not mute but a noise frame, so the noise flag is output, and the spectral mean and variance of the current signal frame are used to update the noise parameter estimates, which are then output, the updates taking the recursive forms given in the corresponding equations.
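The initialization/decision/update loop described in the last few paragraphs can be sketched as a small state machine. The smoothing factor, the normalized spectral-distance measure, and all thresholds below are assumptions; the patent's Equations [12] and [13] fix only the batch mean and variance used for initialization.

```python
import numpy as np

class NoiseTracker:
    """Sketch of the first-level loop: batch-initialize the noise mean and
    variance over the first noise frames (Equations [12], [13]), then use a
    normalized spectral distance to classify frames and recursively update
    the estimate. lam, dist_thr and mute_thr are assumed values."""

    def __init__(self, init_frames=20, lam=0.05, dist_thr=1.5, mute_thr=1e-4):
        self.buf = []
        self.init_frames = init_frames
        self.lam, self.dist_thr, self.mute_thr = lam, dist_thr, mute_thr
        self.mu = None
        self.var = None

    def classify(self, u_pw):
        u_pw = np.asarray(u_pw, dtype=float)
        if float(u_pw.sum()) < self.mute_thr:       # mute check first
            return "mute"
        if self.mu is None:                         # initialization phase
            self.buf.append(u_pw)
            if len(self.buf) == self.init_frames:
                b = np.asarray(self.buf)
                self.mu, self.var = b.mean(axis=0), b.var(axis=0)
            return "noise"
        # Normalized spectral distance to the current noise estimate
        d = float(np.mean(np.abs(u_pw - self.mu) / np.sqrt(self.var + 1e-12)))
        if d < self.dist_thr:
            # Recursive smoothing update of the noise mean and variance
            self.mu = (1 - self.lam) * self.mu + self.lam * u_pw
            self.var = (1 - self.lam) * self.var + self.lam * (u_pw - self.mu) ** 2
            return "noise"
        return "non-noise"

tracker = NoiseTracker(init_frames=3)
for _ in range(3):
    tracker.classify(np.full(8, 0.01))          # initialization frames
label = tracker.classify(np.full(8, 1.0))       # a much louder frame
```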
  • Otherwise, the current signal frame is a non-noise frame: the Posterior SNR of the current signal frame is calculated using Equation [11], the thresholds of the characteristic parameters are adjusted using the calculated Posterior SNR, and a non-noise (useful signal) flag is output.
  • The second signal class decision sub-module: if the first signal class determination sub-module judges the current signal frame to be a noise frame, the decision result is output directly. If it is judged to be a non-noise frame, the current signal frame enters the second signal class determination sub-module for classification into voiced, unvoiced, and music signals.
  • The specific decision can be made in two steps. In the first step, the signal is strictly judged according to the characteristics of the feature parameters, and the non-noise signal is classified as voiced, unvoiced, or music; the judgment method used is mainly hard decision (threshold decision).
  • The second step mainly handles the uncertain signals that are judged to be both voiced and music, or neither voiced nor music.
  • A probability model is used to calculate the probabilities that the uncertain signal belongs to the voiced and music classes, and the most probable class is taken as the final classification of the uncertain signal.
  • The probability model may be a Gaussian mixture model (GMM), whose input parameters are those extracted by the feature parameter extraction module.
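As an illustration of the second-step probabilistic decision, the sketch below scores an uncertain frame's feature vector under two diagonal-covariance Gaussian mixture models and picks the more likely class. All mixture weights, means, and variances are placeholder values; in practice they would be trained on labelled voiced and music data.

```python
import numpy as np

def gmm_loglik(x, weights, means, variances):
    """Log-likelihood of x under a diagonal-covariance GMM."""
    x = np.asarray(x, dtype=float)
    ll = []
    for w, m, v in zip(weights, means, variances):
        m, v = np.asarray(m, float), np.asarray(v, float)
        comp = -0.5 * np.sum(np.log(2 * np.pi * v) + (x - m) ** 2 / v)
        ll.append(np.log(w) + comp)
    ll = np.asarray(ll)
    top = ll.max()
    return top + np.log(np.exp(ll - top).sum())   # log-sum-exp over components

# Placeholder models; real parameters would come from training data.
voiced_gmm = dict(weights=[0.5, 0.5], means=[[0.2, 0.1], [0.4, 0.2]],
                  variances=[[0.05, 0.02], [0.05, 0.02]])
music_gmm = dict(weights=[1.0], means=[[0.8, 0.6]], variances=[[0.1, 0.1]])

def classify_uncertain(features):
    """Pick the class (voiced vs. music) with the larger likelihood."""
    lv = gmm_loglik(features, **voiced_gmm)
    lm = gmm_loglik(features, **music_gmm)
    return "voiced" if lv >= lm else "music"

label = classify_uncertain([0.25, 0.12])
```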
  • The decision process of the first step is shown in FIG. 4: first the feature parameters of the non-noise frame output by the first signal class determination sub-module are extracted, and then the feature parameters of the non-noise signal frame are compared with the unvoiced parameter threshold:
  • The characteristic parameter used in determining unvoiced sound may be the zero-crossing rate (zcr); if the zero-crossing rate is greater than the unvoiced parameter threshold, the non-noise signal frame is determined to be unvoiced, and the unvoiced signal flag is output.
  • If the comparison of the characteristic parameters of the non-noise signal frame with the unvoiced parameter threshold does not place it in the unvoiced category, determination continues as to whether the non-noise signal frame is voiced: if the comparison of its characteristic parameters with the voiced parameter threshold places it in the voiced category, the non-noise frame is determined to be voiced and the voiced signal flag is set to 1; otherwise, the non-noise frame is determined not to be voiced and the voiced signal flag is set to 0.
  • The characteristic parameters used for voiced sound may be the spectral flux and its variance (var_flux); if the spectral flux is greater than the corresponding voiced parameter threshold, or the variance (var_flux) is greater than its corresponding voiced parameter threshold, the frame is determined to be voiced and the voiced signal flag is set to 1; otherwise, the non-noise frame is determined not to be voiced and the voiced signal flag is set to 0.
  • The characteristic parameter used in determining music may be the moving average of var_flux (varmov_flux).
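The first-step hard decision described above (zcr for unvoiced, flux/var_flux for voiced, varmov_flux for music) can be sketched as a threshold cascade. Every threshold value, and the assumed direction of the varmov_flux comparison for music, are illustrative guesses; the patent gives no numeric thresholds here.

```python
def first_step_decision(zcr, flux, var_flux, varmov_flux,
                        zcr_thr=0.35, flux_thr=0.1,
                        var_flux_thr=0.02, varmov_flux_thr=0.01):
    """Sketch of the first-step hard decision; all thresholds are assumed.
    Returns (label, voiced_flag, music_flag); a frame judged both or neither
    voiced and music is left 'uncertain' for the second, probabilistic step."""
    if zcr > zcr_thr:                                  # unvoiced test first
        return "unvoiced", 0, 0
    voiced = 1 if (flux > flux_thr or var_flux > var_flux_thr) else 0
    music = 1 if varmov_flux < varmov_flux_thr else 0  # direction assumed
    if voiced and not music:
        return "voiced", 1, 0
    if music and not voiced:
        return "music", 0, 1
    return "uncertain", voiced, music
```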
  • If a signal is judged both voiced and music, or neither, it is judged an uncertain signal; the second-step auxiliary decision method, such as probabilistic judgment, then continues the decision on the uncertain signal and classifies it as either voiced or music, so that the non-noise signals are finally divided into voiced, unvoiced, and music.
  • The probabilistic model is used to calculate the probabilities that the uncertain signal frame belongs to the voiced and music classes, and the sound category with the maximum probability is taken as the final classification of the uncertain signal frame; the type flag of the uncertain signal frame is then modified, and the type flag of the signal frame is output.
  • Further, the calculated maximum probability may be compared with a set probability threshold pth; if the maximum probability exceeds pth, the class is carried over (smeared) to the signal frames following the non-noise frame; otherwise, no smearing is performed.
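One plausible reading of the "smearing" step is a hangover: when the winning probability of a non-noise frame exceeds pth, the following frames inherit its class. The sketch below follows that reading; pth, the hangover length, and the exclusion of noise/mute frames are all assumptions.

```python
def smooth_labels(labels, max_prob, pth=0.9, hangover=3):
    """Illustrative class 'smearing': a confident non-noise decision
    (probability >= pth) is carried over to the next `hangover` frames."""
    out = list(labels)
    carry, cls = 0, None
    for i, (lab, p) in enumerate(zip(labels, max_prob)):
        if carry > 0 and lab not in ("noise", "mute"):
            out[i] = cls          # inherit the confident class
            carry -= 1
        if p >= pth and lab not in ("noise", "mute"):
            carry, cls = hangover, out[i]
    return out

labs = ["music", "voiced", "voiced", "noise", "voiced"]
probs = [0.95, 0.5, 0.5, 0.5, 0.5]
smoothed = smooth_labels(labs, probs)
```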
  • The characteristic parameter used when discriminating the sound category of the current signal frame may be any one of the characteristic parameters listed above, or a combination of them; it is only necessary to use these feature parameters together with the feature parameter thresholds to determine the sound category to which the current signal frame belongs, without departing from the idea of the present invention.
  • The second embodiment provided by the present invention is a sound activity detecting method, whose main idea is: extracting the feature parameters of the current signal frame, and determining, according to the feature parameters and the set parameter thresholds, the sound category to which the current signal frame belongs.
  • the specific implementation process includes the following contents:
  • Framing, pre-emphasis, windowing, and fast Fourier transform (FFT) processing are performed in sequence on the current signal frame to obtain the corresponding frequency-domain signal, and the feature parameters of the resulting frequency-domain signal frame are then extracted. The pre-emphasis enhances the spectrum of the input current signal frame, and the windowing reduces the discontinuity of the signal at the beginning and end of the frame.
  • If the noise parameter estimate initialization process is not complete, the noise is strictly determined according to the feature parameters and the set noise parameter thresholds:
  • If the comparison of the feature parameters with the set noise parameter thresholds does not place the frame in the noise category, the current signal frame is determined to be a non-noise frame, the Posterior SNR of the current signal frame is calculated, and the set feature parameter thresholds are adjusted using the Posterior SNR.
  • the specific implementation is similar to the related description in the first embodiment, and will not be described in detail herein.
  • If the spectral distance is less than the set spectral distance threshold, the current signal frame is determined to be a noise frame, and mute determination proceeds according to the characteristic parameters of the current signal frame and the mute parameter threshold: the signal energy of the current signal frame is compared with a mute threshold. If it is less than the mute threshold, the current signal frame is determined to be mute and the mute flag is output; if it is greater than the mute threshold, the current signal frame is not mute but a noise frame, so a noise flag is output and the noise parameters of the current frame are used to update the noise parameter estimates;
  • Otherwise, the Posterior SNR of the current signal frame is calculated, and the set feature parameter thresholds are adjusted using the Posterior SNR.
  • the specific implementation is similar to the related description in the first embodiment, and will not be described in detail herein.
  • Whether the current signal frame is voiced is determined according to the voiced parameter threshold and the characteristic parameters of the current signal frame: the feature parameters of the current signal frame are compared with the voiced parameter threshold; when the comparison result falls in the voiced category, the current signal frame is determined to be voiced; otherwise, it is determined not to be voiced. Whether the current signal frame is music is determined according to the music parameter threshold and the characteristic parameters of the current signal frame: the feature parameters of the current signal frame are compared with the music parameter threshold; when the comparison result falls in the music category, the current signal frame is determined to be music; otherwise, it is determined not to be music.
  • The probability model is used to calculate the probabilities that the current signal frame belongs to the voiced and music classes, and the sound category with the larger probability is selected as the category of the current signal frame.
  • the specific implementation is similar to the related description in the first embodiment and will not be described in detail herein.
  • When sound activity detection is required, embodiments of the present invention extract the feature parameters used in the classification process, and thus do not depend on a specific coding algorithm; the detection is independent and easy to maintain and update.
  • The embodiment of the present invention determines the sound category to which the current signal frame belongs according to the extracted feature parameters and the set parameter thresholds, and can divide the input narrowband or wideband audio digital signal into five categories: mute, noise, voiced, unvoiced, and music. When applied in the field of speech coding, this can serve both as the basis for rate selection in newly developed variable-rate audio coding algorithms and standards, and as a rate-selection mechanism for existing coding standards without a VAD algorithm.
  • the present invention can also be applied to other speech signal processing fields such as speech enhancement, speech recognition, speaker recognition, etc., and has strong versatility.
  • It is intended that the present invention cover the modifications and variations of the invention.

Abstract

A sound activity detecting method and a sound activity detecting device. The core of the method and device is as follows: when sound activity needs to be detected, the feature parameters of the current signal frame are extracted, and the sound class of the current signal frame is determined according to the feature parameters and the set parameter threshold values.

Description

Sound activity detection method and sound activity detector

Technical Field
The present invention relates to speech signal processing techniques, and more particularly to a sound activity detection method and a sound activity detector.

Background Art
In the field of speech signal processing, there is a technology for detecting voice activity. When it is applied in speech coding technology, it is usually called Voice Activity Detection (VAD); when it is applied in speech recognition technology, it is commonly referred to as Speech Endpoint Detection; and when it is used in speech enhancement technology, it is commonly referred to as Speech Pause Detection. These technologies have different focuses in different application scenarios and produce different processing results, but their essence is the same: to detect whether speech is present during voice communication. The accuracy of the detection results directly affects the quality of subsequent processing, such as speech coding, speech recognition, and speech enhancement.
Voice activity detection technology was developed mainly for the speech signals input to an encoder. In speech coding, the audio signal input to the encoder is divided into two types, background noise and active speech, which are then encoded at different rates: background noise at a lower rate and active speech at a higher rate. This reduces the average bit rate of communication and has promoted the development of variable-rate speech coding. However, as coding technology develops toward multiple rates and wider bandwidths, the signals input to the encoder become diversified: they are no longer limited to speech but also include music and various kinds of noise. Therefore, before encoding, the different input signals need to be distinguished so that different bit rates, or even encoders with different core coding algorithms, can be used to encode them.
A first prior art related to the present invention is the wideband Adaptive Multi-Rate speech coder (AMR-WB+), a multi-rate coding standard developed by the 3rd Generation Partnership Project (3GPP) for, but not limited to, third-generation mobile communication systems. It has two core coding algorithms: Algebraic Code Excited Linear Prediction (ACELP) and Transform Coded Excitation (TCX). The ACELP mode is suitable for speech signals, while TCX is suitable for wideband signals containing music, so the choice between the two modes can be regarded as a choice between speech and music. The coding algorithm selects between the ACELP and TCX modes in either open-loop or closed-loop fashion. Closed-loop selection is an exhaustive search based on a perceptually weighted signal-to-noise ratio and is independent of the VAD module. Open-loop selection builds on the VAD module of the AMR-WB+ coding algorithm by adding short-term and long-term statistics of the feature parameters and improving the handling of non-speech features, and can classify speech and music to a certain extent. Moreover, when the ACELP mode has been selected consecutively fewer than three times, a small-scale exhaustive search is still performed, and since the feature parameters used for classification are all obtained through the coding algorithm, this method is very closely coupled to the AMR-WB+ coding algorithm.
A second prior art related to the present invention is the Selectable Mode Vocoder (SMV), a multi-rate speech coding standard developed by the Third Generation Partnership Project 2 (3GPP2) for the CDMA2000 system. It offers four encoding rates, 9.6, 4.8, 2.4 and 1.2 kbps (actual net rates of 8.55, 4.0, 2.0 and 0.8 kbps), allowing mobile operators to trade off flexibly between system capacity and voice quality, and its algorithm contains a music detection module. This module uses some of the parameters calculated by the VAD module to further compute the parameters needed for music detection, and it runs after VAD detection: based on the output decision of the VAD module and the computed music-detection parameters, it makes a supplementary judgment and outputs a music/non-music classification result. It is therefore also very closely coupled to the coding algorithm.
As can be seen from the prior art, music signals are detected on the basis of the VAD techniques in existing speech coding standards, so the detection is closely tied to the coding algorithm. That is, the coupling with the encoder itself is too strong; independence, generality and maintainability are generally poor; and the cost of porting between codecs is high.
In addition, existing VAD algorithms were all developed for speech signals, so they divide the input audio signal into only two types: noise and speech (non-noise). Even when music detection is included, it serves only as a correction and supplement to the VAD decision. As codec application scenarios gradually shift from processing mainly speech to processing multimedia sound (including music), and codec algorithms themselves extend from narrowband to wideband, the simple output categories of existing VAD algorithms are clearly insufficient to describe the wide variety of audio signal characteristics.

Summary of the Invention
Embodiments of the present invention provide a sound activity detection method and a sound activity detector that can extract the feature parameters of a signal independently of the coding algorithm and use the extracted feature parameters to determine the sound class to which the input signal frame belongs.
The embodiments of the present invention are implemented by the following technical solutions.

An embodiment of the present invention provides a sound activity detection method, which includes:

extracting feature parameters from the current signal frame when sound activity detection is required; and

determining the sound class to which the current signal frame belongs according to the feature parameters and set parameter thresholds.
An embodiment of the present invention further provides a sound activity detector, which includes:

a feature parameter extraction module, configured to extract feature parameters from the current signal frame when sound activity detection is required; and

a signal class determination module, configured to determine the sound class to which the current signal frame belongs according to the feature parameters and set parameter thresholds.
As can be seen from the specific embodiments provided above, the embodiments of the present invention extract the feature parameters used to determine the sound class of the input signal frame when sound activity detection is required; the extraction therefore does not depend on any specific coding algorithm and is performed independently, which facilitates maintenance and updating.

Brief Description of the Drawings
Figure 1 is a structural diagram of a first embodiment provided by the present invention;

Figure 2 is a schematic diagram of the operation of the signal preprocessing module in the first embodiment provided by the present invention;

Figure 3 is a schematic diagram of the operation of the first signal class determination sub-module in the first embodiment provided by the present invention;

Figure 4 is a schematic diagram of the operation of the second signal class determination sub-module in the first embodiment when determining the class of a non-noise signal;

Figure 5 is a schematic diagram of the operation of the second signal class determination sub-module in the first embodiment when judging an uncertain signal.

Detailed Description
Speech signals, noise signals and music signals have different distribution characteristics in the spectrum, and the frame-to-frame variations of speech, music and noise sequences each have their own characteristics as well. Embodiments of the present invention therefore first extract the feature parameters of various audio signals based on the characteristics of the signal frames, and then perform a primary classification of the input narrowband or wideband digital audio signal frames according to these parameters, dividing the input signal into non-noise signal frames (i.e. useful signals, including speech and music), noise frames and silence frames. The frames judged to be non-noise are then further classified into voiced, unvoiced and music signal frames.
The first embodiment provided by the present invention is a General Sound Activity Detector (GSAD), whose structure is shown in Figure 1 and which includes a signal preprocessing module, a feature parameter extraction module and a signal class determination module. The signal class determination module includes a first signal class determination sub-module and a second signal class determination sub-module.
The signal transfer relationships between the modules are as follows:
The input signal frame enters the signal preprocessing module, where the input digital sound signal sequence undergoes spectral pre-emphasis and a Fast Fourier Transform (FFT) in preparation for the subsequent feature parameter extraction.
After being processed by the signal preprocessing module, the signal is input to the feature parameter extraction module to obtain the feature parameters. To reduce system complexity, all feature parameters of the GSAD are extracted from the FFT spectrum. This module also extracts and updates noise parameters in order to compute the signal-to-noise ratio of the signal, which controls the updating of some decision thresholds.
In the signal class determination module, the first signal class determination sub-module first performs a primary classification of the signal frames from the signal preprocessing module according to the extracted feature parameters, dividing the input signal into non-noise signals (i.e. useful signals, including speech and music) and noise or silence signals. Then, in the second signal class determination sub-module, the signals judged to be non-noise by the first sub-module are further classified into voiced, unvoiced and music signals. Through this two-stage classification, the final classification result is obtained: noise, silence, voiced, unvoiced or music.
The specific processing of each module is described below.
I. Signal preprocessing module

The working principle of the signal preprocessing module is shown in Figure 2: the input signal is subjected in turn to framing, pre-emphasis, windowing and FFT transformation.
Framing: the input digital sound signal sequence is divided into frames. The processed frame length is 10 ms and the frame shift is also 10 ms, i.e. there is no overlap between frames. If the subsequent processing system of this embodiment, such as an encoder, has a processing frame length that is a multiple of 10 ms, the signal can be split into 10 ms sound frames for processing.
Pre-emphasis: assuming the sound sample value at time n is x(n), the speech sample value xp obtained after pre-emphasis is given by Equation [1]:

xp(n) = x(n) − α·x(n−1)    Equation [1]

where α (0.9 < α < 1.0) is the pre-emphasis factor.
Windowing: windowing reduces the discontinuity of the signal at the start and end of each frame. The pre-emphasized speech samples xp are multiplied frame by frame with a Hamming window, as in Equation [2]:

xw(n) = w(n) · xp(n)    Equation [2]

where 0 ≤ n ≤ N − 1 and w(n) is the Hamming window function:

w(n) = 0.54 − 0.46·cos(2πn / (N − 1)),  0 ≤ n ≤ N − 1    Equation [3]

where N is the window length of the Hamming window and takes different values for different sampling frequencies; for embodiments with sampling rates of 8 kHz and 16 kHz, N is 80 and 160 respectively.
FFT spectral transform: after Hamming windowing, a standard FFT spectral transform is performed. At the 8 kHz and 16 kHz sampling rates the transform window length is 256, with zero padding where the frame is shorter; in other cases the transform is adapted as appropriate.
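For illustration, the framing-related preprocessing steps above can be sketched as follows. This is a minimal sketch, not the patent's implementation: the pre-emphasis factor value and the handling of the first sample of each frame (which has no predecessor) are assumptions.

```python
import math

def preprocess_frame(samples, alpha=0.95):
    """Pre-emphasize and Hamming-window one frame (Equations [1]-[3]).

    alpha (0.9 < alpha < 1.0) is the pre-emphasis factor; the first
    sample of the frame is left unchanged, an assumption the patent
    does not spell out.
    """
    N = len(samples)
    # Equation [1]: xp(n) = x(n) - alpha * x(n-1)
    xp = [samples[0]] + [samples[n] - alpha * samples[n - 1] for n in range(1, N)]
    # Equation [3]: Hamming window w(n) = 0.54 - 0.46 * cos(2*pi*n / (N-1))
    w = [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]
    # Equation [2]: xw(n) = w(n) * xp(n)
    return [w[n] * xp[n] for n in range(N)]

frame = [1.0] * 160  # a dummy 10 ms frame at 16 kHz (160 samples)
xw = preprocess_frame(frame)
```

At 16 kHz the window length N = 160 matches the 10 ms frame length stated above; the windowed frame would then be zero-padded to 256 points before the FFT.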
II. Feature parameter extraction module

The main function of the feature parameter extraction module is to extract the feature parameters of the input signal, mainly spectral parameters. The spectral parameters include short-term feature parameters and their quasi-long-term features. The short-term feature parameters include: spectral flux, 95% spectral rolloff, zero crossing rate (zcr), intra-frame spectral variance, and the ratio of low-frequency band energy to full-band energy. The quasi-long-term features are the variance and moving average of each short-term feature parameter; in one embodiment of the present invention the statistics cover 10 frames, i.e. a duration of 100 ms.
The definitions and calculation formulas of these feature parameters are given below.

Define x(i) as the i-th time-domain sample of a frame of the sound signal, where 0 ≤ i < M; T denotes the number of frames; M denotes the number of samples in one frame; N denotes the window length of the FFT transform; U_pw(k) denotes the magnitude at frequency k of the FFT spectrum of the current frame; and var denotes the variance of a feature parameter of the current signal frame. Taking a sound signal with a 16 kHz sampling rate as an example, the extraction of the short-term feature parameters is described in detail below.
1. Computing the spectral flux (flux) and its variance (var_flux)

The spectral flux is computed as in Equation [4]:

flux(i) = Σ_{k=1..N} ( U_pw_i(k) − U_pw_{i−1}(k) )²    Equation [4]

The variance of the spectral flux (var_flux) is computed as in Equation [5]:

var_flux(i) = (1/10) · Σ_{j=i−10..i} ( flux(j) − mean_flux(i) )²    Equation [5]

where, when the sampling frequency of the input audio signal is 16 kHz, mean_flux(i) denotes the mean of the normalized spectral flux parameter from frame i−10 to frame i.
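A minimal sketch of Equations [4] and [5] follows. It is illustrative only and not the patent's implementation; plain Python lists stand in for the FFT magnitude spectra, and the history length passed in plays the role of the patent's 10-frame window.

```python
def spectral_flux(cur_mag, prev_mag):
    # Equation [4]: sum of squared differences between the magnitude
    # spectra of the current and previous frames
    return sum((c - p) ** 2 for c, p in zip(cur_mag, prev_mag))

def flux_variance(flux_history):
    # Equation [5]: variance of the flux values over the supplied
    # history (the patent averages over 10 frames, i.e. 100 ms)
    m = sum(flux_history) / len(flux_history)
    return sum((f - m) ** 2 for f in flux_history) / len(flux_history)
```

Music tends to produce a smaller, more stable flux than speech, which is why both the instantaneous value and its variance are tracked.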
2. Computing the 95% spectral rolloff (rolloff) and its variance (rolloff_var)

rolloff denotes the frequency position at which the energy accumulated from low frequency toward high frequency reaches 95% of the full-band energy. It is computed as in Equation [6]:

rolloff = arg max over K of ( Σ_{i=1..K} U_pw(i) < 0.95 · Σ_{j=1..N} U_pw(j) )    Equation [6]

The variance of the 95% spectral rolloff (rolloff_var) is computed as in Equation [7]:

rolloff_var(i) = (1/10) · Σ_{j=i−10..i} ( rolloff(j) − mean_rolloff(i) )²    Equation [7]

where mean_rolloff(i) denotes the mean of the 95% rolloff parameter from frame i−10 to frame i.

3. Computing the zero crossing rate (zcr):

zcr = Σ_i II{ x(i) · x(i−1) < 0 }    Equation [8]

where the value of II{A} is determined by A: II{A} is 1 when A is true and 0 when A is false.

4. Computing the variance of the intra-frame spectral magnitude (magvar):

magvar = (2/N) · Σ_{j=N/2..N} ( U_pw(j) − mean_U_pw )²    Equation [9]

where mean_U_pw denotes the spectral mean of the high-frequency portion of the current frame.

5. Computing the energy ratio of the low-frequency band to the full band (ratio1):

ratio1 = ( Σ_{k=R1_F1..R1_F2} U_pw(k) ) / ( Σ_{k=1..N} U_pw(k) )    Equation [10]

where R1_F1 denotes the lower limit of the low-frequency sub-band and R1_F2 denotes its upper limit.
As can be seen from the above, the feature parameters are extracted by an independent module rather than inside a coding algorithm, so the feature parameter extraction module does not depend on any existing encoder. Moreover, since the feature extraction does not depend on bandwidth, the GSAD does not depend on the signal sampling rate, which greatly enhances the portability of the system.
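The short-term features of Equations [6] through [10] can be sketched together in a few lines. This is an illustrative reading only, not the patent's implementation: the `lo`/`hi` arguments stand in for the R1_F1/R1_F2 bin limits, and the bin range assumed in `magnitude_variance` is a guess at the partially garbled Equation [9].

```python
def spectral_rolloff(mag, fraction=0.95):
    """Equation [6]: largest bin K whose cumulative energy is still
    below `fraction` of the full-band energy."""
    total = sum(mag)
    cum, k = 0.0, 0
    for i, m in enumerate(mag, start=1):
        cum += m
        if cum < fraction * total:
            k = i
        else:
            break
    return k

def zero_crossing_rate(x):
    # Equation [8]: count of sign changes between adjacent samples
    return sum(1 for i in range(1, len(x)) if x[i] * x[i - 1] < 0)

def magnitude_variance(mag_hi):
    """Equation [9] sketch: variance of the high-frequency spectral
    magnitudes around their mean (exact bin range and scaling in the
    patent are assumptions)."""
    m = sum(mag_hi) / len(mag_hi)
    return sum((u - m) ** 2 for u in mag_hi) / len(mag_hi)

def low_band_ratio(mag, lo, hi):
    # Equation [10]: energy in bins [lo, hi] over full-band energy;
    # lo/hi are hypothetical stand-ins for R1_F1/R1_F2
    total = sum(mag)
    return sum(mag[lo:hi + 1]) / total if total else 0.0
```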
III. First signal class determination sub-module

The function of the first signal class determination sub-module is to divide the input digital sound signal into three classes: silence, noise signals and non-noise signals (i.e. useful signals). This is accomplished mainly through three parts: noise parameter initialization, noise determination and noise updating. Before the noise parameters are initialized, the long-term requirement of the initialization process is adjusted according to the current environment (speech/music): the requirement is shortened when the current environment is speech and lengthened when it is music. The working principle of the first signal class determination sub-module is shown in Figure 3:
First, the feature parameters of the current frame are obtained.
Then, it is checked whether the initialization of the noise parameter estimates has been completed:
If the noise parameter initialization has not been completed, a strict noise determination is made for the current signal frame based on its feature parameters and the noise parameter thresholds: the feature parameters of the current signal frame are compared with the noise parameter thresholds, and if the comparison result falls within the noise category, the strict determination is that the current signal frame is a noise frame; otherwise, the strict determination is that the current frame is a non-noise frame (i.e. a useful signal).
When making the noise determination, the variance of the spectral magnitude of the current signal frame, magvar, can be used as the feature parameter compared with the noise parameter threshold: if magvar is smaller than the threshold, the strict determination is that the current signal frame is a noise frame; otherwise, the strict determination is that the current frame is a non-noise frame (i.e. a useful signal).
If the strict determination is that the current frame is a non-noise frame, a non-noise flag is output and the posterior signal-to-noise ratio (Posterior SNR) of the current frame is computed using Equation [11]. The computed Posterior SNR is used to adjust the feature parameter thresholds for silence, noise, unvoiced, voiced and music:

PosteriorSNR = ( Σ_{k=1..K} U_pw(k) ) / σ_n    Equation [11]

where σ_n denotes the variance of the noise and K is the number of sub-bands.
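One plausible reading of Equation [11] can be sketched as below. Note that the patent's typeset formula is damaged, so the exact normalization (total sub-band power over the noise variance estimate) is an assumption.

```python
def posterior_snr(subband_power, noise_variance):
    """Hedged sketch of Equation [11]: the frame's total sub-band power
    divided by the noise variance estimate. The exact normalization is
    an assumption, since the published formula is garbled."""
    return sum(subband_power) / noise_variance
```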
The purpose of adaptively adjusting and updating the feature parameter thresholds is to make the decision process reach the same decision result under different SNR conditions. For the same signal segment, the values of the same feature parameter differ under different SNRs (as reflected by the Posterior SNR); that is, the values of the signal's feature parameters are affected by the SNR. Therefore, to reach the same decision result under different SNRs, the decision threshold of each feature parameter must be updated adaptively according to the SNR of the current signal frame, the specific update depending on how the corresponding feature parameter is actually affected by the SNR.
If the strict determination is that the current signal frame is a noise frame, a silence determination is then made based on the feature parameters of the current signal frame and the silence parameter threshold: the signal energy of the current frame is compared with a silence threshold. If it is below the threshold, the current signal frame is judged to be silence and a silence flag is output. If it is above the threshold, the current signal frame is not silence but a noise frame, so a noise flag is output and the noise parameter estimates are initialized from the current noise frame and the preceding noise frames, while the number of frames judged to be noise so far is recorded; when the recorded number of frames reaches the number required for initialization, the noise parameter initialization is marked as complete. The initialization of the noise parameter estimates involves the mean E_n and variance σ_n of the noise spectrum, computed as in Equations [12] and [13]:

E_n = (1/T) · Σ_{t=1..T} U_PW    Equation [12]

σ_n = (1/T) · Σ_{t=1..T} U_PW²    Equation [13]

where U_PW in Equations [12] and [13] is the matrix vector of the sub-band powers of the current signal frame.
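Equations [12] and [13] can be sketched as follows; this is illustrative only, and the per-frame list-of-lists data layout is an assumption.

```python
def init_noise_estimates(frames):
    """Equations [12]/[13]: per-sub-band mean and mean-square of the
    sub-band powers over the first T frames judged to be noise.
    `frames` is a list of per-frame sub-band power vectors."""
    T = len(frames)
    K = len(frames[0])
    mean = [sum(f[k] for f in frames) / T for k in range(K)]      # E_n
    power = [sum(f[k] ** 2 for f in frames) / T for k in range(K)]  # sigma_n
    return mean, power

mean, power = init_noise_estimates([[1.0, 2.0], [3.0, 4.0]])
```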
If the noise parameter initialization has been completed, the spectral distance between the feature parameters of the current signal frame and the noise parameter estimates is computed, and the noise determination is made by comparing that spectral distance with a spectral distance threshold. If the computed spectral distance is smaller than the threshold, the silence determination described above is made: the signal energy of the current frame is compared with the silence threshold; if it is below the threshold, the current frame is judged to be silence and a silence flag is output; if it is above the threshold, the current frame is not silence but a noise frame, so a noise flag is output, the noise parameter estimates are updated with the spectral mean E_n and variance σ_n of the current frame, and the updated estimates are output. The update formulas are given by Equations [14] and [15]:

E_n(i) = (1 − β) · E_n(i − 1) + β · E_n(i)    Equation [14]

σ_n(i) = (1 − α) · σ_n(i − 1) + α · σ_n(i)    Equation [15]
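The recursive updates of Equations [14] and [15] are first-order exponential smoothing, which can be sketched directly; the values of the smoothing factors α and β are assumptions, since the patent does not state them.

```python
def update_noise_mean(prev_mean, frame_mean, beta=0.1):
    # Equation [14]: recursive smoothing of the noise spectral mean;
    # beta controls how fast the estimate tracks the current frame
    return (1 - beta) * prev_mean + beta * frame_mean

def update_noise_variance(prev_var, frame_var, alpha=0.1):
    # Equation [15]: the same recursion applied to the noise variance
    return (1 - alpha) * prev_var + alpha * frame_var
```

A small β/α makes the noise estimate slow-moving, so brief bursts of speech or music do not corrupt it.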
If the computed spectral distance is larger than the set spectral distance threshold, the current signal frame is a non-noise frame; the Posterior SNR of the current signal frame is then computed using Equation [11], the feature parameter thresholds are adjusted with the newly computed Posterior SNR, and a non-noise (useful signal) flag is output.
IV. Second signal class determination sub-module

If the current signal frame has been judged by the first signal class determination sub-module to be a noise frame, the decision result is output directly; if it has been judged to be a non-noise frame, the current signal frame enters the second signal class determination sub-module for classification into voiced, unvoiced and music signals. The specific decision can be made in two steps. In the first step, the signal is strictly judged according to the characteristics of the feature parameters, classifying the non-noise signal as voiced, unvoiced or music; the judgment used here is mainly a hard (threshold) decision. The second step addresses uncertain signals that have been judged to be both voiced and music, or neither voiced nor music. Various auxiliary decision methods can be used, for example a probabilistic decision: probability models are used to compute the probability that the uncertain signal is voiced and the probability that it is music, and the class with the larger probability is taken as the final classification. The probability model may be a Gaussian mixture model (GMM), whose parameters are the parameters extracted by the feature parameter extraction module.
The decision flow of the first step is shown in Figure 4. First, the feature parameters of the non-noise frames output by the first signal class determination sub-module are obtained; then the feature parameters of the non-noise signal frame are compared with the unvoiced parameter threshold:
If the comparison of the feature parameters of the non-noise signal frame with the unvoiced parameter threshold falls within the unvoiced category, the non-noise signal frame is judged to be unvoiced and an unvoiced signal flag is output. The feature parameter used to judge unvoiced speech can be the zero crossing rate (zcr): if zcr is greater than the unvoiced parameter threshold, the non-noise signal frame is judged to be unvoiced and the unvoiced signal flag is output.
If the comparison does not fall within the unvoiced category, it is next determined whether the non-noise signal frame is voiced. If the comparison of the frame's feature parameters with the voiced parameter thresholds falls within the voiced category, the non-noise frame is determined to be voiced and the voiced signal flag is set to 1; otherwise, the frame is determined not to be voiced and the voiced signal flag is set to 0. The feature parameters used to judge voiced speech can be the spectral flux (flux) and its variance (var_flux): if flux is greater than its corresponding voiced parameter threshold, or var_flux is greater than its corresponding voiced parameter threshold, the non-noise frame is judged to be voiced and the voiced signal flag is set to 1; otherwise the voiced signal flag is set to 0.
若所述非噪声信号帧的特征参数与清音参数阈值的比较结果不属于清音的范畴, 还要判定所述非噪声信号帧是否属于音乐的范畴, 若所述非噪声信号帧的特征参数与所述音乐参数阈值的比较结果属于音乐的范畴, 则确定所述非噪声帧属于音乐, 并设置音乐信号标志 = 1; 否则, 确定所述非噪声帧不属于音乐, 并设置音乐信号标志 = 0。 判定音乐时使用的特征参数可以是谱波动方差 (var_flux) 的移动平均 (varmov_flux), 若 varmov_flux 小于音乐参数阈值, 则将所述非噪声帧判定为音乐, 并设置音乐信号标志 = 1; 否则, 确定所述非噪声帧不属于音乐, 并设置音乐信号标志 = 0。  If the comparison result between the characteristic parameter of the non-noise signal frame and the unvoiced parameter threshold does not fall into the unvoiced category, it is also determined whether the non-noise signal frame belongs to the music category: if the comparison between the characteristic parameter and the music parameter threshold falls into the music category, the non-noise frame is determined to be music and the music signal flag is set to 1; otherwise, it is determined not to be music and the flag is set to 0. The characteristic parameter used in the music decision may be the moving average (varmov_flux) of the spectral-flux variance (var_flux): if varmov_flux is smaller than the music parameter threshold, the non-noise frame is determined to be music and the music signal flag is set to 1; otherwise, the flag is set to 0.
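The varmov_flux music decision above can be sketched as a moving average over recent var_flux values. The window length and the music threshold are assumed example values; the underlying intuition is that music tends to have a low, stable spectral-flux variance.

```python
def moving_average(values, window=4):
    """Average of the most recent `window` values."""
    recent = values[-window:]
    return sum(recent) / len(recent)


def is_music(var_flux_history, music_threshold=0.2):
    # Music when the moving average of the flux variance stays below
    # the (assumed) music threshold.
    return moving_average(var_flux_history) < music_threshold
```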
若所述非噪声帧既属于浊音又属于音乐, 或者所述非噪声帧既不属于浊音又不属于音乐, 那么将信号判为不确定类信号, 然后用第二步的辅助判决方法, 比如概率判断, 对不确定信号继续判决, 将其判为浊音或音乐的一种, 从而将非噪声最终分为浊音、 清音和音乐。 以采用概率判决的方式对不确定信号继续判决为例进行说明, 具体如图 5 所示:  If the non-noise frame belongs to both voiced sound and music, or belongs to neither, the signal is judged as an uncertain-class signal, and a second-step auxiliary decision method, such as a probability decision, is then applied to classify it as either voiced sound or music, so that non-noise is finally divided into voiced sound, unvoiced sound and music. Continuing the decision on the uncertain signal by means of a probability decision is described below as an example, as shown in Figure 5:
首先利用概率模型分别计算不确定信号帧属于浊音和音乐信号的概率, 并将最大的概率值对应的声音类别作为不确定信号帧的最终分类; 然后修改所述不确定信号帧的类型标志; 最后输出所述信号帧的类型标志。  First, a probabilistic model is used to calculate the probabilities that the uncertain signal frame belongs to the voiced and music classes respectively, and the sound category corresponding to the maximum probability value is taken as the final classification of the uncertain signal frame; then the type flag of the uncertain signal frame is modified; finally, the type flag of the signal frame is output.
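The maximum-probability step above can be sketched with two one-dimensional Gaussian class models. The patent does not specify the probabilistic model; the Gaussian form, the feature, and the model means/variances below are all assumptions for illustration.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian at x."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def classify_uncertain(feature, models):
    """models: {class_name: (mean, var)}. Returns (best_class, best_prob)."""
    probs = {name: gaussian_pdf(feature, m, v) for name, (m, v) in models.items()}
    best = max(probs, key=probs.get)
    return best, probs[best]


models = {"voiced": (1.0, 0.25), "music": (3.0, 0.25)}
print(classify_uncertain(1.2, models)[0])  # → voiced (closer to the voiced mean)
```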
在利用概率判决方法时, 还可以将所计算出的最大概率与设定概率阈值 pth 进行比较, 如果所计算出的最大概率超过所述概率阈值 pth, 则对所述非噪声帧后续的信号帧进行拖尾处理; 否则, 不进行拖尾处理。  When the probability decision method is used, the calculated maximum probability may also be compared with a set probability threshold pth; if the calculated maximum probability exceeds the probability threshold pth, hangover (trailing) processing is performed on the signal frames following the non-noise frame; otherwise, no hangover processing is performed.
上述实施例中, 当判别当前信号帧归属的声音类别时, 所使用的特征参数可以是上述列举的特征参数之一, 也可以为其组合。 只要利用这些特征参数与特征参数阈值结合能够判断出当前信号帧归属的声音类别, 均不脱离本发明的思想。  In the above embodiment, when discriminating the sound category to which the current signal frame belongs, the characteristic parameter used may be one of the above-listed characteristic parameters, or a combination thereof. As long as these characteristic parameters, combined with the characteristic parameter thresholds, can determine the sound category to which the current signal frame belongs, the idea of the present invention is not departed from.
本发明提供的第二实施例是一种声音活动检测方法, 其主要思想是: 提取当前信号帧的特征参数; 并根据所述特征参数以及设定的参数阈值确定所述当前信号帧归属的声音类别。 其具体实施过程包括如下内容:  The second embodiment provided by the present invention is a voice activity detection method whose main idea is: extracting the characteristic parameters of the current signal frame, and determining the sound category to which the current signal frame belongs according to the characteristic parameters and the set parameter thresholds. The specific implementation process includes the following contents:
首先, 对当前信号帧依次进行序列分帧处理、 预加重处理、 加窗处理和快速傅立叶变换 FFT 处理, 得到相应的频域信号; 然后提取得到的当前频域信号帧的特征参数。 其中, 预加重处理是为了增强输入的当前信号帧的频谱, 加窗处理是为了减小帧起始和结束处的信号的不连续性。 具体实现与第一实施例中的相关描述雷同, 这里不再详细描述。  First, sequence framing, pre-emphasis, windowing and fast Fourier transform (FFT) processing are performed in turn on the current signal frame to obtain the corresponding frequency-domain signal; the characteristic parameters of the resulting current frequency-domain signal frame are then extracted. The pre-emphasis processing is to enhance the spectrum of the input current signal frame, and the windowing processing is to reduce the discontinuity of the signal at the beginning and end of the frame. The specific implementation is similar to the related description in the first embodiment, and is not described in detail here.
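The preprocessing chain above (pre-emphasis, windowing, transform to the frequency domain) can be sketched as follows. The pre-emphasis coefficient 0.97 is a common choice assumed here, the Hamming window is one possible windowing function, and a plain DFT stands in for the FFT for clarity; none of these specifics are fixed by the text.

```python
import math


def pre_emphasis(frame, coeff=0.97):
    """y[n] = x[n] - coeff * x[n-1], boosting high frequencies."""
    return [frame[0]] + [frame[i] - coeff * frame[i - 1] for i in range(1, len(frame))]


def hamming(frame):
    """Taper the frame edges to reduce boundary discontinuities."""
    n = len(frame)
    return [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
            for i, x in enumerate(frame)]


def dft_magnitudes(frame):
    """Magnitude spectrum via a direct DFT (an FFT computes the same result)."""
    n = len(frame)
    mags = []
    for k in range(n):
        re = sum(frame[t] * math.cos(2 * math.pi * k * t / n) for t in range(n))
        im = -sum(frame[t] * math.sin(2 * math.pi * k * t / n) for t in range(n))
        mags.append(math.hypot(re, im))
    return mags


spectrum = dft_magnitudes(hamming(pre_emphasis([0.0, 1.0, 0.0, -1.0] * 4)))
```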
然后, 判断是否完成噪声参数估计值初始化过程:  Then, it is judged whether the noise parameter estimation value initialization process is completed:
若未完成噪声参数估计值初始化过程, 则根据所述特征参数以及设定的噪声参数阈值进行噪声严格判定:  If the noise parameter estimate initialization process is not completed, a strict noise decision is performed according to the characteristic parameter and the set noise parameter threshold:

将所述特征参数与所述设定的噪声参数阈值比较, 并当比较结果属于噪声的范畴时, 则判定所述当前信号帧为噪声帧, 然后根据所述特征参数以及静音参数阈值进行静音判定: 将所述特征参数与所述静音参数阈值比较, 当比较结果属于静音的范畴时, 则判定所述当前信号帧为静音帧, 并输出相应的静音标志; 否则, 判定当前信号帧为噪声帧, 并输出噪声帧标志, 根据所述当前噪声帧及其之前的噪声帧计算噪声参数估计值; 并记录当前判为噪声帧的信号帧的帧数; 当记录的信号帧数量到达噪声参数估计值初始化需要的帧数量时, 则标志噪声参数估计值初始化过程完成。 具体实现与第一实施例中的相关描述雷同, 这里不再详细描述。  The characteristic parameter is compared with the set noise parameter threshold; when the comparison result belongs to the noise category, the current signal frame is determined to be a noise frame, and a silence decision is then performed according to the characteristic parameter and the silence parameter threshold: the characteristic parameter is compared with the silence parameter threshold, and when the comparison result belongs to the silence category, the current signal frame is determined to be a silence frame and the corresponding silence flag is output; otherwise, the current signal frame is determined to be a noise frame and a noise frame flag is output, the noise parameter estimate is calculated from the current noise frame and the preceding noise frames, and the number of signal frames so far determined to be noise frames is recorded; when the number of recorded signal frames reaches the number of frames required for initializing the noise parameter estimate, the noise parameter estimate initialization process is marked as completed. The specific implementation is similar to the related description in the first embodiment, and is not described in detail here.

当所述特征参数与所述设定的噪声参数阈值的比较结果不属于噪声的范畴时, 则判定所述当前信号帧为非噪声帧, 则计算所述当前信号帧的 Posterior SNR, 并利用所述 Posterior SNR 调整所述设定的特征参数的阈值。 具体实现与第一实施例中的相关描述雷同, 这里不再详细描述。  When the comparison result between the characteristic parameter and the set noise parameter threshold does not belong to the noise category, the current signal frame is determined to be a non-noise frame; the Posterior SNR of the current signal frame is then calculated, and the thresholds of the set characteristic parameters are adjusted using the Posterior SNR. The specific implementation is similar to the related description in the first embodiment, and is not described in detail here.
当噪声参数估计值初始化过程完成后, 计算当前信号帧的特征参数与所述噪声参数估计值之间的频谱距离, 然后根据所述频谱距离与设定的频谱距离阈值, 对当前信号帧进行噪声判定:  After the noise parameter estimate initialization process is completed, the spectral distance between the characteristic parameter of the current signal frame and the noise parameter estimate is calculated, and a noise decision is then performed on the current signal frame according to the spectral distance and the set spectral distance threshold:

若所述频谱距离小于设定的频谱距离阈值, 则判定所述当前信号帧为噪声帧, 则继续根据所述当前信号帧的特征参数以及静音参数阈值进行静音判定, 即将当前信号帧的信号能量与一个静音阈值进行比较, 如果小于所述静音阈值, 则判定当前信号帧为静音, 于是输出静音标志; 如果大于静音阈值, 则说明当前信号帧不为静音, 而是噪声帧, 于是输出噪声标志, 并利用所述当前帧的噪声参数更新所述噪声参数估计值;  If the spectral distance is smaller than the set spectral distance threshold, the current signal frame is determined to be a noise frame, and a silence decision is further performed according to the characteristic parameter of the current signal frame and the silence parameter threshold: the signal energy of the current signal frame is compared with a silence threshold; if it is smaller than the silence threshold, the current signal frame is determined to be silence and a silence flag is output; if it is greater than the silence threshold, the current signal frame is not silence but a noise frame, so a noise flag is output, and the noise parameter estimate is updated using the noise parameters of the current frame;

否则, 判定所述当前信号帧为非噪声, 则计算所述当前信号帧的 Posterior SNR, 并利用所述 Posterior SNR 调整设定的特征参数判决门限的阈值。 具体实现与第一实施例中的相关描述雷同, 这里不再详细描述。  Otherwise, the current signal frame is determined to be non-noise; the Posterior SNR of the current signal frame is then calculated, and the thresholds of the set characteristic-parameter decision criteria are adjusted using the Posterior SNR. The specific implementation is similar to the related description in the first embodiment, and is not described in detail here.
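The three-way frame decision above (noise vs. silence vs. non-noise) can be sketched as follows. The distance metric and both threshold values are assumed for illustration; the patent leaves them unspecified.

```python
def spectral_distance(frame_spectrum, noise_spectrum):
    """Mean absolute difference between the frame and noise-estimate spectra."""
    return sum(abs(a - b) for a, b in zip(frame_spectrum, noise_spectrum)) / len(frame_spectrum)


def classify_frame(frame_spectrum, energy, noise_spectrum,
                   dist_threshold=0.5, silence_threshold=0.01):
    # Close to the noise estimate: separate silence from noise by energy.
    if spectral_distance(frame_spectrum, noise_spectrum) < dist_threshold:
        return "silence" if energy < silence_threshold else "noise"
    # Far from the noise estimate: pass on to the non-noise classifiers.
    return "non-noise"
```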
经过上述过程能够判断出输入的当前信号帧属于噪声、 静音和非噪声三类, 之后还要判定当前信号帧具体属于哪种非噪声类别, 具体如下:  After the above process, it can be judged that the input current signal frame belongs to three categories of noise, mute and non-noise, and then it is determined which specific non-noise category the current signal frame belongs to, as follows:
当当前信号帧为非噪声时, 根据清音参数阈值, 以及所述当前信号帧的特征参数, 判定所述当前信号帧是否为清音:  When the current signal frame is non-noise, whether the current signal frame is unvoiced is determined according to the unvoiced parameter threshold and the characteristic parameter of the current signal frame:

将当前信号帧的特征参数与清音参数阈值比较, 当比较结果属于清音的范畴时, 则判定所述当前信号帧为清音, 则输出相应的清音标志;  The characteristic parameter of the current signal frame is compared with the unvoiced parameter threshold; when the comparison result belongs to the unvoiced category, the current signal frame is determined to be unvoiced and the corresponding unvoiced flag is output;

否则, 根据浊音参数阈值, 以及所述当前信号帧的特征参数, 判定所述当前信号帧是否为浊音: 将所述当前信号帧的特征参数与所述浊音参数阈值比较, 当比较结果属于浊音的范畴时, 则判定所述当前信号帧为浊音; 否则, 判定所述当前信号帧不属于浊音; 并且根据音乐参数阈值, 以及所述当前信号帧的特征参数, 判定所述当前信号帧是否为音乐: 将所述当前信号帧的特征参数与所述音乐参数阈值比较, 当比较结果属于音乐的范畴时, 则判定所述当前信号帧为音乐; 否则, 判定所述当前信号帧不属于音乐。 具体实现与第一实施例中的相关描述雷同, 这里不再详细描述。  Otherwise, whether the current signal frame is voiced is determined according to the voiced parameter threshold and the characteristic parameter of the current signal frame: the characteristic parameter is compared with the voiced parameter threshold, and when the comparison result belongs to the voiced category, the current signal frame is determined to be voiced; otherwise, it is determined not to be voiced. In addition, whether the current signal frame is music is determined according to the music parameter threshold and the characteristic parameter of the current signal frame: the characteristic parameter is compared with the music parameter threshold, and when the comparison result belongs to the music category, the current signal frame is determined to be music; otherwise, it is determined not to be music. The specific implementation is similar to the related description in the first embodiment, and is not described in detail here.
当所述当前信号帧既属于浊音又属于音乐, 或, 当所述当前信号帧既不属于浊音又不属于音乐时, 利用概率模型分别计算所述当前信号帧属于浊音和音乐的概率, 并选择大的概率值对应的声音类别作为当前信号帧的归属类别。 具体实现与第一实施例中的相关描述雷同, 这里不再详细描述。  When the current signal frame belongs to both voiced sound and music, or belongs to neither, a probability model is used to calculate the probabilities that the current signal frame belongs to voiced sound and to music respectively, and the sound category corresponding to the larger probability value is selected as the category of the current signal frame. The specific implementation is similar to the related description in the first embodiment, and is not described in detail here.

比较所述大的概率值与概率阈值, 当所述大的概率值大于所述概率阈值时, 则根据当前信号帧所归属的声音类别对当前信号帧后续一定数量的信号帧进行拖尾处理。 具体实现与第一实施例中的相关描述雷同, 这里不再详细描述。  The larger probability value is compared with a probability threshold; when the larger probability value is greater than the probability threshold, hangover processing is performed on a certain number of signal frames following the current signal frame according to the sound category to which the current signal frame belongs. The specific implementation is similar to the related description in the first embodiment, and is not described in detail here.
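The hangover ("trailing") step above can be sketched as follows: when a frame's winning-class probability exceeds the threshold, the next few frames inherit its label. The hangover length and probability threshold are assumed example values.

```python
def apply_hangover(decisions, probs, p_th=0.8, hangover=2):
    """Propagate a confident frame's label to the following `hangover` frames."""
    out = list(decisions)
    i = 0
    while i < len(out):
        if probs[i] > p_th:
            # Confident decision: smear its label over the next frames.
            for j in range(i + 1, min(i + 1 + hangover, len(out))):
                out[j] = out[i]
            i += hangover + 1
        else:
            i += 1
    return out


labels = ["voiced", "noise", "noise", "music"]
probs = [0.9, 0.1, 0.1, 0.5]
print(apply_hangover(labels, probs))  # → ['voiced', 'voiced', 'voiced', 'music']
```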
由上述本发明提供的具体实施方案可以看出, 本发明的实施例在需要进行声音活动检测时提取分类过程所使用的特征参数, 因此不依赖于某一具体的编码算法, 独立进行, 方便了维护和更新。 另外, 本发明的实施例根据提取得到的特征参数以及设定的参数阈值确定所述当前信号帧归属的声音类别, 能将输入的窄带音频或宽带音频数字信号分为静音、 噪声、 浊音、 清音和音乐五类, 其应用在语音编码技术领域中时, 不仅能够作为新开发的变速率音频编码算法和标准的速率选择依据, 还可以为现有没有 VAD 算法的编码标准提供一个速率选择的依据; 由于输出的信号类别比较多, 所以本发明还能够应用于语音增强、 语音识别、 说话人识别等其它语音信号处理领域, 具有很强的通用性。 显然, 本领域的技术人员可以对本发明进行各种修改和变型而不脱离本发明的精神和范围。 这样, 倘若本发明的这些修改和变型属于本发明权利要求及其等同技术的范围之内, 则本发明也意图包含这些改动和变型在内。  As can be seen from the specific embodiments provided above, embodiments of the present invention extract the characteristic parameters used in the classification process when sound activity detection is required, and thus do not depend on any specific coding algorithm; detection is performed independently, which facilitates maintenance and updating. In addition, embodiments of the present invention determine the sound category to which the current signal frame belongs according to the extracted characteristic parameters and the set parameter thresholds, and can classify an input narrowband or wideband digital audio signal into five categories: silence, noise, voiced sound, unvoiced sound and music. When applied in the field of speech coding, this can serve not only as a rate-selection basis for newly developed variable-rate audio coding algorithms and standards, but also as a rate-selection basis for existing coding standards that lack a VAD algorithm. Because many signal categories are output, the present invention can also be applied to other speech signal processing fields such as speech enhancement, speech recognition and speaker recognition, and is therefore highly versatile. Obviously, those skilled in the art can make various modifications and variations to the present invention without departing from the spirit and scope of the present invention. Thus, if these modifications and variations fall within the scope of the claims of the present invention and their technical equivalents, the present invention is also intended to include them.

Claims

权利要求  Claims
1、 一种声音活动检测方法, 其特征在于, 该方法包括: A method for detecting a sound activity, the method comprising:
在需要进行声音活动检测时, 提取当前信号帧的特征参数;  Extracting characteristic parameters of the current signal frame when sound activity detection is required;
根据所述特征参数以及设定的参数阈值确定所述当前信号帧归属的声音类别。  Determining, according to the characteristic parameter and the set parameter threshold, the sound category to which the current signal frame belongs.
2、 如权利要求 1所述的方法, 其特征在于, 在提取当前信号帧的特征参数 的过程之前包括:  2. The method of claim 1 wherein prior to the process of extracting characteristic parameters of the current signal frame comprises:
对当前信号帧依次进行序列分帧处理和快速傅立叶变换 FFT 处理, 得到相应的频域信号。  The current signal frame is sequentially subjected to sequence framing processing and fast Fourier transform FFT processing to obtain a corresponding frequency domain signal.
3、 如权利要求 2所述的方法, 其特征在于, 在提取当前信号帧的特征参数 之前还包括:  3. The method according to claim 2, further comprising: before extracting the feature parameters of the current signal frame:
对当前信号帧进行序列分帧处理后得到的信号帧, 进行预加重处理和 /或加 窗处理。  The signal frame obtained by performing sequence framing processing on the current signal frame is subjected to pre-emphasis processing and/or windowing processing.
4、 如权利要求 1所述的方法, 其特征在于, 所述根据所述特征参数以及设定的参数阈值确定所述当前信号帧归属的声音类别的过程, 具体包括:  The method of claim 1, wherein the process of determining, according to the characteristic parameter and the set parameter threshold, the sound category to which the current signal frame belongs specifically comprises:
根据所述特征参数以及设定的参数阈值, 确定出所述当前信号帧归属的声音类别为噪声帧、 静音帧或非噪声帧; 并当所述当前信号帧为非噪声帧时, 则根据所述特征参数以及设定的参数阈值确定出所述当前信号帧归属的具体声音类别。  Determining, according to the characteristic parameter and the set parameter threshold, that the sound category to which the current signal frame belongs is a noise frame, a silence frame or a non-noise frame; and, when the current signal frame is a non-noise frame, determining the specific sound category to which the current signal frame belongs according to the characteristic parameter and the set parameter threshold.
5、 如权利要求 4所述的方法, 其特征在于, 根据所述特征参数以及设定的参数阈值, 确定出所述当前信号帧归属的声音类别为噪声帧、 静音帧或非噪声帧的过程, 具体包括:  The method according to claim 4, wherein the process of determining, according to the characteristic parameter and the set parameter threshold, that the sound category to which the current signal frame belongs is a noise frame, a silence frame or a non-noise frame specifically comprises:

当未完成噪声参数估计值初始化过程时, 根据所述特征参数以及噪声参数阈值进行噪声严格判定:  When the noise parameter estimate initialization process is not completed, performing a strict noise decision according to the characteristic parameter and the noise parameter threshold:

将所述特征参数与噪声参数阈值比较, 若比较结果属于噪声的范畴, 则判定所述当前信号帧为噪声帧, 然后根据所述特征参数以及静音参数阈值进行静音判定: 将所述特征参数与所述静音参数阈值比较, 并当比较结果属于静音的范畴时, 则判定所述当前信号帧为静音帧; 否则, 判定当前帧为噪声帧, 根据所述当前噪声帧及其之前的噪声帧计算噪声参数估计值;  comparing the characteristic parameter with the noise parameter threshold; if the comparison result belongs to the noise category, determining that the current signal frame is a noise frame, and then performing a silence decision according to the characteristic parameter and the silence parameter threshold: comparing the characteristic parameter with the silence parameter threshold, and when the comparison result belongs to the silence category, determining that the current signal frame is a silence frame; otherwise, determining that the current frame is a noise frame, and calculating a noise parameter estimate according to the current noise frame and the preceding noise frames;

将所述特征参数与所述设定的噪声参数阈值比较, 并当比较结果不属于噪声的范畴时, 则判定所述当前信号帧为非噪声帧。  comparing the characteristic parameter with the set noise parameter threshold, and when the comparison result does not belong to the noise category, determining that the current signal frame is a non-noise frame.
6、 如权利要求 5所述的方法, 其特征在于, 该方法还包括:  6. The method of claim 5, further comprising:
当判定当前帧为噪声帧后, 记录当前判为噪声帧的信号帧的帧数; 当记录 的信号帧数量到达噪声参数估计值初始化需要的帧数量时, 则标志噪声参数估 计值初始化过程完成。  After determining that the current frame is a noise frame, the number of frames of the signal frame currently determined to be a noise frame is recorded; when the number of recorded signal frames reaches the number of frames required for initialization of the noise parameter estimation value, the initialization process of the flag noise parameter estimation value is completed.
7、 如权利要求 4所述的方法, 其特征在于, 所述根据所述特征参数以及设定的参数阈值, 确定出所述当前信号帧归属的声音类别为噪声帧、 静音帧或非噪声帧的过程, 具体包括:  The method according to claim 4, wherein the process of determining, according to the characteristic parameter and the set parameter threshold, that the sound category to which the current signal frame belongs is a noise frame, a silence frame or a non-noise frame specifically comprises:

当噪声参数估计值初始化过程完成后, 计算当前信号帧的特征参数与所述噪声参数估计值之间的频谱距离, 然后根据所述频谱距离与设定的频谱距离阈值, 对当前信号帧进行噪声判定:  after the noise parameter estimate initialization process is completed, calculating the spectral distance between the characteristic parameter of the current signal frame and the noise parameter estimate, and then performing a noise decision on the current signal frame according to the spectral distance and the set spectral distance threshold:

将所述频谱距离与设定的频谱距离阈值比较, 并当比较结果属于噪声的范畴时, 则判定所述当前信号帧为噪声帧, 然后根据所述特征参数以及静音参数阈值进行静音判定: 将所述特征参数与所述静音参数阈值比较, 并当比较结果属于静音的范畴时, 则判定所述当前信号帧为静音帧; 否则, 判定当前帧为噪声帧, 并利用所述当前帧的信号参数更新所述噪声参数估计值;  comparing the spectral distance with the set spectral distance threshold, and when the comparison result belongs to the noise category, determining that the current signal frame is a noise frame, and then performing a silence decision according to the characteristic parameter and the silence parameter threshold: comparing the characteristic parameter with the silence parameter threshold, and when the comparison result belongs to the silence category, determining that the current signal frame is a silence frame; otherwise, determining that the current frame is a noise frame, and updating the noise parameter estimate with the signal parameters of the current frame;
否则, 判定所述当前信号帧为非噪声帧。  Otherwise, it is determined that the current signal frame is a non-noise frame.
8、 如权利要求 5或 7所述的方法, 其特征在于, 该方法还包括:  8. The method according to claim 5 or 7, wherein the method further comprises:
当判定当前信号帧为非噪声时, 计算所述当前信号帧的信噪比 Posterior SNR, 并利用所述 Posterior SNR 调整设定的特征参数的阈值。  When it is determined that the current signal frame is non-noise, the signal-to-noise ratio (Posterior SNR) of the current signal frame is calculated, and the thresholds of the set characteristic parameters are adjusted by using the Posterior SNR.
9、 如权利要求 4所述的方法, 其特征在于, 当当前信号帧为非噪声帧时, 根据所述特征参数以及设定的参数阈值确定出所述当前信号帧归属的声音类别的过程包括: 根据清音参数阈值, 以及所述当前信号帧的特征参数, 判定所述当前信号帧是否为清音:  The method according to claim 4, wherein, when the current signal frame is a non-noise frame, the process of determining the sound category to which the current signal frame belongs according to the characteristic parameter and the set parameter threshold comprises: determining, according to the unvoiced parameter threshold and the characteristic parameter of the current signal frame, whether the current signal frame is unvoiced:

将当前信号帧的特征参数与清音参数阈值比较, 并当比较结果属于清音的范畴时, 则判定所述当前信号帧为清音;  comparing the characteristic parameter of the current signal frame with the unvoiced parameter threshold, and when the comparison result belongs to the unvoiced category, determining that the current signal frame is unvoiced;

否则, 根据浊音参数阈值, 以及所述当前信号帧的特征参数, 判定所述当前信号帧是否为浊音: 将所述当前信号帧的特征参数与所述浊音参数阈值比较, 当比较结果属于浊音的范畴时, 则判定所述当前信号帧为浊音; 否则, 判定所述当前信号帧不属于浊音; 并且根据音乐参数阈值, 以及所述当前信号帧的特征参数, 判定所述当前信号帧是否为音乐: 将所述当前信号帧的特征参数与所述音乐参数阈值比较, 并当比较结果属于音乐的范畴时, 则判定所述当前信号帧为音乐; 否则, 判定所述当前信号帧不属于音乐。  otherwise, determining whether the current signal frame is voiced according to the voiced parameter threshold and the characteristic parameter of the current signal frame: comparing the characteristic parameter of the current signal frame with the voiced parameter threshold, and when the comparison result belongs to the voiced category, determining that the current signal frame is voiced; otherwise, determining that the current signal frame is not voiced; and determining whether the current signal frame is music according to the music parameter threshold and the characteristic parameter of the current signal frame: comparing the characteristic parameter of the current signal frame with the music parameter threshold, and when the comparison result belongs to the music category, determining that the current signal frame is music; otherwise, determining that the current signal frame is not music.
10、 如权利要求 9 所述的方法, 其特征在于, 当所述当前信号帧既属于浊音又属于音乐, 或, 当所述当前信号帧既不属于浊音又不属于音乐时, 所述根据所述特征参数以及设定的参数阈值确定所述当前信号帧归属的声音类别的过程还包括:  The method according to claim 9, wherein, when the current signal frame belongs to both voiced sound and music, or when the current signal frame belongs to neither voiced sound nor music, the process of determining the sound category to which the current signal frame belongs according to the characteristic parameter and the set parameter threshold further comprises:
利用概率模型分别计算所述当前信号帧属于浊音和音乐的概率, 并选择大 的概率值对应的声音类别作为当前信号帧的归属类别。  The probability model is used to calculate the probability that the current signal frame belongs to voiced sound and music, and the sound category corresponding to the large probability value is selected as the attribution category of the current signal frame.
11、 如权利要求 10所述的方法, 其特征在于, 当所述当前信号帧既属于浊音又属于音乐, 或, 当所述当前信号帧既不属于浊音又不属于音乐时, 所述根据所述特征参数以及设定的参数阈值确定所述当前信号帧归属的声音类别的过程还包括:  The method according to claim 10, wherein, when the current signal frame belongs to both voiced sound and music, or when the current signal frame belongs to neither voiced sound nor music, the process of determining the sound category to which the current signal frame belongs according to the characteristic parameter and the set parameter threshold further comprises:

比较所述大的概率值与概率阈值, 当所述大的概率值大于所述概率阈值时, 则根据当前信号帧所归属的声音类别对当前信号帧后续一定数量的信号帧进行拖尾处理。  comparing the larger probability value with a probability threshold, and when the larger probability value is greater than the probability threshold, performing hangover processing on a certain number of signal frames following the current signal frame according to the sound category to which the current signal frame belongs.
12、 一种声音活动检测器, 其特征在于, 该声音活动检测器包括: 特征参数提取模块, 用于在需要进行声音活动检测时, 提取当前信号帧的特征参数; 信号类别判定模块, 用于根据所述特征参数以及设定的参数阈值确定所述当前信号帧归属的声音类别。  A sound activity detector, comprising: a characteristic parameter extraction module, configured to extract the characteristic parameters of a current signal frame when sound activity detection is required; and a signal category determination module, configured to determine, according to the characteristic parameters and the set parameter thresholds, the sound category to which the current signal frame belongs.
13、如权利要求 12所述的检测器, 其特征在于, 该声音活动检测器还包括: 信号预处理模块, 用于对当前信号帧依次进行序列分帧处理和快速傅立叶 变换 FFT处理, 并得到相应的频域信号提供给所述特征参数提取模块以及所述 信号类别判定模块。  The detector of claim 12, wherein the sound activity detector further comprises: a signal preprocessing module, configured to sequentially perform sequence framing processing and fast Fourier transform FFT processing on the current signal frame, and obtain A corresponding frequency domain signal is provided to the feature parameter extraction module and the signal class determination module.
14、 如权利要求 13所述的检测器, 其特征在于, 所述信号预处理模块还用 于:  14. The detector of claim 13, wherein the signal pre-processing module is further configured to:
对当前信号帧进行序列分帧处理后得到的信号帧, 进行预加重处理和 /或加 窗处理。  The signal frame obtained by performing sequence framing processing on the current signal frame is subjected to pre-emphasis processing and/or windowing processing.
15、 如权利要求 12所述的检测器, 其特征在于, 所述信号类别判定模块包 括:  The detector of claim 12, wherein the signal class determination module comprises:
第一信号类别判定子模块, 用于当未完成噪声参数估计值初始化过程时, 根据所述特征参数以及设定的噪声参数阈值进行噪声严格判定:  a first signal category determination submodule, configured to perform a strict noise decision according to the characteristic parameter and the set noise parameter threshold when the noise parameter estimate initialization process is not completed:

若所述特征参数与所述设定的噪声参数阈值比较, 比较结果属于噪声的范畴, 则判定所述当前信号帧为噪声帧, 然后根据所述特征参数以及静音参数阈值进行静音判定, 若所述特征参数与所述静音参数阈值比较, 比较结果属于静音的范畴, 则判定所述当前信号帧为静音帧; 否则, 判定当前帧为噪声帧, 根据所述当前噪声帧及其之前的噪声帧计算噪声参数估计值;  if the characteristic parameter is compared with the set noise parameter threshold and the comparison result belongs to the noise category, the current signal frame is determined to be a noise frame, and a silence decision is then performed according to the characteristic parameter and the silence parameter threshold; if the characteristic parameter is compared with the silence parameter threshold and the comparison result belongs to the silence category, the current signal frame is determined to be a silence frame; otherwise, the current frame is determined to be a noise frame, and the noise parameter estimate is calculated according to the current noise frame and the preceding noise frames;

若所述特征参数与所述设定的噪声参数阈值比较, 比较结果不属于噪声的范畴, 则判定所述当前信号帧为非噪声帧。  if the characteristic parameter is compared with the set noise parameter threshold and the comparison result does not belong to the noise category, the current signal frame is determined to be a non-noise frame.
16、 如权利要求 15所述的检测器, 其特征在于, 所述第一信号类别判定子 模块还用于:  The detector according to claim 15, wherein the first signal class determining sub-module is further configured to:
记录当前判为噪声帧的信号帧的帧数; 当记录的信号帧数量到达噪声参数 估计值初始化需要的帧数量时, 则标志噪声参数估计值初始化过程完成。  The number of frames of the signal frame currently determined as the noise frame is recorded; when the number of recorded signal frames reaches the number of frames required for the initialization of the noise parameter estimation value, the initialization process of the flag noise parameter estimation value is completed.
17、 如权利要求 15所述的检测器, 其特征在于, 所述第一信号类别判定子模块还用于: 当噪声参数估计值初始化过程完成后, 计算当前信号帧的特征参数与所述噪声参数估计值之间的频谱距离, 然后根据所述频谱距离与设定的频谱距离阈值, 对当前信号帧进行噪声判定:  The detector according to claim 15, wherein the first signal category determination submodule is further configured to: after the noise parameter estimate initialization process is completed, calculate the spectral distance between the characteristic parameter of the current signal frame and the noise parameter estimate, and then perform a noise decision on the current signal frame according to the spectral distance and the set spectral distance threshold:

将所述频谱距离与设定的频谱距离阈值比较, 当比较结果属于噪声的范畴时, 根据所述特征参数以及静音参数阈值进行静音判定: 将所述特征参数与所述静音参数阈值比较, 并当比较结果属于静音的范畴时, 则判定所述当前信号帧为静音帧; 否则, 判定所述当前信号帧为噪声帧, 利用所述当前帧的噪声参数更新所述噪声参数估计值;  comparing the spectral distance with the set spectral distance threshold, and when the comparison result belongs to the noise category, performing a silence decision according to the characteristic parameter and the silence parameter threshold: comparing the characteristic parameter with the silence parameter threshold, and when the comparison result belongs to the silence category, determining that the current signal frame is a silence frame; otherwise, determining that the current signal frame is a noise frame, and updating the noise parameter estimate with the noise parameters of the current frame;
否则, 判定所述当前信号帧为非噪声。  Otherwise, it is determined that the current signal frame is non-noise.
18、 如权利要求 15或 17所述的检测器, 其特征在于, 所述第一信号类别 判定子模块还用于:  The detector according to claim 15 or 17, wherein the first signal class determining sub-module is further configured to:
当判定当前信号帧为非噪声时, 计算所述当前信号帧的信噪比 Posterior SNR, 并利用所述 Posterior SNR 调整设定的特征参数的阈值。  When it is determined that the current signal frame is non-noise, the signal-to-noise ratio (Posterior SNR) of the current signal frame is calculated, and the thresholds of the set characteristic parameters are adjusted by using the Posterior SNR.
19、 如权利要求 18所述的检测器, 其特征在于, 所述信号类别判定模块还 包括:  The detector of claim 18, wherein the signal class determination module further comprises:
第二信号类别判定子模块, 用于当当前信号帧为非噪声时, 根据清音参数阈值, 以及所述当前信号帧的特征参数, 判定所述当前信号帧是否为清音: 将当前信号帧的特征参数与清音参数阈值比较, 当比较结果属于清音的范畴时, 则判定所述当前信号帧为清音; 否则, 根据浊音参数阈值, 以及所述当前信号帧的特征参数, 判定所述当前信号帧是否为浊音:  a second signal category determination submodule, configured to determine, when the current signal frame is non-noise, whether the current signal frame is unvoiced according to the unvoiced parameter threshold and the characteristic parameter of the current signal frame: the characteristic parameter of the current signal frame is compared with the unvoiced parameter threshold, and when the comparison result belongs to the unvoiced category, the current signal frame is determined to be unvoiced; otherwise, whether the current signal frame is voiced is determined according to the voiced parameter threshold and the characteristic parameter of the current signal frame:

将所述当前信号帧的特征参数与所述浊音参数阈值比较, 当比较结果属于浊音的范畴时, 则判定所述当前信号帧为浊音; 否则, 判定所述当前信号帧不属于浊音; 并且根据音乐参数阈值, 以及所述当前信号帧的特征参数, 判定所述当前信号帧是否为音乐: 将所述当前信号帧的特征参数与所述音乐参数阈值比较, 当比较结果属于音乐的范畴时, 则判定所述当前信号帧为音乐; 否则判定所述当前信号帧不属于音乐。  comparing the characteristic parameter of the current signal frame with the voiced parameter threshold, and when the comparison result belongs to the voiced category, determining that the current signal frame is voiced; otherwise, determining that the current signal frame is not voiced; and determining whether the current signal frame is music according to the music parameter threshold and the characteristic parameter of the current signal frame: comparing the characteristic parameter of the current signal frame with the music parameter threshold, and when the comparison result belongs to the music category, determining that the current signal frame is music; otherwise, determining that the current signal frame is not music.
20、 如权利要求 19所述的检测器, 其特征在于, 所述第二信号类别判定子模块还用于:  The detector according to claim 19, wherein the second signal category determination submodule is further configured to:

当所述当前信号帧既属于浊音又属于音乐, 或, 当所述当前信号帧既不属于浊音又不属于音乐时, 利用概率模型分别计算所述当前信号帧属于浊音和音乐的概率, 并选择大的概率值对应的声音类别作为当前信号帧的归属类别。  when the current signal frame belongs to both voiced sound and music, or when the current signal frame belongs to neither voiced sound nor music, calculate, using a probability model, the probabilities that the current signal frame belongs to voiced sound and to music respectively, and select the sound category corresponding to the larger probability value as the category of the current signal frame.
21. The detector of claim 20, wherein the second signal class determination sub-module is further configured to:
compare the larger probability value with a probability threshold, and when the larger probability value is greater than the probability threshold, perform hangover processing on a certain number of signal frames following the current signal frame according to the sound category to which the current signal frame belongs.
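The decision cascade of claim 19 (test unvoiced first; if that fails, test voiced and music independently) can be sketched as follows. This is a minimal illustrative sketch, not the patented implementation: the feature names (`zcr`, `energy`) and the idea of per-category (low, high) ranges standing in for the "parameter thresholds" are assumptions for the example.

```python
# Hypothetical sketch of the claim-19 cascade: a non-noise frame is first
# tested against the unvoiced parameter thresholds; only if it is not
# unvoiced are the voiced and music thresholds checked independently.
# Feature names and threshold ranges below are illustrative assumptions.

def classify_non_noise_frame(features, thresholds):
    """Classify a non-noise frame against per-category parameter ranges.

    features:   dict of per-frame characteristic parameters,
                e.g. {"zcr": 0.45, "energy": 0.02}
    thresholds: dict with "unvoiced", "voiced", "music" sub-dicts mapping
                a feature name to a (low, high) range.
    """
    def in_range(kind):
        # A frame matches a category when every compared feature falls
        # inside that category's parameter range.
        return all(lo <= features[name] <= hi
                   for name, (lo, hi) in thresholds[kind].items())

    if in_range("unvoiced"):
        return {"unvoiced": True, "voiced": False, "music": False}

    # Not unvoiced: voiced and music are tested separately, so a frame may
    # match both or neither -- the ambiguous cases handled by claim 20.
    return {"unvoiced": False,
            "voiced": in_range("voiced"),
            "music": in_range("music")}
```

A frame that matches neither voiced nor music (or both) falls through to the probability-model tie-break described in claims 20 and 21.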
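Claims 20 and 21 describe resolving an ambiguous frame with probability models and then gating hangover on the winning probability. The sketch below uses one-dimensional Gaussian class models as the "probability model"; the claims do not specify the model form, so the Gaussian choice, the model parameters, the probability threshold, and the hangover length are all illustrative assumptions.

```python
import math

# Hypothetical sketch of claims 20-21: score an ambiguous frame under a
# per-class probability model (here, assumed 1-D Gaussians), pick the class
# with the larger probability, and apply hangover to the following frames
# only when that probability exceeds a threshold. All numbers are invented.

VOICED_MODEL = {"mean": 0.1, "std": 0.05}   # assumed class model, not from the patent
MUSIC_MODEL = {"mean": 0.4, "std": 0.1}

def gaussian_pdf(x, model):
    z = (x - model["mean"]) / model["std"]
    return math.exp(-0.5 * z * z) / (model["std"] * math.sqrt(2 * math.pi))

def resolve_ambiguous(feature, prob_threshold=1.0, hangover_frames=5):
    """Return (category, number of subsequent frames to tag via hangover)."""
    p_voiced = gaussian_pdf(feature, VOICED_MODEL)
    p_music = gaussian_pdf(feature, MUSIC_MODEL)
    category = "voiced" if p_voiced >= p_music else "music"
    p_max = max(p_voiced, p_music)
    # Claim 21: hangover applies only when the winning probability is
    # larger than the probability threshold.
    hangover = hangover_frames if p_max > prob_threshold else 0
    return category, hangover
```

Gating hangover on the winning probability, as the claim does, keeps low-confidence decisions from being smeared over the frames that follow.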
PCT/CN2007/003364 2006-12-07 2007-11-28 Sound activity detecting method and sound activity detecting device WO2008067719A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN 200610161143 CN101197130B (en) 2006-12-07 2006-12-07 Sound activity detecting method and detector thereof
CN200610161143.6 2006-12-07

Publications (1)

Publication Number Publication Date
WO2008067719A1 (en)

Family

ID=39491655

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2007/003364 WO2008067719A1 (en) 2006-12-07 2007-11-28 Sound activity detecting method and sound activity detecting device

Country Status (2)

Country Link
CN (1) CN101197130B (en)
WO (1) WO2008067719A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8447601B2 (en) 2009-10-15 2013-05-21 Huawei Technologies Co., Ltd. Method and device for tracking background noise in communication system
CN111768801A (en) * 2020-06-12 2020-10-13 瑞声科技(新加坡)有限公司 Airflow noise eliminating method and device, computer equipment and storage medium
CN112992188A (en) * 2012-12-25 2021-06-18 中兴通讯股份有限公司 Method and device for adjusting signal-to-noise ratio threshold in VAD (voice over active) judgment

Families Citing this family (45)

Publication number Priority date Publication date Assignee Title
CN101625862B (en) * 2008-07-10 2012-07-18 新奥特(北京)视频技术有限公司 Method for detecting voice interval in automatic caption generating system
CN101625859B (en) * 2008-07-10 2012-06-06 新奥特(北京)视频技术有限公司 Method for determining waveform slope threshold of short-time energy frequency values in voice endpoint detection
US8380497B2 (en) * 2008-10-15 2013-02-19 Qualcomm Incorporated Methods and apparatus for noise estimation
CN101458943B (en) * 2008-12-31 2013-01-30 无锡中星微电子有限公司 Sound recording control method and sound recording device
CN102044242B (en) 2009-10-15 2012-01-25 华为技术有限公司 Method, device and electronic equipment for voice activation detection
CN102714034B (en) * 2009-10-15 2014-06-04 华为技术有限公司 Signal processing method, device and system
CN102044246B (en) * 2009-10-15 2012-05-23 华为技术有限公司 Method and device for detecting audio signal
CN101895373B (en) * 2010-07-21 2014-05-07 华为技术有限公司 Channel decoding method, system and device
CN101968957B (en) * 2010-10-28 2012-02-01 哈尔滨工程大学 Voice detection method under noise condition
US9111531B2 (en) * 2012-01-13 2015-08-18 Qualcomm Incorporated Multiple coding mode signal classification
US9099098B2 (en) * 2012-01-20 2015-08-04 Qualcomm Incorporated Voice activity detection in presence of background noise
CN103578477B (en) * 2012-07-30 2017-04-12 中兴通讯股份有限公司 Denoising method and device based on noise estimation
TWI612518B (en) 2012-11-13 2018-01-21 三星電子股份有限公司 Encoding mode determination method , audio encoding method , and audio decoding method
CN103065631B (en) 2013-01-24 2015-07-29 华为终端有限公司 A kind of method of speech recognition, device
CN103971680B (en) 2013-01-24 2018-06-05 华为终端(东莞)有限公司 A kind of method, apparatus of speech recognition
CN103646649B (en) * 2013-12-30 2016-04-13 中国科学院自动化研究所 A kind of speech detection method efficiently
CN111312277B (en) 2014-03-03 2023-08-15 三星电子株式会社 Method and apparatus for high frequency decoding of bandwidth extension
CN107086043B (en) * 2014-03-12 2020-09-08 华为技术有限公司 Method and apparatus for detecting audio signal
EP3913628A1 (en) 2014-03-24 2021-11-24 Samsung Electronics Co., Ltd. High-band encoding method
US9697843B2 (en) * 2014-04-30 2017-07-04 Qualcomm Incorporated High band excitation signal generation
CN104409080B (en) * 2014-12-15 2018-09-18 北京国双科技有限公司 Sound end detecting method and device
CN105810201B (en) * 2014-12-31 2019-07-02 展讯通信(上海)有限公司 Voice activity detection method and its system
US9886963B2 (en) * 2015-04-05 2018-02-06 Qualcomm Incorporated Encoder selection
US10186276B2 (en) * 2015-09-25 2019-01-22 Qualcomm Incorporated Adaptive noise suppression for super wideband music
CN106571146B (en) 2015-10-13 2019-10-15 阿里巴巴集团控股有限公司 Noise signal determines method, speech de-noising method and device
CN105609118B (en) * 2015-12-30 2020-02-07 生迪智慧科技有限公司 Voice detection method and device
CN107305774B (en) 2016-04-22 2020-11-03 腾讯科技(深圳)有限公司 Voice detection method and device
CN106354277A (en) * 2016-09-21 2017-01-25 成都创慧科达科技有限公司 Method and system for rapidly inputting phrases and sentences
CN106653047A (en) * 2016-12-16 2017-05-10 广州视源电子科技股份有限公司 Automatic gain control method and device for audio data
CN108242241B (en) * 2016-12-23 2021-10-26 中国农业大学 Pure voice rapid screening method and device thereof
CN107425906B (en) * 2017-07-25 2019-09-27 电子科技大学 Distributing optical fiber sensing signal processing method towards underground pipe network safety monitoring
CN107436451B (en) * 2017-07-26 2019-10-11 西安交通大学 A kind of amplitude spectral method of automatic calculating seismic data optical cable coupled noise degree of strength
CN107657961B (en) * 2017-09-25 2020-09-25 四川长虹电器股份有限公司 Noise elimination method based on VAD and ANN
CN107833579B (en) * 2017-10-30 2021-06-11 广州酷狗计算机科技有限公司 Noise elimination method, device and computer readable storage medium
CN107993071A (en) * 2017-11-21 2018-05-04 平安科技(深圳)有限公司 Electronic device, auth method and storage medium based on vocal print
CN109994129B (en) * 2017-12-29 2023-10-20 阿里巴巴集团控股有限公司 Speech processing system, method and device
CN108831508A (en) * 2018-06-13 2018-11-16 百度在线网络技术(北京)有限公司 Voice activity detection method, device and equipment
CN110085264B (en) * 2019-04-30 2021-10-15 北京如布科技有限公司 Voice signal detection method, device, equipment and storage medium
WO2021041568A1 (en) * 2019-08-27 2021-03-04 Dolby Laboratories Licensing Corporation Dialog enhancement using adaptive smoothing
CN110689905B (en) * 2019-09-06 2021-12-21 西安合谱声学科技有限公司 Voice activity detection system for video conference system
CN110890104B (en) * 2019-11-26 2022-05-03 思必驰科技股份有限公司 Voice endpoint detection method and system
CN111105815B (en) * 2020-01-20 2022-04-19 深圳震有科技股份有限公司 Auxiliary detection method and device based on voice activity detection and storage medium
CN111369982A (en) * 2020-03-13 2020-07-03 北京远鉴信息技术有限公司 Training method of audio classification model, audio classification method, device and equipment
CN112397086A (en) * 2020-11-05 2021-02-23 深圳大学 Voice keyword detection method and device, terminal equipment and storage medium
CN115334349B (en) * 2022-07-15 2024-01-02 北京达佳互联信息技术有限公司 Audio processing method, device, electronic equipment and storage medium

Citations (5)

Publication number Priority date Publication date Assignee Title
US4720862A (en) * 1982-02-19 1988-01-19 Hitachi, Ltd. Method and apparatus for speech signal detection and classification of the detected signal into a voiced sound, an unvoiced sound and silence
US5101434A (en) * 1987-09-01 1992-03-31 King Reginald A Voice recognition using segmented time encoded speech
CN1204766A (en) * 1997-03-25 1999-01-13 皇家菲利浦电子有限公司 Method and device for detecting voice activity
CN1354455A (en) * 2000-11-18 2002-06-19 深圳市中兴通讯股份有限公司 Sound activation detection method for identifying speech and music from noise environment
CN1447963A (en) * 2000-08-21 2003-10-08 康奈克森特系统公司 Method for noise robust classification in speech coding


Non-Patent Citations (1)

Title
BAI L. ET AL.: "Feature Analysis and Extraction for Audio Automatic Classification", MINI-MICRO SYSTEMS, vol. 26, no. 11, November 2005 (2005-11-01), pages 2029 - 2034 *


Also Published As

Publication number Publication date
CN101197130B (en) 2011-05-18
CN101197130A (en) 2008-06-11

Similar Documents

Publication Publication Date Title
WO2008067719A1 (en) Sound activity detecting method and sound activity detecting device
CA2690433C (en) Method and device for sound activity detection and sound signal classification
JP5425682B2 (en) Method and apparatus for robust speech classification
TWI591621B (en) Method of quantizing linear predictive coding coefficients, sound encoding method, method of de-quantizing linear predictive coding coefficients, sound decoding method, and recording medium
KR100964402B1 (en) Method and Apparatus for determining encoding mode of audio signal, and method and appartus for encoding/decoding audio signal using it
JP5325292B2 (en) Method and identifier for classifying different segments of a signal
US8396707B2 (en) Method and device for efficient quantization of transform information in an embedded speech and audio codec
JP6470857B2 (en) Unvoiced / voiced judgment for speech processing
US10141001B2 (en) Systems, methods, apparatus, and computer-readable media for adaptive formant sharpening in linear prediction coding
US7269561B2 (en) Bandwidth efficient digital voice communication system and method
CN107342094B (en) Very short pitch detection and coding
JPH09503874A (en) Method and apparatus for performing reduced rate, variable rate speech analysis and synthesis
CN105103229A (en) Decoder for generating frequency enhanced audio signal, method of decoding, encoder for generating an encoded signal and method of encoding using compact selection side information
KR20080097684A (en) A method for discriminating speech and music on real-time
US10672411B2 (en) Method for adaptively encoding an audio signal in dependence on noise information for higher encoding accuracy
JP4696418B2 (en) Information detection apparatus and method
Bäckström et al. Voice activity detection
Arun Sankar et al. Speech sound classification and estimation of optimal order of LPC using neural network
Srivastava et al. Performance evaluation of Speex audio codec for wireless communication networks
Liu et al. Efficient voice activity detection algorithm based on sub-band temporal envelope and sub-band long-term signal variability
KR100984094B1 (en) A voiced/unvoiced decision method for the smv of 3gpp2 using gaussian mixture model
Park Signal Enhancement of a Variable Rate Vocoder with a Hybrid domain SNR Estimator
Nyshadham et al. Enhanced Voice Post Processing Using Voice Decoder Guidance Indicators
Van Pham et al. Voice activity detection algorithms using subband power distance feature for noisy environments.
Han et al. On A Reduction of Pitch Searching Time by Preprocessing in the CELP Vocoder

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 07816895

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 07816895

Country of ref document: EP

Kind code of ref document: A1