WO2008067719A1 - Sound activity detecting method and sound activity detecting device - Google Patents

Sound activity detecting method and sound activity detecting device Download PDF

Info

Publication number
WO2008067719A1
WO2008067719A1 · PCT/CN2007/003364
Authority
WO
WIPO (PCT)
Prior art keywords
current signal
frame
noise
parameter
signal frame
Prior art date
Application number
PCT/CN2007/003364
Other languages
French (fr)
Chinese (zh)
Inventor
Qin Yan
Haojiang Deng
Jun Wang
Xuewen Zeng
Jun Zhang
Libin Zhang
Original Assignee
Huawei Technologies Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd. filed Critical Huawei Technologies Co., Ltd.
Publication of WO2008067719A1 publication Critical patent/WO2008067719A1/en

Links

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78: Detection of presence or absence of voice signals

Definitions

  • the present invention relates to speech signal processing techniques, and more particularly to a voice activity detection method and a voice activity detector. Background technique
  • VAD Voice Activity Detection
  • When applied in speech recognition technology, it is commonly referred to as Speech Endpoint Detection, and when used in speech enhancement technology, it is commonly referred to as Speech Pause Detection.
  • Voice activity detection technology is primarily developed for speech signals input into the encoder.
  • In speech coding technology, the audio signals input into the encoder are divided into two types, background noise and active speech, which are then encoded at different rates: the background noise at a lower rate and the active speech at a higher rate. This reduces the average bit rate of communication and has promoted the development of variable-rate speech coding technology.
  • The signals input to the encoder are diversifying; they are no longer limited to speech but also include music and various noises. Therefore, before encoding the input signal, it is necessary to differentiate the different input signals so that different code rates, and even encoders with different core coding algorithms, can be used to encode them.
  • The first prior art related to the present invention is the Adaptive Multi-Rate Wideband (AMR-WB+) codec, a multi-rate wideband coding standard developed by the 3rd Generation Partnership Project (3GPP) for the third-generation mobile communication system. It has two core coding algorithms: Algebraic Code Excited Linear Prediction (ACELP) and Transform Coded Excitation (TCX).
  • ACELP Algebraic Code Excited Linear Prediction
  • TCX Transform Coded Excited
  • the ACELP mode is suitable for speech signal coding.
  • The TCX mode is suitable for wideband signals containing music, so the choice between the two modes can be regarded as a choice between speech and music.
  • The mode selection methods for ACELP and TCX in the coding algorithm are of two kinds: open-loop and closed-loop.
  • Closed-loop selection is a traversal-search method based on the perceptually weighted SNR and is independent of the VAD module.
  • Open-loop selection builds on the VAD module of the AMR-WB+ encoding algorithm: short-term and long-term statistics of the feature parameters are added, improvements are made for non-speech features, and the classification of speech and music can be realized to a certain extent. When the ACELP mode has been selected consecutively fewer than three times, a small-scale traversal search is still performed, and since the feature parameters used in the classification are obtained by the coding algorithm, the method is very closely coupled with the AMR-WB+ coding algorithm.
  • The second prior art related to the present invention is the Selectable Mode Vocoder (SMV), a multi-rate voice coding standard developed by the 3rd Generation Partnership Project 2 (3GPP2) for the CDMA2000 system. It offers four encoding rates, 9.6, 4.8, 2.4, and 1.2 kbps (actual net rates of 8.55, 4.0, 2.0, and 0.8 kbps), giving mobile operators a flexible choice between system capacity and voice quality. The algorithm contains a music detection module.
  • SMV multi-rate mode voice coding standard
  • The module calculates the parameters required for music detection from some of the parameters already computed by the VAD module, executes after the VAD detection, and supplements the VAD module's output decision with the calculated music-detection parameters to output a music/non-music classification; it is therefore very closely tied to the coding algorithm.
  • The prior art detects music signals based on the VAD technology in existing speech coding standards and is therefore closely tied to the encoding algorithm; the coupling with the encoder itself is too large, the independence, generality, and maintainability are generally poor, and the cost of porting between codecs is high.
  • Existing VAD algorithms were developed for speech signals, so the input audio signals are divided into only two types, noise and speech (non-noise); even where detection of music signals is included, it is only an amendment and supplement to the VAD decision. Therefore, as codec application scenarios gradually transition from processing speech to processing multimedia audio (including music), and codecs themselves extend from narrowband to wideband, the simple output categories of existing VAD algorithms are clearly insufficient to describe the wide variety of audio signal characteristics. Summary of the invention
  • Embodiments of the present invention provide a voice activity detecting method and a voice activity detector that are capable of independently extracting feature parameters of a signal from an encoding algorithm and using the extracted feature parameters to determine a sound category to which the input signal frame belongs.
  • An embodiment of the present invention provides a sound activity detecting method, including: extracting feature parameters of a current signal frame when sound activity detection is required; and determining, according to the feature parameters and set parameter thresholds, the sound category to which the current signal frame belongs.
  • An embodiment of the present invention also provides a sound activity detector, including:
  • a feature parameter extraction module configured to extract feature parameters in a current signal frame when sound activity detection is required
  • a signal class determining module configured to determine, according to the feature parameter and the set parameter threshold, a sound category to which the current signal frame belongs.
  • When sound activity detection is required, the embodiment of the present invention extracts the feature parameters used in determining the sound category to which the input signal frame belongs; the detection therefore does not depend on a specific coding algorithm, is performed independently, and is easy to maintain and update.
  • Figure 1 is a structural view of a first embodiment of the present invention
  • FIG. 2 is a schematic diagram of the operation of the signal pre-processing module in the first embodiment of the present invention
  • FIG. 3 is a working principle diagram of the first signal class determining sub-module in the first embodiment provided by the present invention
  • FIG. 5 is a schematic diagram showing the operation of the second signal class determining sub-module in determining the uncertain signal in the first embodiment provided by the present invention. Detailed description
  • Embodiments of the present invention first extract characteristic parameters of various audio signals based on characteristics of the signal frames, and then perform a primary classification of the input narrowband or wideband audio digital signal frames according to those parameters, dividing the input signals into non-noise signal frames (i.e., useful signals, including speech and music) and noise or mute signal frames.
  • The signal frames judged to be non-noise are then further classified into voiced, unvoiced, and music signal frames.
  • The first embodiment of the present invention provides a general sound activity detector (GSAD); its structure, shown in FIG. 1, includes a signal preprocessing module, a feature parameter extraction module, and a signal class determination module.
  • the signal class determination module includes a first signal class determination sub-module and a second signal class determination sub-module.
  • the input signal frame enters the signal preprocessing module, and the input digital sound signal sequence is subjected to frequency pre-emphasis and fast Fourier transform (FFT) in the module to prepare for the next feature parameter extraction.
  • FFT fast Fourier transform
  • After the signal is processed by the signal preprocessing module, it is input to the feature parameter extraction module to obtain the feature parameters.
  • To reduce the complexity of the system, all characteristic parameters of the GSAD are extracted from the FFT spectrum.
  • In addition, noise parameters are extracted and updated to calculate the signal-to-noise ratio of the signal, which controls the update of some decision thresholds.
  • The first signal class determining sub-module performs a primary classification of the signal frames supplied by the signal pre-processing module according to the extracted feature parameters, dividing the input signal into non-noise signals (i.e., useful signals, including voice and music) and noise or mute signals. The second signal class determining sub-module then further divides the signals judged non-noise into voiced, unvoiced, and music signals. The final classification results are thus obtained through two levels of classification: noise, mute, voiced, unvoiced, and music.
  • the working principle of the signal preprocessing module is shown in Fig. 2.
  • the input signal is sequentially subjected to framing, pre-emphasis, windowing, FFT transformation and the like.
  • the input digital audio signal sequence is framed.
  • The processed frame length is 10 ms, and the frame shift is also 10 ms, i.e., there is no overlap between frames. If a processing system downstream of this embodiment, for example an encoder, uses a frame length that is a multiple of 10 ms, its input can be processed as a sequence of 10 ms sound frames.
  • Pre-emphasis: assuming the sound sample value at time n is x(n), the sample value x_p(n) obtained after pre-emphasis is given by Equation [1]:
  x_p(n) = x(n) - α · x(n-1)    Equation [1]
  where α (0.9 < α < 1.0) is the pre-emphasis factor.
  • Windowing: a Hamming window of length N is applied, where N takes different values for different sampling frequencies; when the sampling frequency is 8 kHz or 16 kHz, N is 80 or 160, respectively.
  • FFT transform: after Hamming windowing, a standard FFT is performed. At 8 kHz and 16 kHz sampling rates the FFT length is 256, and frames shorter than this are zero-padded; in other cases the length is changed as appropriate.
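The pre-processing chain just described (pre-emphasis per Equation [1], Hamming windowing, zero-padded FFT) can be sketched as follows. This is a minimal illustration, not the patented implementation; the function name, the α value of 0.97, and the use of the magnitude spectrum are assumptions.

```python
import numpy as np

def preprocess_frame(x, alpha=0.97, nfft=256):
    """Pre-emphasis, Hamming window and zero-padded FFT for one 10 ms frame.
    alpha is assumed to be 0.97 (the text only requires 0.9 < alpha < 1.0)."""
    x = np.asarray(x, dtype=float)
    # Pre-emphasis: xp(n) = x(n) - alpha * x(n-1)   (Equation [1])
    xp = np.empty_like(x)
    xp[0] = x[0]
    xp[1:] = x[1:] - alpha * x[:-1]
    # Hamming window of length N (80 at 8 kHz, 160 at 16 kHz)
    xw = xp * np.hamming(len(xp))
    # Zero-pad to the FFT length and keep the magnitude spectrum
    return np.abs(np.fft.rfft(xw, n=nfft))

frame = np.random.randn(160)      # one 10 ms frame at a 16 kHz sampling rate
spec = preprocess_frame(frame)
print(spec.shape)                 # (129,) bins, i.e. nfft // 2 + 1
```

All later feature parameters are computed from magnitude or power spectra of this form.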
  • the main function of the feature parameter extraction module is to extract the characteristic parameters of the input signal, mainly the spectral parameters.
  • the spectral parameters include: short-term feature parameters and their class length characteristics.
  • The short-term characteristic parameters include: spectral flux, 95% spectral rolloff, zero-crossing rate (zcr), intra-frame spectral variance, and the ratio of low-frequency sub-band energy to full-band energy;
  • The long-term features are the variance and moving average of each short-term feature parameter; the number of statistical frames is 10 in one embodiment of the present invention, i.e., a duration of 100 ms.
  • x(i) represents the i-th time-domain sample of a frame of the sound signal, where 0 ≤ i < M; T represents the number of frames; M represents the number of samples per frame; N represents the FFT length; U_pw(k) represents the amplitude of the current frame at frequency bin k after the FFT; var represents the variance of a characteristic parameter of the current signal frame.
  • var_flux: the variance of the spectral flux (var_flux) is calculated as shown in Equation [5]:
  var_flux(i) = (1/10) · Σ_{j=i-10..i} ( flux(j) - mean_flux(i) )²    Equation [5]
  where mean_flux(i) represents the mean value of the normalized spectral flux parameter from frame i-10 to frame i.
  • Rolloff represents the position of the frequency at which the energy accumulated from the low frequency to the high frequency accounts for 95% of the total energy.
  • rolloff_var: the variance of the 95% spectral rolloff (rolloff_var) is calculated as in Equation [7]:
  rolloff_var(i) = (1/10) · Σ_{j=i-10..i} ( rolloff(j) - mean_rolloff(i) )²    Equation [7]
  where mean_rolloff(i) represents the mean value of the 95% rolloff parameter from frame i-10 to frame i.
  • Rl_F1 represents the lower limit of the low-frequency sub-band; Rl_F2 represents the upper limit of the low-frequency sub-band.
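Under the definitions above, the short-term parameters (spectral flux, 95% rolloff, zero-crossing rate, intra-frame spectral variance, low-band energy ratio) and the 10-frame long-term statistics can be sketched as below. The low-band upper limit of 1000 Hz, the use of bin 0 as the lower limit, and all normalizations are assumptions; the patent does not fix Rl_F1/Rl_F2 here.

```python
import numpy as np

def short_term_features(spec, prev_spec, frame, sr=16000, low_band_hz=1000.0):
    """Illustrative short-term parameters; low_band_hz (the sub-band upper
    limit Rl_F2, with Rl_F1 taken as bin 0) is an assumed value."""
    eps = 1e-12
    p = spec / (spec.sum() + eps)                     # normalized spectra
    q = prev_spec / (prev_spec.sum() + eps)
    flux = float(np.sum((p - q) ** 2))                # spectral flux
    csum = np.cumsum(spec ** 2)
    rolloff = int(np.searchsorted(csum, 0.95 * csum[-1]))      # 95% rolloff bin
    zcr = float(np.mean(np.abs(np.diff(np.sign(frame))) > 0))  # zero-crossing rate
    var_in = float(np.var(spec))                      # intra-frame spectral variance
    low = int(low_band_hz * len(spec) / (sr / 2))
    low_ratio = float((spec[:low] ** 2).sum() / (csum[-1] + eps))
    return flux, rolloff, zcr, var_in, low_ratio

def long_term(history):
    """Variance and moving average of a short-term parameter over 10 frames."""
    h = np.asarray(history[-10:], dtype=float)
    return float(np.var(h)), float(np.mean(h))

rng = np.random.default_rng(0)
frame = rng.standard_normal(160)
spec = np.abs(np.fft.rfft(frame, 256))
prev_spec = np.abs(np.fft.rfft(rng.standard_normal(160), 256))
flux, rolloff, zcr, var_in, low_ratio = short_term_features(spec, prev_spec, frame)
```

The long-term variance and moving average would be maintained per feature over a sliding 100 ms window.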
  • The feature parameters are extracted by a separate module rather than during the encoding algorithm, so the feature parameter extraction module does not depend on any existing encoder. Moreover, since the feature parameter extraction does not depend on the bandwidth, the GSAD does not depend on the signal sampling rate, which greatly enhances the portability of the system.
  • The function of the first signal class decision sub-module is to classify the input digital sound signals into three categories: mute, noise signals, and non-noise signals (i.e., useful signals). This is mainly accomplished through noise parameter initialization, noise determination, and noise update. Before initializing the noise parameters, the duration requirement of the initialization process is adjusted according to the current environment (speech/music): it is shortened when the current environment is speech and extended when the current environment is music.
  • the working principle of the first signal class determination sub-module is shown in Figure 3:
  • The current signal frame is strictly determined according to its characteristic parameters and the noise parameter thresholds: the characteristic parameters of the current signal frame are compared with the noise parameter thresholds; if the comparison result falls in the noise category, the strict decision is that the current signal frame is a noise frame; otherwise, the strict decision is that the current frame is a non-noise frame (i.e., a useful signal).
  • For example, the characteristic parameter compared with the noise parameter threshold may be the variance of the spectral amplitude of the current signal frame (magvar).
  • If the variance of the spectral amplitude of the current signal frame is smaller than the noise parameter threshold, the strict decision result is that the current signal frame is a noise frame; otherwise, the strict decision result is that the current frame is a non-noise frame (i.e., a useful signal).
  • The SNR is used to adjust the thresholds of the various characteristic parameters for mute, noise, unvoiced, voiced, and music.
  • PosteriorSNR = (1/K) · Σ_{k=1..K} U_pw(k) / σ_n²(k)    Equation [11]
  where σ_n²(k) represents the variance of the noise in sub-band k and K is the number of sub-bands.
  • The purpose of adaptively adjusting and updating the feature parameter thresholds is to enable the decision process to reach the same decision result under different SNR conditions. For the same signal, under different signal-to-noise ratios (reflected by the Posterior SNR), the values of the same characteristic parameters differ; that is, the values of the characteristic parameters are affected by the signal-to-noise ratio. Therefore, to reach the same decision result under different signal-to-noise ratios, the decision threshold of each feature parameter is adaptively updated according to the signal-to-noise ratio of the current signal frame, the specific update depending on how the corresponding feature parameter is actually affected by the signal-to-noise ratio.
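A hedged sketch of the Posterior SNR of Equation [11] and the SNR-driven threshold adaptation: the exact formulas are garbled in the source, so the averaging form and the linear adaptation law below are assumptions.

```python
import numpy as np

def posterior_snr(subband_pw, noise_var):
    """Posterior SNR per the reconstruction of Equation [11]: mean ratio of
    the current sub-band power to the estimated noise variance per sub-band."""
    return float(np.mean(np.asarray(subband_pw) / (np.asarray(noise_var) + 1e-12)))

def adapt_threshold(base_thr, snr, slope=0.1, ref_snr=10.0):
    """Illustrative linear adaptation; the patent only states that each
    feature's threshold update depends on how the SNR affects that feature."""
    return base_thr * (1.0 + slope * (snr - ref_snr) / ref_snr)

snr = posterior_snr(np.full(16, 4.0), np.full(16, 2.0))   # roughly 2.0
thr = adapt_threshold(0.5, snr)                           # threshold lowered at low SNR
```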
  • Otherwise, mute determination continues according to the characteristic parameters of the current signal frame and the mute parameter threshold: the signal energy of the current signal frame is compared with a mute threshold. If it is less than the mute threshold, the current signal frame is determined to be mute and the mute flag is output. If it is greater than the mute threshold, the current signal frame is not mute but a noise frame, so the noise flag is output, the noise parameter estimates are initialized from the current noise frame and the previous noise frames, and the number of frames so far determined to be noise frames is recorded; when the recorded number reaches the number of frames required for initialization of the noise parameter estimates, the initialization process is flagged as complete.
  • Initializing the noise parameter estimates involves the mean μ_n and the variance σ_n² of the noise spectrum, calculated as shown in Equations [12] and [13]:
  μ_n = (1/T) · Σ_{t=1..T} U_pw(t)    Equation [12]
  σ_n² = (1/T) · Σ_{t=1..T} U_pw²(t) - μ_n²    Equation [13]
  where U_pw(t) in Equations [12] and [13] is the vector of sub-band powers of signal frame t.
  • After the initialization of the noise parameter estimates is complete, the spectral distance between the characteristic parameters of the current signal frame and the noise parameter estimates is calculated, and noise determination is performed according to this spectral distance: the calculated spectral distance is compared with the spectral distance threshold. If it is less than the set spectral distance threshold, mute determination proceeds according to the characteristic parameters of the current signal frame and the mute parameter threshold; that is, the signal energy of the current signal frame is compared with the mute threshold. If it is less than the mute threshold, the current signal frame is determined to be mute and the mute flag is output; if it is greater than the mute threshold, the current signal frame is not mute but a noise frame, so the noise flag is output, and the spectral mean and variance of the current signal frame are used to update the noise parameter estimates, which are then output, the updates taking the recursive forms given in the corresponding equations.
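The initialization/decision/update loop described in the last few paragraphs can be sketched as a small state machine. The smoothing factor, the normalized spectral-distance measure, and all thresholds below are assumptions; the patent's Equations [12] and [13] fix only the batch mean and variance used for initialization.

```python
import numpy as np

class NoiseTracker:
    """Sketch of the first-level loop: batch-initialize the noise mean and
    variance over the first noise frames (Equations [12], [13]), then use a
    normalized spectral distance to classify frames and recursively update
    the estimate. lam, dist_thr and mute_thr are assumed values."""

    def __init__(self, init_frames=20, lam=0.05, dist_thr=1.5, mute_thr=1e-4):
        self.buf = []
        self.init_frames = init_frames
        self.lam, self.dist_thr, self.mute_thr = lam, dist_thr, mute_thr
        self.mu = None
        self.var = None

    def classify(self, u_pw):
        u_pw = np.asarray(u_pw, dtype=float)
        if float(u_pw.sum()) < self.mute_thr:       # mute check first
            return "mute"
        if self.mu is None:                         # initialization phase
            self.buf.append(u_pw)
            if len(self.buf) == self.init_frames:
                b = np.asarray(self.buf)
                self.mu, self.var = b.mean(axis=0), b.var(axis=0)
            return "noise"
        # Normalized spectral distance to the current noise estimate
        d = float(np.mean(np.abs(u_pw - self.mu) / np.sqrt(self.var + 1e-12)))
        if d < self.dist_thr:
            # Recursive smoothing update of the noise mean and variance
            self.mu = (1 - self.lam) * self.mu + self.lam * u_pw
            self.var = (1 - self.lam) * self.var + self.lam * (u_pw - self.mu) ** 2
            return "noise"
        return "non-noise"

tracker = NoiseTracker(init_frames=3)
for _ in range(3):
    tracker.classify(np.full(8, 0.01))          # initialization frames
label = tracker.classify(np.full(8, 1.0))       # a much louder frame
```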
  • Otherwise, the current signal frame is a non-noise frame: the Posterior SNR of the current signal frame is calculated using Equation [11], the thresholds of the characteristic parameters are adjusted using the calculated Posterior SNR, and a non-noise (useful signal) flag is output.
  • The second signal class decision sub-module: if the first signal class determination sub-module judges the current signal frame to be a noise frame, the decision result is output directly. If it is judged to be a non-noise frame, the current signal frame enters the second signal class determination sub-module for classification into voiced, unvoiced, and music signals.
  • The specific decision can be made in two steps. In the first step, the signal is strictly judged according to the characteristics of the feature parameters, and the non-noise signal is classified as voiced, unvoiced, or music; the judgment method used is mainly hard decision (threshold decision).
  • The second step mainly handles the uncertain signals that are judged to be both voiced and music, or neither voiced nor music.
  • A probability model is used to calculate the probabilities that the uncertain signal belongs to the voiced and music classes, and the most probable class is taken as the final classification of the uncertain signal.
  • The probability model may be a Gaussian mixture model (GMM), whose input parameters are those extracted by the feature parameter extraction module.
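As an illustration of the second-step probabilistic decision, the sketch below scores an uncertain frame's feature vector under two diagonal-covariance Gaussian mixture models and picks the more likely class. All mixture weights, means, and variances are placeholder values; in practice they would be trained on labelled voiced and music data.

```python
import numpy as np

def gmm_loglik(x, weights, means, variances):
    """Log-likelihood of x under a diagonal-covariance GMM."""
    x = np.asarray(x, dtype=float)
    ll = []
    for w, m, v in zip(weights, means, variances):
        m, v = np.asarray(m, float), np.asarray(v, float)
        comp = -0.5 * np.sum(np.log(2 * np.pi * v) + (x - m) ** 2 / v)
        ll.append(np.log(w) + comp)
    ll = np.asarray(ll)
    top = ll.max()
    return top + np.log(np.exp(ll - top).sum())   # log-sum-exp over components

# Placeholder models; real parameters would come from training data.
voiced_gmm = dict(weights=[0.5, 0.5], means=[[0.2, 0.1], [0.4, 0.2]],
                  variances=[[0.05, 0.02], [0.05, 0.02]])
music_gmm = dict(weights=[1.0], means=[[0.8, 0.6]], variances=[[0.1, 0.1]])

def classify_uncertain(features):
    """Pick the class (voiced vs. music) with the larger likelihood."""
    lv = gmm_loglik(features, **voiced_gmm)
    lm = gmm_loglik(features, **music_gmm)
    return "voiced" if lv >= lm else "music"

label = classify_uncertain([0.25, 0.12])
```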
  • The decision process of the first step is shown in FIG. 4: first the feature parameters of the non-noise frame output by the first signal class determination sub-module are extracted, and then the feature parameters of the non-noise signal frame are compared with the unvoiced parameter threshold:
  • The characteristic parameter used in determining unvoiced sound may be the zero-crossing rate (zcr); if the zero-crossing rate is greater than the unvoiced parameter threshold, the non-noise signal frame is determined to be unvoiced, and the unvoiced signal flag is output.
  • If the comparison of the characteristic parameters of the non-noise signal frame with the unvoiced parameter threshold does not place it in the unvoiced category, determination continues as to whether the non-noise signal frame is voiced: if the comparison of its characteristic parameters with the voiced parameter threshold places it in the voiced category, the non-noise frame is determined to be voiced and the voiced signal flag is set to 1; otherwise, the non-noise frame is determined not to be voiced and the voiced signal flag is set to 0.
  • The characteristic parameters used for voiced sound may be the spectral flux and its variance (var_flux); if the spectral flux is greater than the corresponding voiced parameter threshold, or the variance (var_flux) is greater than its corresponding voiced parameter threshold, the frame is determined to be voiced and the voiced signal flag is set to 1; otherwise, the non-noise frame is determined not to be voiced and the voiced signal flag is set to 0.
  • The characteristic parameter used in determining music may be the moving average of var_flux (varmov_flux).
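The first-step hard decision described above (zcr for unvoiced, flux/var_flux for voiced, varmov_flux for music) can be sketched as a threshold cascade. Every threshold value, and the assumed direction of the varmov_flux comparison for music, are illustrative guesses; the patent gives no numeric thresholds here.

```python
def first_step_decision(zcr, flux, var_flux, varmov_flux,
                        zcr_thr=0.35, flux_thr=0.1,
                        var_flux_thr=0.02, varmov_flux_thr=0.01):
    """Sketch of the first-step hard decision; all thresholds are assumed.
    Returns (label, voiced_flag, music_flag); a frame judged both or neither
    voiced and music is left 'uncertain' for the second, probabilistic step."""
    if zcr > zcr_thr:                                  # unvoiced test first
        return "unvoiced", 0, 0
    voiced = 1 if (flux > flux_thr or var_flux > var_flux_thr) else 0
    music = 1 if varmov_flux < varmov_flux_thr else 0  # direction assumed
    if voiced and not music:
        return "voiced", 1, 0
    if music and not voiced:
        return "music", 0, 1
    return "uncertain", voiced, music
```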
  • If a signal is judged both voiced and music, or neither, it is judged an uncertain signal; the second-step auxiliary decision method, such as probabilistic judgment, then continues the decision on the uncertain signal and classifies it as either voiced or music, so that the non-noise signals are finally divided into voiced, unvoiced, and music.
  • The probabilistic model is used to calculate the probabilities that the uncertain signal frame belongs to the voiced and music classes, and the sound category with the maximum probability is taken as the final classification of the uncertain signal frame; the type flag of the uncertain signal frame is then modified, and the type flag of the signal frame is output.
  • Further, the calculated maximum probability may be compared with a set probability threshold pth; if the maximum probability exceeds pth, the class is carried over (smeared) to the signal frames following the non-noise frame; otherwise, no smearing is performed.
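One plausible reading of the "smearing" step is a hangover: when the winning probability of a non-noise frame exceeds pth, the following frames inherit its class. The sketch below follows that reading; pth, the hangover length, and the exclusion of noise/mute frames are all assumptions.

```python
def smooth_labels(labels, max_prob, pth=0.9, hangover=3):
    """Illustrative class 'smearing': a confident non-noise decision
    (probability >= pth) is carried over to the next `hangover` frames."""
    out = list(labels)
    carry, cls = 0, None
    for i, (lab, p) in enumerate(zip(labels, max_prob)):
        if carry > 0 and lab not in ("noise", "mute"):
            out[i] = cls          # inherit the confident class
            carry -= 1
        if p >= pth and lab not in ("noise", "mute"):
            carry, cls = hangover, out[i]
    return out

labs = ["music", "voiced", "voiced", "noise", "voiced"]
probs = [0.95, 0.5, 0.5, 0.5, 0.5]
smoothed = smooth_labels(labs, probs)
```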
  • The characteristic parameter used when discriminating the sound category of the current signal frame may be any one of the characteristic parameters listed above, or a combination of them; it is only necessary to use these feature parameters together with the feature parameter thresholds to determine the sound category to which the current signal frame belongs, without departing from the idea of the present invention.
  • The second embodiment provided by the present invention is a sound activity detecting method, whose main idea is: extracting the feature parameters of the current signal frame, and determining, according to the feature parameters and the set parameter thresholds, the sound category to which the current signal frame belongs.
  • the specific implementation process includes the following contents:
  • Framing, pre-emphasis, windowing, and fast Fourier transform (FFT) processing are performed in sequence on the current signal frame to obtain the corresponding frequency-domain signal, and the feature parameters of the resulting frequency-domain signal frame are then extracted. The pre-emphasis enhances the spectrum of the input current signal frame, and the windowing reduces the discontinuity of the signal at the beginning and end of the frame.
  • If the noise parameter estimate initialization process is not complete, the noise is strictly determined according to the feature parameters and the set noise parameter thresholds:
  • If the comparison of the feature parameters with the set noise parameter thresholds does not place the frame in the noise category, the current signal frame is determined to be a non-noise frame, the Posterior SNR of the current signal frame is calculated, and the set feature parameter thresholds are adjusted using the Posterior SNR.
  • the specific implementation is similar to the related description in the first embodiment, and will not be described in detail herein.
  • If the spectral distance is less than the set spectral distance threshold, the current signal frame is determined to be a noise frame, and mute determination proceeds according to the characteristic parameters of the current signal frame and the mute parameter threshold: the signal energy of the current signal frame is compared with a mute threshold. If it is less than the mute threshold, the current signal frame is determined to be mute and the mute flag is output; if it is greater than the mute threshold, the current signal frame is not mute but a noise frame, so a noise flag is output and the noise parameters of the current frame are used to update the noise parameter estimates;
  • Otherwise, the Posterior SNR of the current signal frame is calculated, and the set feature parameter thresholds are adjusted using the Posterior SNR.
  • the specific implementation is similar to the related description in the first embodiment, and will not be described in detail herein.
  • Whether the current signal frame is voiced is determined according to the voiced parameter threshold and the characteristic parameters of the current signal frame: the feature parameters of the current signal frame are compared with the voiced parameter threshold; when the comparison result falls in the voiced category, the current signal frame is determined to be voiced; otherwise, it is determined not to be voiced. Whether the current signal frame is music is determined according to the music parameter threshold and the characteristic parameters of the current signal frame: the feature parameters of the current signal frame are compared with the music parameter threshold; when the comparison result falls in the music category, the current signal frame is determined to be music; otherwise, it is determined not to be music.
  • The probability model is used to calculate the probabilities that the current signal frame belongs to the voiced and music classes, and the sound category with the larger probability is selected as the category of the current signal frame.
  • the specific implementation is similar to the related description in the first embodiment and will not be described in detail herein.
  • When sound activity detection is required, embodiments of the present invention extract the feature parameters used in the classification process, and thus do not depend on a specific coding algorithm; the detection is independent and easy to maintain and update.
  • The embodiment of the present invention determines the sound category to which the current signal frame belongs according to the extracted feature parameters and the set parameter thresholds, and can divide the input narrowband or wideband audio digital signal into five categories: mute, noise, voiced, unvoiced, and music. When applied in the field of speech coding, this can serve both as the basis for rate selection in newly developed variable-rate audio coding algorithms and standards, and as a rate-selection mechanism for existing coding standards without a VAD algorithm.
  • the present invention can also be applied to other speech signal processing fields such as speech enhancement, speech recognition, speaker recognition, etc., and has strong versatility.
  • It is intended that the present invention cover the modifications and variations of the invention.

Abstract

A sound activity detecting method and a sound activity detecting device. The core of the method and device is as follows: when sound activity needs to be detected, the feature parameters of the current signal frame are extracted, and the sound class of the current signal frame is determined according to the feature parameters and the set parameter threshold values.

Description

Sound activity detection method and sound activity detector

Technical Field
The present invention relates to speech signal processing techniques, and more particularly to a sound activity detection method and a sound activity detector.

Background Art
In the field of speech signal processing, there is a technology for detecting voice activity. When it is applied in speech coding technology, it is usually called Voice Activity Detection (VAD); when it is applied in speech recognition technology, it is commonly referred to as Speech Endpoint Detection; and when it is used in speech enhancement technology, it is commonly referred to as Speech Pause Detection. These technologies have different focuses in different application scenarios and produce different processing results, but their essence is the same: to detect whether speech is present during voice communication. The accuracy of the detection results directly affects the quality of subsequent processing, such as speech coding, speech recognition, and speech enhancement.
Voice activity detection technology was developed mainly for the speech signals input to an encoder. In speech coding, the audio signal input to the encoder is divided into two types, background noise and active speech, which are then encoded at different rates: background noise at a lower rate and active speech at a higher rate. This reduces the average bit rate of communication and has promoted the development of variable-rate speech coding. However, as coding technology develops toward multiple rates and wider bandwidths, the signals input to the encoder become diversified: they are no longer limited to speech but also include music and various kinds of noise. Therefore, before encoding, the different input signals need to be distinguished so that different bit rates, or even encoders with different core coding algorithms, can be used to encode them.
A first prior art related to the present invention is the wideband Adaptive Multi-Rate speech coder (AMR-WB+), a multi-rate coding standard developed by the 3rd Generation Partnership Project (3GPP) for, but not limited to, third-generation mobile communication systems. It has two core coding algorithms: Algebraic Code Excited Linear Prediction (ACELP) and Transform Coded Excitation (TCX). The ACELP mode is suitable for speech signals, while TCX is suitable for wideband signals containing music, so the choice between the two modes can be regarded as a choice between speech and music. The coding algorithm selects between the ACELP and TCX modes in either open-loop or closed-loop fashion. Closed-loop selection is an exhaustive search based on a perceptually weighted signal-to-noise ratio and is independent of the VAD module. Open-loop selection builds on the VAD module of the AMR-WB+ coding algorithm by adding short-term and long-term statistics of the feature parameters and improving the handling of non-speech features, and can classify speech and music to a certain extent. Moreover, when the ACELP mode has been selected consecutively fewer than three times, a small-scale exhaustive search is still performed, and since the feature parameters used for classification are all obtained through the coding algorithm, this method is very closely coupled to the AMR-WB+ coding algorithm.
A second prior art related to the present invention is the Selectable Mode Vocoder (SMV), a multi-rate speech coding standard developed by the Third Generation Partnership Project 2 (3GPP2) for the CDMA2000 system. It offers four encoding rates, 9.6, 4.8, 2.4 and 1.2 kbps (actual net rates of 8.55, 4.0, 2.0 and 0.8 kbps), allowing mobile operators to trade off flexibly between system capacity and voice quality, and its algorithm contains a music detection module. This module uses some of the parameters calculated by the VAD module to further compute the parameters needed for music detection, and it runs after VAD detection: based on the output decision of the VAD module and the computed music-detection parameters, it makes a supplementary judgment and outputs a music/non-music classification result. It is therefore also very closely coupled to the coding algorithm.
As can be seen from the prior art, music signals are detected on the basis of the VAD techniques in existing speech coding standards, so the detection is closely tied to the coding algorithm. That is, the coupling with the encoder itself is too strong; independence, generality and maintainability are generally poor; and the cost of porting between codecs is high.
In addition, existing VAD algorithms were all developed for speech signals, so they divide the input audio signal into only two types: noise and speech (non-noise). Even when music detection is included, it serves only as a correction and supplement to the VAD decision. As codec application scenarios gradually shift from processing mainly speech to processing multimedia sound (including music), and codec algorithms themselves extend from narrowband to wideband, the simple output categories of existing VAD algorithms are clearly insufficient to describe the wide variety of audio signal characteristics.

Summary of the Invention
Embodiments of the present invention provide a sound activity detection method and a sound activity detector that can extract the feature parameters of a signal independently of the coding algorithm and use the extracted feature parameters to determine the sound class to which the input signal frame belongs.
The embodiments of the present invention are implemented by the following technical solutions.

An embodiment of the present invention provides a sound activity detection method, which includes:

extracting feature parameters from the current signal frame when sound activity detection is required; and

determining the sound class to which the current signal frame belongs according to the feature parameters and set parameter thresholds.
An embodiment of the present invention further provides a sound activity detector, which includes:

a feature parameter extraction module, configured to extract feature parameters from the current signal frame when sound activity detection is required; and

a signal class determination module, configured to determine the sound class to which the current signal frame belongs according to the feature parameters and set parameter thresholds.
As can be seen from the specific embodiments provided above, the embodiments of the present invention extract the feature parameters used to determine the sound class of the input signal frame when sound activity detection is required; the extraction therefore does not depend on any specific coding algorithm and is performed independently, which facilitates maintenance and updating.

Brief Description of the Drawings
Figure 1 is a structural diagram of a first embodiment provided by the present invention;

Figure 2 is a schematic diagram of the operation of the signal preprocessing module in the first embodiment provided by the present invention;

Figure 3 is a schematic diagram of the operation of the first signal class determination sub-module in the first embodiment provided by the present invention;

Figure 4 is a schematic diagram of the operation of the second signal class determination sub-module in the first embodiment when determining the class of a non-noise signal;

Figure 5 is a schematic diagram of the operation of the second signal class determination sub-module in the first embodiment when judging an uncertain signal.

Detailed Description
Speech signals, noise signals and music signals have different distribution characteristics in the spectrum, and the frame-to-frame variations of speech, music and noise sequences each have their own characteristics as well. Embodiments of the present invention therefore first extract the feature parameters of various audio signals based on the characteristics of the signal frames, and then perform a primary classification of the input narrowband or wideband digital audio signal frames according to these parameters, dividing the input signal into non-noise signal frames (i.e. useful signals, including speech and music), noise frames and silence frames. The frames judged to be non-noise are then further classified into voiced, unvoiced and music signal frames.
The first embodiment provided by the present invention is a General Sound Activity Detector (GSAD), whose structure is shown in Figure 1 and which includes a signal preprocessing module, a feature parameter extraction module and a signal class determination module. The signal class determination module includes a first signal class determination sub-module and a second signal class determination sub-module.
The signal transfer relationships between the modules are as follows:
The input signal frame enters the signal preprocessing module, where the input digital sound signal sequence undergoes spectral pre-emphasis and a Fast Fourier Transform (FFT) in preparation for the subsequent feature parameter extraction.
After being processed by the signal preprocessing module, the signal is input to the feature parameter extraction module to obtain the feature parameters. To reduce system complexity, all feature parameters of the GSAD are extracted from the FFT spectrum. This module also extracts and updates noise parameters in order to compute the signal-to-noise ratio of the signal, which controls the updating of some decision thresholds.
In the signal class determination module, the first signal class determination sub-module first performs a primary classification of the signal frames from the signal preprocessing module according to the extracted feature parameters, dividing the input signal into non-noise signals (i.e. useful signals, including speech and music) and noise or silence signals. Then, in the second signal class determination sub-module, the signals judged to be non-noise by the first sub-module are further classified into voiced, unvoiced and music signals. Through this two-stage classification, the final classification result is obtained: noise, silence, voiced, unvoiced or music.
The specific processing of each module is described below.
I. Signal preprocessing module

The working principle of the signal preprocessing module is shown in Figure 2: the input signal is subjected in turn to framing, pre-emphasis, windowing and FFT transformation.
Framing: the input digital sound signal sequence is divided into frames. The processed frame length is 10 ms and the frame shift is also 10 ms, i.e. there is no overlap between frames. If the subsequent processing system of this embodiment, such as an encoder, has a processing frame length that is a multiple of 10 ms, the signal can be split into 10 ms sound frames for processing.
Pre-emphasis: assuming the sound sample value at time n is x(n), the speech sample value xp obtained after pre-emphasis is given by Equation [1]:

xp(n) = x(n) − α·x(n−1)    Equation [1]

where α (0.9 < α < 1.0) is the pre-emphasis factor.
Windowing: windowing reduces the discontinuity of the signal at the start and end of each frame. The pre-emphasized speech samples xp are multiplied frame by frame with a Hamming window, as in Equation [2]:

xw(n) = w(n) · xp(n)    Equation [2]

where 0 ≤ n ≤ N − 1 and w(n) is the Hamming window function:

w(n) = 0.54 − 0.46·cos(2πn / (N − 1)),  0 ≤ n ≤ N − 1    Equation [3]

where N is the window length of the Hamming window and takes different values for different sampling frequencies; for embodiments with sampling rates of 8 kHz and 16 kHz, N is 80 and 160 respectively.
FFT spectral transform: after Hamming windowing, a standard FFT spectral transform is performed. At the 8 kHz and 16 kHz sampling rates the transform window length is 256, with zero padding where the frame is shorter; in other cases the transform is adapted as appropriate.
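For illustration, the framing-related preprocessing steps above can be sketched as follows. This is a minimal sketch, not the patent's implementation: the pre-emphasis factor value and the handling of the first sample of each frame (which has no predecessor) are assumptions.

```python
import math

def preprocess_frame(samples, alpha=0.95):
    """Pre-emphasize and Hamming-window one frame (Equations [1]-[3]).

    alpha (0.9 < alpha < 1.0) is the pre-emphasis factor; the first
    sample of the frame is left unchanged, an assumption the patent
    does not spell out.
    """
    N = len(samples)
    # Equation [1]: xp(n) = x(n) - alpha * x(n-1)
    xp = [samples[0]] + [samples[n] - alpha * samples[n - 1] for n in range(1, N)]
    # Equation [3]: Hamming window w(n) = 0.54 - 0.46 * cos(2*pi*n / (N-1))
    w = [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]
    # Equation [2]: xw(n) = w(n) * xp(n)
    return [w[n] * xp[n] for n in range(N)]

frame = [1.0] * 160  # a dummy 10 ms frame at 16 kHz (160 samples)
xw = preprocess_frame(frame)
```

At 16 kHz the window length N = 160 matches the 10 ms frame length stated above; the windowed frame would then be zero-padded to 256 points before the FFT.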
II. Feature parameter extraction module

The main function of the feature parameter extraction module is to extract the feature parameters of the input signal, mainly spectral parameters. The spectral parameters include short-term feature parameters and their quasi-long-term features. The short-term feature parameters include: spectral flux, 95% spectral rolloff, zero crossing rate (zcr), intra-frame spectral variance, and the ratio of low-frequency band energy to full-band energy. The quasi-long-term features are the variance and moving average of each short-term feature parameter; in one embodiment of the present invention the statistics cover 10 frames, i.e. a duration of 100 ms.
The definitions and calculation formulas of these feature parameters are given below.

Define x(i) as the i-th time-domain sample of a frame of the sound signal, where 0 ≤ i < M; T denotes the number of frames; M denotes the number of samples in one frame; N denotes the window length of the FFT transform; U_pw(k) denotes the magnitude at frequency k of the FFT spectrum of the current frame; and var denotes the variance of a feature parameter of the current signal frame. Taking a sound signal with a 16 kHz sampling rate as an example, the extraction of the short-term feature parameters is described in detail below.
1. Computing the spectral flux (flux) and its variance (var_flux)

The spectral flux is computed as in Equation [4]:

flux(i) = Σ_{k=1..N} ( U_pw_i(k) − U_pw_{i−1}(k) )²    Equation [4]

The variance of the spectral flux (var_flux) is computed as in Equation [5]:

var_flux(i) = (1/10) · Σ_{j=i−10..i} ( flux(j) − mean_flux(i) )²    Equation [5]

where, when the sampling frequency of the input audio signal is 16 kHz, mean_flux(i) denotes the mean of the normalized spectral flux parameter from frame i−10 to frame i.
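A minimal sketch of Equations [4] and [5] follows. It is illustrative only and not the patent's implementation; plain Python lists stand in for the FFT magnitude spectra, and the history length passed in plays the role of the patent's 10-frame window.

```python
def spectral_flux(cur_mag, prev_mag):
    # Equation [4]: sum of squared differences between the magnitude
    # spectra of the current and previous frames
    return sum((c - p) ** 2 for c, p in zip(cur_mag, prev_mag))

def flux_variance(flux_history):
    # Equation [5]: variance of the flux values over the supplied
    # history (the patent averages over 10 frames, i.e. 100 ms)
    m = sum(flux_history) / len(flux_history)
    return sum((f - m) ** 2 for f in flux_history) / len(flux_history)
```

Music tends to produce a smaller, more stable flux than speech, which is why both the instantaneous value and its variance are tracked.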
2. Computing the 95% spectral rolloff (rolloff) and its variance (rolloff_var)

rolloff denotes the frequency position at which the energy accumulated from low frequency toward high frequency reaches 95% of the full-band energy. It is computed as in Equation [6]:

rolloff = arg max over K of ( Σ_{i=1..K} U_pw(i) < 0.95 · Σ_{j=1..N} U_pw(j) )    Equation [6]

The variance of the 95% spectral rolloff (rolloff_var) is computed as in Equation [7]:

rolloff_var(i) = (1/10) · Σ_{j=i−10..i} ( rolloff(j) − mean_rolloff(i) )²    Equation [7]

where mean_rolloff(i) denotes the mean of the 95% rolloff parameter from frame i−10 to frame i.

3. Computing the zero crossing rate (zcr):

zcr = Σ_i II{ x(i) · x(i−1) < 0 }    Equation [8]

where the value of II{A} is determined by A: II{A} is 1 when A is true and 0 when A is false.

4. Computing the variance of the intra-frame spectral magnitude (magvar):

magvar = (2/N) · Σ_{j=N/2..N} ( U_pw(j) − mean_U_pw )²    Equation [9]

where mean_U_pw denotes the spectral mean of the high-frequency portion of the current frame.

5. Computing the energy ratio of the low-frequency band to the full band (ratio1):

ratio1 = ( Σ_{k=R1_F1..R1_F2} U_pw(k) ) / ( Σ_{k=1..N} U_pw(k) )    Equation [10]

where R1_F1 denotes the lower limit of the low-frequency sub-band and R1_F2 denotes its upper limit.
As can be seen from the above, the feature parameters are extracted by an independent module rather than inside a coding algorithm, so the feature parameter extraction module does not depend on any existing encoder. Moreover, since the feature extraction does not depend on bandwidth, the GSAD does not depend on the signal sampling rate, which greatly enhances the portability of the system.
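The short-term features of Equations [6] through [10] can be sketched together in a few lines. This is an illustrative reading only, not the patent's implementation: the `lo`/`hi` arguments stand in for the R1_F1/R1_F2 bin limits, and the bin range assumed in `magnitude_variance` is a guess at the partially garbled Equation [9].

```python
def spectral_rolloff(mag, fraction=0.95):
    """Equation [6]: largest bin K whose cumulative energy is still
    below `fraction` of the full-band energy."""
    total = sum(mag)
    cum, k = 0.0, 0
    for i, m in enumerate(mag, start=1):
        cum += m
        if cum < fraction * total:
            k = i
        else:
            break
    return k

def zero_crossing_rate(x):
    # Equation [8]: count of sign changes between adjacent samples
    return sum(1 for i in range(1, len(x)) if x[i] * x[i - 1] < 0)

def magnitude_variance(mag_hi):
    """Equation [9] sketch: variance of the high-frequency spectral
    magnitudes around their mean (exact bin range and scaling in the
    patent are assumptions)."""
    m = sum(mag_hi) / len(mag_hi)
    return sum((u - m) ** 2 for u in mag_hi) / len(mag_hi)

def low_band_ratio(mag, lo, hi):
    # Equation [10]: energy in bins [lo, hi] over full-band energy;
    # lo/hi are hypothetical stand-ins for R1_F1/R1_F2
    total = sum(mag)
    return sum(mag[lo:hi + 1]) / total if total else 0.0
```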
III. First signal class determination sub-module

The function of the first signal class determination sub-module is to divide the input digital sound signal into three classes: silence, noise signals and non-noise signals (i.e. useful signals). This is accomplished mainly through three parts: noise parameter initialization, noise determination and noise updating. Before the noise parameters are initialized, the long-term requirement of the initialization process is adjusted according to the current environment (speech/music): the requirement is shortened when the current environment is speech and lengthened when it is music. The working principle of the first signal class determination sub-module is shown in Figure 3:
First, the feature parameters of the current frame are obtained.
Then, it is checked whether the initialization of the noise parameter estimates has been completed:
If the noise parameter initialization has not been completed, a strict noise determination is made for the current signal frame based on its feature parameters and the noise parameter thresholds: the feature parameters of the current signal frame are compared with the noise parameter thresholds, and if the comparison result falls within the noise category, the strict determination is that the current signal frame is a noise frame; otherwise, the strict determination is that the current frame is a non-noise frame (i.e. a useful signal).
When making the noise determination, the variance of the spectral magnitude of the current signal frame, magvar, can be used as the feature parameter compared with the noise parameter threshold: if magvar is smaller than the threshold, the strict determination is that the current signal frame is a noise frame; otherwise, the strict determination is that the current frame is a non-noise frame (i.e. a useful signal).
If the strict determination is that the current frame is a non-noise frame, a non-noise flag is output and the posterior signal-to-noise ratio (Posterior SNR) of the current frame is computed using Equation [11]. The computed Posterior SNR is used to adjust the feature parameter thresholds for silence, noise, unvoiced, voiced and music:

PosteriorSNR = ( Σ_{k=1..K} U_pw(k) ) / σ_n    Equation [11]

where σ_n denotes the variance of the noise and K is the number of sub-bands.
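One plausible reading of Equation [11] can be sketched as below. Note that the patent's typeset formula is damaged, so the exact normalization (total sub-band power over the noise variance estimate) is an assumption.

```python
def posterior_snr(subband_power, noise_variance):
    """Hedged sketch of Equation [11]: the frame's total sub-band power
    divided by the noise variance estimate. The exact normalization is
    an assumption, since the published formula is garbled."""
    return sum(subband_power) / noise_variance
```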
The purpose of adaptively adjusting and updating the feature parameter thresholds is to make the decision process reach the same decision result under different SNR conditions. For the same signal segment, the values of the same feature parameter differ under different SNRs (as reflected by the Posterior SNR); that is, the values of the signal's feature parameters are affected by the SNR. Therefore, to reach the same decision result under different SNRs, the decision threshold of each feature parameter must be updated adaptively according to the SNR of the current signal frame, the specific update depending on how the corresponding feature parameter is actually affected by the SNR.
If the strict determination is that the current signal frame is a noise frame, a silence determination is then made based on the feature parameters of the current signal frame and the silence parameter threshold: the signal energy of the current frame is compared with a silence threshold. If it is below the threshold, the current signal frame is judged to be silence and a silence flag is output. If it is above the threshold, the current signal frame is not silence but a noise frame, so a noise flag is output and the noise parameter estimates are initialized from the current noise frame and the preceding noise frames, while the number of frames judged to be noise so far is recorded; when the recorded number of frames reaches the number required for initialization, the noise parameter initialization is marked as complete. The initialization of the noise parameter estimates involves the mean E_n and variance σ_n of the noise spectrum, computed as in Equations [12] and [13]:

E_n = (1/T) · Σ_{t=1..T} U_PW    Equation [12]

σ_n = (1/T) · Σ_{t=1..T} U_PW²    Equation [13]

where U_PW in Equations [12] and [13] is the matrix vector of the sub-band powers of the current signal frame.
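Equations [12] and [13] can be sketched as follows; this is illustrative only, and the per-frame list-of-lists data layout is an assumption.

```python
def init_noise_estimates(frames):
    """Equations [12]/[13]: per-sub-band mean and mean-square of the
    sub-band powers over the first T frames judged to be noise.
    `frames` is a list of per-frame sub-band power vectors."""
    T = len(frames)
    K = len(frames[0])
    mean = [sum(f[k] for f in frames) / T for k in range(K)]      # E_n
    power = [sum(f[k] ** 2 for f in frames) / T for k in range(K)]  # sigma_n
    return mean, power

mean, power = init_noise_estimates([[1.0, 2.0], [3.0, 4.0]])
```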
If the noise parameter initialization has been completed, the spectral distance between the feature parameters of the current signal frame and the noise parameter estimates is computed, and the noise determination is made by comparing that spectral distance with a spectral distance threshold. If the computed spectral distance is smaller than the threshold, the silence determination described above is made: the signal energy of the current frame is compared with the silence threshold; if it is below the threshold, the current frame is judged to be silence and a silence flag is output; if it is above the threshold, the current frame is not silence but a noise frame, so a noise flag is output, the noise parameter estimates are updated with the spectral mean E_n and variance σ_n of the current frame, and the updated estimates are output. The update formulas are given by Equations [14] and [15]:

E_n(i) = (1 − β) · E_n(i − 1) + β · E_n(i)    Equation [14]

σ_n(i) = (1 − α) · σ_n(i − 1) + α · σ_n(i)    Equation [15]
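The recursive updates of Equations [14] and [15] are first-order exponential smoothing, which can be sketched directly; the values of the smoothing factors α and β are assumptions, since the patent does not state them.

```python
def update_noise_mean(prev_mean, frame_mean, beta=0.1):
    # Equation [14]: recursive smoothing of the noise spectral mean;
    # beta controls how fast the estimate tracks the current frame
    return (1 - beta) * prev_mean + beta * frame_mean

def update_noise_variance(prev_var, frame_var, alpha=0.1):
    # Equation [15]: the same recursion applied to the noise variance
    return (1 - alpha) * prev_var + alpha * frame_var
```

A small β/α makes the noise estimate slow-moving, so brief bursts of speech or music do not corrupt it.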
If the computed spectral distance is larger than the set spectral distance threshold, the current signal frame is a non-noise frame; the Posterior SNR of the current signal frame is then computed using Equation [11], the feature parameter thresholds are adjusted with the newly computed Posterior SNR, and a non-noise (useful signal) flag is output.
IV. Second signal class determination sub-module

If the current signal frame has been judged by the first signal class determination sub-module to be a noise frame, the decision result is output directly; if it has been judged to be a non-noise frame, the current signal frame enters the second signal class determination sub-module for classification into voiced, unvoiced and music signals. The specific decision can be made in two steps. In the first step, the signal is strictly judged according to the characteristics of the feature parameters, classifying the non-noise signal as voiced, unvoiced or music; the judgment used here is mainly a hard (threshold) decision. The second step addresses uncertain signals that have been judged to be both voiced and music, or neither voiced nor music. Various auxiliary decision methods can be used, for example a probabilistic decision: probability models are used to compute the probability that the uncertain signal is voiced and the probability that it is music, and the class with the larger probability is taken as the final classification. The probability model may be a Gaussian mixture model (GMM), whose parameters are the parameters extracted by the feature parameter extraction module.
The decision flow of the first step is shown in Figure 4. First, the feature parameters of the non-noise frames output by the first signal class determination sub-module are obtained; then the feature parameters of the non-noise signal frame are compared with the unvoiced parameter threshold:
If the comparison of the feature parameters of the non-noise signal frame with the unvoiced parameter threshold falls within the unvoiced category, the non-noise signal frame is judged to be unvoiced and an unvoiced signal flag is output. The feature parameter used to judge unvoiced speech can be the zero crossing rate (zcr): if zcr is greater than the unvoiced parameter threshold, the non-noise signal frame is judged to be unvoiced and the unvoiced signal flag is output.
If the comparison does not fall within the unvoiced category, it is next determined whether the non-noise signal frame is voiced. If the comparison of the frame's feature parameters with the voiced parameter thresholds falls within the voiced category, the non-noise frame is determined to be voiced and the voiced signal flag is set to 1; otherwise, the frame is determined not to be voiced and the voiced signal flag is set to 0. The feature parameters used to judge voiced speech can be the spectral flux (flux) and its variance (var_flux): if flux is greater than its corresponding voiced parameter threshold, or var_flux is greater than its corresponding voiced parameter threshold, the non-noise frame is judged to be voiced and the voiced signal flag is set to 1; otherwise the voiced signal flag is set to 0.
若所述非噪声信号帧的特征参数与清音参数阈值的比较结果不属于清音的范畴, 还要判定所述非噪声信号帧是否属于音乐的范畴, 若所述非噪声信号帧的特征参数与所述音乐参数阈值的比较结果属于音乐的范畴, 则确定所述非噪声帧属于音乐, 并设置音乐信号标志 = 1; 否则, 确定所述非噪声帧不属于音乐, 并设置音乐信号标志 = 0。 判定音乐时使用的特征参数可以是谱波动方差 (var_flux) 的移动平均 (varmov_flux), 若 varmov_flux 小于音乐参数阈值, 则将所述非噪声帧判定为音乐, 并设置音乐信号标志 = 1; 否则, 确定所述非噪声帧不属于音乐, 并设置音乐信号标志 = 0。  If the comparison result between the characteristic parameter of the non-noise signal frame and the unvoiced parameter threshold does not fall into the unvoiced category, it is also determined whether the non-noise signal frame belongs to the music category: if the comparison between the characteristic parameter and the music parameter threshold falls into the music category, the non-noise frame is determined to be music and the music signal flag is set to 1; otherwise, it is determined not to be music and the flag is set to 0. The characteristic parameter used in the music decision may be the moving average (varmov_flux) of the spectral-flux variance (var_flux): if varmov_flux is smaller than the music parameter threshold, the non-noise frame is determined to be music and the music signal flag is set to 1; otherwise, the flag is set to 0.
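The varmov_flux music decision above can be sketched as a moving average over recent var_flux values. The window length and the music threshold are assumed example values; the underlying intuition is that music tends to have a low, stable spectral-flux variance.

```python
def moving_average(values, window=4):
    """Average of the most recent `window` values."""
    recent = values[-window:]
    return sum(recent) / len(recent)


def is_music(var_flux_history, music_threshold=0.2):
    # Music when the moving average of the flux variance stays below
    # the (assumed) music threshold.
    return moving_average(var_flux_history) < music_threshold
```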
若所述非噪声帧既属于浊音又属于音乐, 或者所述非噪声帧既不属于浊音又不属于音乐, 那么将信号判为不确定类信号, 然后用第二步的辅助判决方法, 比如概率判断, 对不确定信号继续判决, 将其判为浊音或音乐的一种, 从而将非噪声最终分为浊音、 清音和音乐。 以采用概率判决的方式对不确定信号继续判决为例进行说明, 具体如图 5 所示:  If the non-noise frame belongs to both voiced sound and music, or belongs to neither, the signal is judged as an uncertain-class signal, and a second-step auxiliary decision method, such as a probability decision, is then applied to classify it as either voiced sound or music, so that non-noise is finally divided into voiced sound, unvoiced sound and music. Continuing the decision on the uncertain signal by means of a probability decision is described below as an example, as shown in Figure 5:
首先利用概率模型分别计算不确定信号帧属于浊音和音乐信号的概率, 并将最大的概率值对应的声音类别作为不确定信号帧的最终分类; 然后修改所述不确定信号帧的类型标志; 最后输出所述信号帧的类型标志。  First, a probabilistic model is used to calculate the probabilities that the uncertain signal frame belongs to the voiced and music classes respectively, and the sound category corresponding to the maximum probability value is taken as the final classification of the uncertain signal frame; then the type flag of the uncertain signal frame is modified; finally, the type flag of the signal frame is output.
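The maximum-probability step above can be sketched with two one-dimensional Gaussian class models. The patent does not specify the probabilistic model; the Gaussian form, the feature, and the model means/variances below are all assumptions for illustration.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian at x."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def classify_uncertain(feature, models):
    """models: {class_name: (mean, var)}. Returns (best_class, best_prob)."""
    probs = {name: gaussian_pdf(feature, m, v) for name, (m, v) in models.items()}
    best = max(probs, key=probs.get)
    return best, probs[best]


models = {"voiced": (1.0, 0.25), "music": (3.0, 0.25)}
print(classify_uncertain(1.2, models)[0])  # → voiced (closer to the voiced mean)
```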
在利用概率判决方法时, 还可以将所计算出的最大概率与设定概率阈值 pth 进行比较, 如果所计算出的最大概率超过所述概率阈值 pth, 则对所述非噪声帧后续的信号帧进行拖尾处理; 否则, 不进行拖尾处理。  When the probability decision method is used, the calculated maximum probability may also be compared with a set probability threshold pth; if the calculated maximum probability exceeds the probability threshold pth, hangover (trailing) processing is performed on the signal frames following the non-noise frame; otherwise, no hangover processing is performed.
上述实施例中, 当判别当前信号帧归属的声音类别时, 所使用的特征参数可以是上述列举的特征参数之一, 也可以为其组合。 只要利用这些特征参数与特征参数阈值结合能够判断出当前信号帧归属的声音类别, 均不脱离本发明的思想。  In the above embodiment, when discriminating the sound category to which the current signal frame belongs, the characteristic parameter used may be one of the above-listed characteristic parameters, or a combination thereof. As long as these characteristic parameters, combined with the characteristic parameter thresholds, can determine the sound category to which the current signal frame belongs, the idea of the present invention is not departed from.
本发明提供的第二实施例是一种声音活动检测方法, 其主要思想是: 提取当前信号帧的特征参数; 并根据所述特征参数以及设定的参数阈值确定所述当前信号帧归属的声音类别。 其具体实施过程包括如下内容:  The second embodiment provided by the present invention is a voice activity detection method whose main idea is: extracting the characteristic parameters of the current signal frame, and determining the sound category to which the current signal frame belongs according to the characteristic parameters and the set parameter thresholds. The specific implementation process includes the following contents:
首先, 对当前信号帧依次进行序列分帧处理、 预加重处理、 加窗处理和快速傅立叶变换 FFT 处理, 得到相应的频域信号; 然后提取得到的当前频域信号帧的特征参数。 其中, 预加重处理是为了增强输入的当前信号帧的频谱, 加窗处理是为了减小帧起始和结束处的信号的不连续性。 具体实现与第一实施例中的相关描述雷同, 这里不再详细描述。  First, sequence framing, pre-emphasis, windowing and fast Fourier transform (FFT) processing are performed in turn on the current signal frame to obtain the corresponding frequency-domain signal; the characteristic parameters of the resulting current frequency-domain signal frame are then extracted. The pre-emphasis processing is to enhance the spectrum of the input current signal frame, and the windowing processing is to reduce the discontinuity of the signal at the beginning and end of the frame. The specific implementation is similar to the related description in the first embodiment, and is not described in detail here.
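The preprocessing chain above (pre-emphasis, windowing, transform to the frequency domain) can be sketched as follows. The pre-emphasis coefficient 0.97 is a common choice assumed here, the Hamming window is one possible windowing function, and a plain DFT stands in for the FFT for clarity; none of these specifics are fixed by the text.

```python
import math


def pre_emphasis(frame, coeff=0.97):
    """y[n] = x[n] - coeff * x[n-1], boosting high frequencies."""
    return [frame[0]] + [frame[i] - coeff * frame[i - 1] for i in range(1, len(frame))]


def hamming(frame):
    """Taper the frame edges to reduce boundary discontinuities."""
    n = len(frame)
    return [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
            for i, x in enumerate(frame)]


def dft_magnitudes(frame):
    """Magnitude spectrum via a direct DFT (an FFT computes the same result)."""
    n = len(frame)
    mags = []
    for k in range(n):
        re = sum(frame[t] * math.cos(2 * math.pi * k * t / n) for t in range(n))
        im = -sum(frame[t] * math.sin(2 * math.pi * k * t / n) for t in range(n))
        mags.append(math.hypot(re, im))
    return mags


spectrum = dft_magnitudes(hamming(pre_emphasis([0.0, 1.0, 0.0, -1.0] * 4)))
```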
然后, 判断是否完成噪声参数估计值初始化过程:  Then, it is judged whether the noise parameter estimation value initialization process is completed:
若未完成噪声参数估计值初始化过程, 则根据所述特征参数以及设定的噪声参数阈值进行噪声严格判定:  If the noise parameter estimate initialization process is not completed, a strict noise decision is performed according to the characteristic parameter and the set noise parameter threshold:

将所述特征参数与所述设定的噪声参数阈值比较, 并当比较结果属于噪声的范畴时, 则判定所述当前信号帧为噪声帧, 然后根据所述特征参数以及静音参数阈值进行静音判定: 将所述特征参数与所述静音参数阈值比较, 当比较结果属于静音的范畴时, 则判定所述当前信号帧为静音帧, 并输出相应的静音标志; 否则, 判定当前信号帧为噪声帧, 并输出噪声帧标志, 根据所述当前噪声帧及其之前的噪声帧计算噪声参数估计值; 并记录当前判为噪声帧的信号帧的帧数; 当记录的信号帧数量到达噪声参数估计值初始化需要的帧数量时, 则标志噪声参数估计值初始化过程完成。 具体实现与第一实施例中的相关描述雷同, 这里不再详细描述。  The characteristic parameter is compared with the set noise parameter threshold; when the comparison result belongs to the noise category, the current signal frame is determined to be a noise frame, and a silence decision is then performed according to the characteristic parameter and the silence parameter threshold: the characteristic parameter is compared with the silence parameter threshold, and when the comparison result belongs to the silence category, the current signal frame is determined to be a silence frame and the corresponding silence flag is output; otherwise, the current signal frame is determined to be a noise frame and a noise frame flag is output, the noise parameter estimate is calculated from the current noise frame and the preceding noise frames, and the number of signal frames so far determined to be noise frames is recorded; when the number of recorded signal frames reaches the number of frames required for initializing the noise parameter estimate, the noise parameter estimate initialization process is marked as completed. The specific implementation is similar to the related description in the first embodiment, and is not described in detail here.

当所述特征参数与所述设定的噪声参数阈值的比较结果不属于噪声的范畴时, 则判定所述当前信号帧为非噪声帧, 则计算所述当前信号帧的 Posterior SNR, 并利用所述 Posterior SNR 调整所述设定的特征参数的阈值。 具体实现与第一实施例中的相关描述雷同, 这里不再详细描述。  When the comparison result between the characteristic parameter and the set noise parameter threshold does not belong to the noise category, the current signal frame is determined to be a non-noise frame; the Posterior SNR of the current signal frame is then calculated, and the thresholds of the set characteristic parameters are adjusted using the Posterior SNR. The specific implementation is similar to the related description in the first embodiment, and is not described in detail here.
当噪声参数估计值初始化过程完成后, 计算当前信号帧的特征参数与所述噪声参数估计值之间的频谱距离, 然后根据所述频谱距离与设定的频谱距离阈值, 对当前信号帧进行噪声判定:  After the noise parameter estimate initialization process is completed, the spectral distance between the characteristic parameter of the current signal frame and the noise parameter estimate is calculated, and a noise decision is then performed on the current signal frame according to the spectral distance and the set spectral distance threshold:

若所述频谱距离小于设定的频谱距离阈值, 则判定所述当前信号帧为噪声帧, 则继续根据所述当前信号帧的特征参数以及静音参数阈值进行静音判定, 即将当前信号帧的信号能量与一个静音阈值进行比较, 如果小于所述静音阈值, 则判定当前信号帧为静音, 于是输出静音标志; 如果大于静音阈值, 则说明当前信号帧不为静音, 而是噪声帧, 于是输出噪声标志, 并利用所述当前帧的噪声参数更新所述噪声参数估计值;  If the spectral distance is smaller than the set spectral distance threshold, the current signal frame is determined to be a noise frame, and a silence decision is further performed according to the characteristic parameter of the current signal frame and the silence parameter threshold: the signal energy of the current signal frame is compared with a silence threshold; if it is smaller than the silence threshold, the current signal frame is determined to be silence and a silence flag is output; if it is greater than the silence threshold, the current signal frame is not silence but a noise frame, so a noise flag is output, and the noise parameter estimate is updated using the noise parameters of the current frame;

否则, 判定所述当前信号帧为非噪声, 则计算所述当前信号帧的 Posterior SNR, 并利用所述 Posterior SNR 调整设定的特征参数判决门限的阈值。 具体实现与第一实施例中的相关描述雷同, 这里不再详细描述。  Otherwise, the current signal frame is determined to be non-noise; the Posterior SNR of the current signal frame is then calculated, and the thresholds of the set characteristic-parameter decision criteria are adjusted using the Posterior SNR. The specific implementation is similar to the related description in the first embodiment, and is not described in detail here.
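The three-way frame decision above (noise vs. silence vs. non-noise) can be sketched as follows. The distance metric and both threshold values are assumed for illustration; the patent leaves them unspecified.

```python
def spectral_distance(frame_spectrum, noise_spectrum):
    """Mean absolute difference between the frame and noise-estimate spectra."""
    return sum(abs(a - b) for a, b in zip(frame_spectrum, noise_spectrum)) / len(frame_spectrum)


def classify_frame(frame_spectrum, energy, noise_spectrum,
                   dist_threshold=0.5, silence_threshold=0.01):
    # Close to the noise estimate: separate silence from noise by energy.
    if spectral_distance(frame_spectrum, noise_spectrum) < dist_threshold:
        return "silence" if energy < silence_threshold else "noise"
    # Far from the noise estimate: pass on to the non-noise classifiers.
    return "non-noise"
```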
经过上述过程能够判断出输入的当前信号帧属于噪声、 静音和非噪声三类, 之后还要判定当前信号帧具体属于哪种非噪声类别, 具体如下:  After the above process, it can be judged that the input current signal frame belongs to three categories of noise, mute and non-noise, and then it is determined which specific non-noise category the current signal frame belongs to, as follows:
当当前信号帧为非噪声时, 根据清音参数阈值, 以及所述当前信号帧的特征参数, 判定所述当前信号帧是否为清音:  When the current signal frame is non-noise, whether the current signal frame is unvoiced is determined according to the unvoiced parameter threshold and the characteristic parameter of the current signal frame:

将当前信号帧的特征参数与清音参数阈值比较, 当比较结果属于清音的范畴时, 则判定所述当前信号帧为清音, 则输出相应的清音标志;  The characteristic parameter of the current signal frame is compared with the unvoiced parameter threshold; when the comparison result belongs to the unvoiced category, the current signal frame is determined to be unvoiced and the corresponding unvoiced flag is output;

否则, 根据浊音参数阈值, 以及所述当前信号帧的特征参数, 判定所述当前信号帧是否为浊音: 将所述当前信号帧的特征参数与所述浊音参数阈值比较, 当比较结果属于浊音的范畴时, 则判定所述当前信号帧为浊音; 否则, 判定所述当前信号帧不属于浊音; 并且根据音乐参数阈值, 以及所述当前信号帧的特征参数, 判定所述当前信号帧是否为音乐: 将所述当前信号帧的特征参数与所述音乐参数阈值比较, 当比较结果属于音乐的范畴时, 则判定所述当前信号帧为音乐; 否则, 判定所述当前信号帧不属于音乐。 具体实现与第一实施例中的相关描述雷同, 这里不再详细描述。  Otherwise, whether the current signal frame is voiced is determined according to the voiced parameter threshold and the characteristic parameter of the current signal frame: the characteristic parameter is compared with the voiced parameter threshold, and when the comparison result belongs to the voiced category, the current signal frame is determined to be voiced; otherwise, it is determined not to be voiced. In addition, whether the current signal frame is music is determined according to the music parameter threshold and the characteristic parameter of the current signal frame: the characteristic parameter is compared with the music parameter threshold, and when the comparison result belongs to the music category, the current signal frame is determined to be music; otherwise, it is determined not to be music. The specific implementation is similar to the related description in the first embodiment, and is not described in detail here.
当所述当前信号帧既属于浊音又属于音乐, 或, 当所述当前信号帧既不属于浊音又不属于音乐时, 利用概率模型分别计算所述当前信号帧属于浊音和音乐的概率, 并选择大的概率值对应的声音类别作为当前信号帧的归属类别。 具体实现与第一实施例中的相关描述雷同, 这里不再详细描述。  When the current signal frame belongs to both voiced sound and music, or belongs to neither, a probability model is used to calculate the probabilities that the current signal frame belongs to voiced sound and to music respectively, and the sound category corresponding to the larger probability value is selected as the category of the current signal frame. The specific implementation is similar to the related description in the first embodiment, and is not described in detail here.

比较所述大的概率值与概率阈值, 当所述大的概率值大于所述概率阈值时, 则根据当前信号帧所归属的声音类别对当前信号帧后续一定数量的信号帧进行拖尾处理。 具体实现与第一实施例中的相关描述雷同, 这里不再详细描述。  The larger probability value is compared with a probability threshold; when the larger probability value is greater than the probability threshold, hangover processing is performed on a certain number of signal frames following the current signal frame according to the sound category to which the current signal frame belongs. The specific implementation is similar to the related description in the first embodiment, and is not described in detail here.
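The hangover ("trailing") step above can be sketched as follows: when a frame's winning-class probability exceeds the threshold, the next few frames inherit its label. The hangover length and probability threshold are assumed example values.

```python
def apply_hangover(decisions, probs, p_th=0.8, hangover=2):
    """Propagate a confident frame's label to the following `hangover` frames."""
    out = list(decisions)
    i = 0
    while i < len(out):
        if probs[i] > p_th:
            # Confident decision: smear its label over the next frames.
            for j in range(i + 1, min(i + 1 + hangover, len(out))):
                out[j] = out[i]
            i += hangover + 1
        else:
            i += 1
    return out


labels = ["voiced", "noise", "noise", "music"]
probs = [0.9, 0.1, 0.1, 0.5]
print(apply_hangover(labels, probs))  # → ['voiced', 'voiced', 'voiced', 'music']
```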
由上述本发明提供的具体实施方案可以看出, 本发明的实施例在需要进行声音活动检测时提取分类过程所使用的特征参数, 因此不依赖于某一具体的编码算法, 独立进行, 方便了维护和更新。 另外, 本发明的实施例根据提取得到的特征参数以及设定的参数阈值确定所述当前信号帧归属的声音类别, 能将输入的窄带音频或宽带音频数字信号分为静音、 噪声、 浊音、 清音和音乐五类, 其应用在语音编码技术领域中时, 不仅能够作为新开发的变速率音频编码算法和标准的速率选择依据, 还可以为现有没有 VAD 算法的编码标准提供一个速率选择的依据; 由于输出的信号类别比较多, 所以本发明还能够应用于语音增强、 语音识别、 说话人识别等其它语音信号处理领域, 具有很强的通用性。 显然, 本领域的技术人员可以对本发明进行各种修改和变型而不脱离本发明的精神和范围。 这样, 倘若本发明的这些修改和变型属于本发明权利要求及其等同技术的范围之内, 则本发明也意图包含这些改动和变型在内。  As can be seen from the specific embodiments provided above, embodiments of the present invention extract the characteristic parameters used in the classification process when sound activity detection is required, and thus do not depend on any specific coding algorithm; detection is performed independently, which facilitates maintenance and updating. In addition, embodiments of the present invention determine the sound category to which the current signal frame belongs according to the extracted characteristic parameters and the set parameter thresholds, and can classify an input narrowband or wideband digital audio signal into five categories: silence, noise, voiced sound, unvoiced sound and music. When applied in the field of speech coding, this can serve not only as a rate-selection basis for newly developed variable-rate audio coding algorithms and standards, but also as a rate-selection basis for existing coding standards that lack a VAD algorithm. Because many signal categories are output, the present invention can also be applied to other speech signal processing fields such as speech enhancement, speech recognition and speaker recognition, and is therefore highly versatile. Obviously, those skilled in the art can make various modifications and variations to the present invention without departing from the spirit and scope of the present invention. Thus, if these modifications and variations fall within the scope of the claims of the present invention and their technical equivalents, the present invention is also intended to include them.

Claims

权利要求  Claims
1、 一种声音活动检测方法, 其特征在于, 该方法包括: A method for detecting a sound activity, the method comprising:
在需要进行声音活动检测时, 提取当前信号帧的特征参数;  Extracting characteristic parameters of the current signal frame when sound activity detection is required;
根据所述特征参数以及设定的参数阈值确定所述当前信号帧归属的声音类别。  Determining, according to the characteristic parameter and the set parameter threshold, the sound category to which the current signal frame belongs.
2、 如权利要求 1所述的方法, 其特征在于, 在提取当前信号帧的特征参数 的过程之前包括:  2. The method of claim 1 wherein prior to the process of extracting characteristic parameters of the current signal frame comprises:
对当前信号帧依次进行序列分帧处理和快速傅立叶变换 FFT 处理, 得到相应的频域信号。  The current signal frame is sequentially subjected to sequence framing processing and fast Fourier transform FFT processing to obtain a corresponding frequency domain signal.
3、 如权利要求 2所述的方法, 其特征在于, 在提取当前信号帧的特征参数 之前还包括:  3. The method according to claim 2, further comprising: before extracting the feature parameters of the current signal frame:
对当前信号帧进行序列分帧处理后得到的信号帧, 进行预加重处理和 /或加 窗处理。  The signal frame obtained by performing sequence framing processing on the current signal frame is subjected to pre-emphasis processing and/or windowing processing.
4、 如权利要求 1所述的方法, 其特征在于, 所述根据所述特征参数以及设定的参数阈值确定所述当前信号帧归属的声音类别的过程, 具体包括:  The method of claim 1, wherein the process of determining, according to the characteristic parameter and the set parameter threshold, the sound category to which the current signal frame belongs specifically comprises:
根据所述特征参数以及设定的参数阈值, 确定出所述当前信号帧归属的声音类别为噪声帧、 静音帧或非噪声帧; 并当所述当前信号帧为非噪声帧时, 则根据所述特征参数以及设定的参数阈值确定出所述当前信号帧归属的具体声音类别。  Determining, according to the characteristic parameter and the set parameter threshold, that the sound category to which the current signal frame belongs is a noise frame, a silence frame or a non-noise frame; and, when the current signal frame is a non-noise frame, determining the specific sound category to which the current signal frame belongs according to the characteristic parameter and the set parameter threshold.
5、 如权利要求 4所述的方法, 其特征在于, 根据所述特征参数以及设定的参数阈值, 确定出所述当前信号帧归属的声音类别为噪声帧、 静音帧或非噪声帧的过程, 具体包括:  The method according to claim 4, wherein the process of determining, according to the characteristic parameter and the set parameter threshold, that the sound category to which the current signal frame belongs is a noise frame, a silence frame or a non-noise frame specifically comprises:

当未完成噪声参数估计值初始化过程时, 根据所述特征参数以及噪声参数阈值进行噪声严格判定:  When the noise parameter estimate initialization process is not completed, performing a strict noise decision according to the characteristic parameter and the noise parameter threshold:

将所述特征参数与噪声参数阈值比较, 若比较结果属于噪声的范畴, 则判定所述当前信号帧为噪声帧, 然后根据所述特征参数以及静音参数阈值进行静音判定: 将所述特征参数与所述静音参数阈值比较, 并当比较结果属于静音的范畴时, 则判定所述当前信号帧为静音帧; 否则, 判定当前帧为噪声帧, 根据所述当前噪声帧及其之前的噪声帧计算噪声参数估计值;  comparing the characteristic parameter with the noise parameter threshold; if the comparison result belongs to the noise category, determining that the current signal frame is a noise frame, and then performing a silence decision according to the characteristic parameter and the silence parameter threshold: comparing the characteristic parameter with the silence parameter threshold, and when the comparison result belongs to the silence category, determining that the current signal frame is a silence frame; otherwise, determining that the current frame is a noise frame, and calculating a noise parameter estimate according to the current noise frame and the preceding noise frames;

将所述特征参数与所述设定的噪声参数阈值比较, 并当比较结果不属于噪声的范畴时, 则判定所述当前信号帧为非噪声帧。  comparing the characteristic parameter with the set noise parameter threshold, and when the comparison result does not belong to the noise category, determining that the current signal frame is a non-noise frame.
6、 如权利要求 5所述的方法, 其特征在于, 该方法还包括:  6. The method of claim 5, further comprising:
当判定当前帧为噪声帧后, 记录当前判为噪声帧的信号帧的帧数; 当记录 的信号帧数量到达噪声参数估计值初始化需要的帧数量时, 则标志噪声参数估 计值初始化过程完成。  After determining that the current frame is a noise frame, the number of frames of the signal frame currently determined to be a noise frame is recorded; when the number of recorded signal frames reaches the number of frames required for initialization of the noise parameter estimation value, the initialization process of the flag noise parameter estimation value is completed.
7、 如权利要求 4所述的方法, 其特征在于, 所述根据所述特征参数以及设定的参数阈值, 确定出所述当前信号帧归属的声音类别为噪声帧、 静音帧或非噪声帧的过程, 具体包括:  The method according to claim 4, wherein the process of determining, according to the characteristic parameter and the set parameter threshold, that the sound category to which the current signal frame belongs is a noise frame, a silence frame or a non-noise frame specifically comprises:

当噪声参数估计值初始化过程完成后, 计算当前信号帧的特征参数与所述噪声参数估计值之间的频谱距离, 然后根据所述频谱距离与设定的频谱距离阈值, 对当前信号帧进行噪声判定:  after the noise parameter estimate initialization process is completed, calculating the spectral distance between the characteristic parameter of the current signal frame and the noise parameter estimate, and then performing a noise decision on the current signal frame according to the spectral distance and the set spectral distance threshold:

将所述频谱距离与设定的频谱距离阈值比较, 并当比较结果属于噪声的范畴时, 则判定所述当前信号帧为噪声帧, 然后根据所述特征参数以及静音参数阈值进行静音判定: 将所述特征参数与所述静音参数阈值比较, 并当比较结果属于静音的范畴时, 则判定所述当前信号帧为静音帧; 否则, 判定当前帧为噪声帧, 并利用所述当前帧的信号参数更新所述噪声参数估计值;  comparing the spectral distance with the set spectral distance threshold, and when the comparison result belongs to the noise category, determining that the current signal frame is a noise frame, and then performing a silence decision according to the characteristic parameter and the silence parameter threshold: comparing the characteristic parameter with the silence parameter threshold, and when the comparison result belongs to the silence category, determining that the current signal frame is a silence frame; otherwise, determining that the current frame is a noise frame, and updating the noise parameter estimate with the signal parameters of the current frame;
否则, 判定所述当前信号帧为非噪声帧。  Otherwise, it is determined that the current signal frame is a non-noise frame.
8、 如权利要求 5或 7所述的方法, 其特征在于, 该方法还包括:  8. The method according to claim 5 or 7, wherein the method further comprises:
当判定当前信号帧为非噪声时, 计算所述当前信号帧的信噪比 Posterior SNR, 并利用所述 Posterior SNR 调整设定的特征参数的阈值。  When it is determined that the current signal frame is non-noise, the signal-to-noise ratio (Posterior SNR) of the current signal frame is calculated, and the thresholds of the set characteristic parameters are adjusted by using the Posterior SNR.
9、 如权利要求 4所述的方法, 其特征在于, 当当前信号帧为非噪声帧时, 根据所述特征参数以及设定的参数阈值确定出所述当前信号帧归属的声音类别的过程包括: 根据清音参数阈值, 以及所述当前信号帧的特征参数, 判定所述当前信号帧是否为清音:  The method according to claim 4, wherein, when the current signal frame is a non-noise frame, the process of determining the sound category to which the current signal frame belongs according to the characteristic parameter and the set parameter threshold comprises: determining, according to the unvoiced parameter threshold and the characteristic parameter of the current signal frame, whether the current signal frame is unvoiced:

将当前信号帧的特征参数与清音参数阈值比较, 并当比较结果属于清音的范畴时, 则判定所述当前信号帧为清音;  comparing the characteristic parameter of the current signal frame with the unvoiced parameter threshold, and when the comparison result belongs to the unvoiced category, determining that the current signal frame is unvoiced;

否则, 根据浊音参数阈值, 以及所述当前信号帧的特征参数, 判定所述当前信号帧是否为浊音: 将所述当前信号帧的特征参数与所述浊音参数阈值比较, 当比较结果属于浊音的范畴时, 则判定所述当前信号帧为浊音; 否则, 判定所述当前信号帧不属于浊音; 并且根据音乐参数阈值, 以及所述当前信号帧的特征参数, 判定所述当前信号帧是否为音乐: 将所述当前信号帧的特征参数与所述音乐参数阈值比较, 并当比较结果属于音乐的范畴时, 则判定所述当前信号帧为音乐; 否则, 判定所述当前信号帧不属于音乐。  otherwise, determining whether the current signal frame is voiced according to the voiced parameter threshold and the characteristic parameter of the current signal frame: comparing the characteristic parameter of the current signal frame with the voiced parameter threshold, and when the comparison result belongs to the voiced category, determining that the current signal frame is voiced; otherwise, determining that the current signal frame is not voiced; and determining whether the current signal frame is music according to the music parameter threshold and the characteristic parameter of the current signal frame: comparing the characteristic parameter of the current signal frame with the music parameter threshold, and when the comparison result belongs to the music category, determining that the current signal frame is music; otherwise, determining that the current signal frame is not music.
10、 如权利要求 9 所述的方法, 其特征在于, 当所述当前信号帧既属于浊音又属于音乐, 或, 当所述当前信号帧既不属于浊音又不属于音乐时, 所述根据所述特征参数以及设定的参数阈值确定所述当前信号帧归属的声音类别的过程还包括:  The method according to claim 9, wherein, when the current signal frame belongs to both voiced sound and music, or when the current signal frame belongs to neither voiced sound nor music, the process of determining the sound category to which the current signal frame belongs according to the characteristic parameter and the set parameter threshold further comprises:
利用概率模型分别计算所述当前信号帧属于浊音和音乐的概率, 并选择大 的概率值对应的声音类别作为当前信号帧的归属类别。  The probability model is used to calculate the probability that the current signal frame belongs to voiced sound and music, and the sound category corresponding to the large probability value is selected as the attribution category of the current signal frame.
11、 如权利要求 10所述的方法, 其特征在于, 当所述当前信号帧既属于浊音又属于音乐, 或, 当所述当前信号帧既不属于浊音又不属于音乐时, 所述根据所述特征参数以及设定的参数阈值确定所述当前信号帧归属的声音类别的过程还包括:  The method according to claim 10, wherein, when the current signal frame belongs to both voiced sound and music, or when the current signal frame belongs to neither voiced sound nor music, the process of determining the sound category to which the current signal frame belongs according to the characteristic parameter and the set parameter threshold further comprises:

比较所述大的概率值与概率阈值, 当所述大的概率值大于所述概率阈值时, 则根据当前信号帧所归属的声音类别对当前信号帧后续一定数量的信号帧进行拖尾处理。  comparing the larger probability value with a probability threshold, and when the larger probability value is greater than the probability threshold, performing hangover processing on a certain number of signal frames following the current signal frame according to the sound category to which the current signal frame belongs.
12、 一种声音活动检测器, 其特征在于, 该声音活动检测器包括: 特征参数提取模块, 用于在需要进行声音活动检测时, 提取当前信号帧的特征参数; 信号类别判定模块, 用于根据所述特征参数以及设定的参数阈值确定所述当前信号帧归属的声音类别。  A sound activity detector, comprising: a characteristic parameter extraction module, configured to extract the characteristic parameters of a current signal frame when sound activity detection is required; and a signal category determination module, configured to determine, according to the characteristic parameters and the set parameter thresholds, the sound category to which the current signal frame belongs.
13、如权利要求 12所述的检测器, 其特征在于, 该声音活动检测器还包括: 信号预处理模块, 用于对当前信号帧依次进行序列分帧处理和快速傅立叶 变换 FFT处理, 并得到相应的频域信号提供给所述特征参数提取模块以及所述 信号类别判定模块。  The detector of claim 12, wherein the sound activity detector further comprises: a signal preprocessing module, configured to sequentially perform sequence framing processing and fast Fourier transform FFT processing on the current signal frame, and obtain A corresponding frequency domain signal is provided to the feature parameter extraction module and the signal class determination module.
14、 如权利要求 13所述的检测器, 其特征在于, 所述信号预处理模块还用 于:  14. The detector of claim 13, wherein the signal pre-processing module is further configured to:
对当前信号帧进行序列分帧处理后得到的信号帧, 进行预加重处理和 /或加 窗处理。  The signal frame obtained by performing sequence framing processing on the current signal frame is subjected to pre-emphasis processing and/or windowing processing.
15、 如权利要求 12所述的检测器, 其特征在于, 所述信号类别判定模块包 括:  The detector of claim 12, wherein the signal class determination module comprises:
第一信号类别判定子模块, 用于当未完成噪声参数估计值初始化过程时, 根据所述特征参数以及设定的噪声参数阈值进行噪声严格判定:  a first signal category determination submodule, configured to perform a strict noise decision according to the characteristic parameter and the set noise parameter threshold when the noise parameter estimate initialization process is not completed:

若所述特征参数与所述设定的噪声参数阈值比较, 比较结果属于噪声的范畴, 则判定所述当前信号帧为噪声帧, 然后根据所述特征参数以及静音参数阈值进行静音判定, 若所述特征参数与所述静音参数阈值比较, 比较结果属于静音的范畴, 则判定所述当前信号帧为静音帧; 否则, 判定当前帧为噪声帧, 根据所述当前噪声帧及其之前的噪声帧计算噪声参数估计值;  if the characteristic parameter is compared with the set noise parameter threshold and the comparison result belongs to the noise category, the current signal frame is determined to be a noise frame, and a silence decision is then performed according to the characteristic parameter and the silence parameter threshold; if the characteristic parameter is compared with the silence parameter threshold and the comparison result belongs to the silence category, the current signal frame is determined to be a silence frame; otherwise, the current frame is determined to be a noise frame, and the noise parameter estimate is calculated according to the current noise frame and the preceding noise frames;

若所述特征参数与所述设定的噪声参数阈值比较, 比较结果不属于噪声的范畴, 则判定所述当前信号帧为非噪声帧。  if the characteristic parameter is compared with the set noise parameter threshold and the comparison result does not belong to the noise category, the current signal frame is determined to be a non-noise frame.
16、 如权利要求 15所述的检测器, 其特征在于, 所述第一信号类别判定子 模块还用于:  The detector according to claim 15, wherein the first signal class determining sub-module is further configured to:
记录当前判为噪声帧的信号帧的帧数; 当记录的信号帧数量到达噪声参数 估计值初始化需要的帧数量时, 则标志噪声参数估计值初始化过程完成。  The number of frames of the signal frame currently determined as the noise frame is recorded; when the number of recorded signal frames reaches the number of frames required for the initialization of the noise parameter estimation value, the initialization process of the flag noise parameter estimation value is completed.
17、 如权利要求 15所述的检测器, 其特征在于, 所述第一信号类别判定子模块还用于: 当噪声参数估计值初始化过程完成后, 计算当前信号帧的特征参数与所述噪声参数估计值之间的频谱距离, 然后根据所述频谱距离与设定的频谱距离阈值, 对当前信号帧进行噪声判定:  The detector according to claim 15, wherein the first signal category determination submodule is further configured to: after the noise parameter estimate initialization process is completed, calculate the spectral distance between the characteristic parameter of the current signal frame and the noise parameter estimate, and then perform a noise decision on the current signal frame according to the spectral distance and the set spectral distance threshold:

将所述频谱距离与设定的频谱距离阈值比较, 当比较结果属于噪声的范畴时, 根据所述特征参数以及静音参数阈值进行静音判定: 将所述特征参数与所述静音参数阈值比较, 并当比较结果属于静音的范畴时, 则判定所述当前信号帧为静音帧; 否则, 判定所述当前信号帧为噪声帧, 利用所述当前帧的噪声参数更新所述噪声参数估计值;  comparing the spectral distance with the set spectral distance threshold, and when the comparison result belongs to the noise category, performing a silence decision according to the characteristic parameter and the silence parameter threshold: comparing the characteristic parameter with the silence parameter threshold, and when the comparison result belongs to the silence category, determining that the current signal frame is a silence frame; otherwise, determining that the current signal frame is a noise frame, and updating the noise parameter estimate with the noise parameters of the current frame;
否则, 判定所述当前信号帧为非噪声。  Otherwise, it is determined that the current signal frame is non-noise.
18、 如权利要求 15或 17所述的检测器, 其特征在于, 所述第一信号类别 判定子模块还用于:  The detector according to claim 15 or 17, wherein the first signal class determining sub-module is further configured to:
当判定当前信号帧为非噪声时, 计算所述当前信号帧的信噪比 Posterior SNR, 并利用所述 Posterior SNR 调整设定的特征参数的阈值。  When it is determined that the current signal frame is non-noise, the signal-to-noise ratio (Posterior SNR) of the current signal frame is calculated, and the thresholds of the set characteristic parameters are adjusted by using the Posterior SNR.
19、 如权利要求 18所述的检测器, 其特征在于, 所述信号类别判定模块还 包括:  The detector of claim 18, wherein the signal class determination module further comprises:
第二信号类别判定子模块, 用于当当前信号帧为非噪声时, 根据清音参数阈值, 以及所述当前信号帧的特征参数, 判定所述当前信号帧是否为清音: 将当前信号帧的特征参数与清音参数阈值比较, 当比较结果属于清音的范畴时, 则判定所述当前信号帧为清音; 否则, 根据浊音参数阈值, 以及所述当前信号帧的特征参数, 判定所述当前信号帧是否为浊音:  a second signal category determination submodule, configured to determine, when the current signal frame is non-noise, whether the current signal frame is unvoiced according to the unvoiced parameter threshold and the characteristic parameter of the current signal frame: the characteristic parameter of the current signal frame is compared with the unvoiced parameter threshold, and when the comparison result belongs to the unvoiced category, the current signal frame is determined to be unvoiced; otherwise, whether the current signal frame is voiced is determined according to the voiced parameter threshold and the characteristic parameter of the current signal frame:

将所述当前信号帧的特征参数与所述浊音参数阈值比较, 当比较结果属于浊音的范畴时, 则判定所述当前信号帧为浊音; 否则, 判定所述当前信号帧不属于浊音; 并且根据音乐参数阈值, 以及所述当前信号帧的特征参数, 判定所述当前信号帧是否为音乐: 将所述当前信号帧的特征参数与所述音乐参数阈值比较, 当比较结果属于音乐的范畴时, 则判定所述当前信号帧为音乐; 否则判定所述当前信号帧不属于音乐。  comparing the characteristic parameter of the current signal frame with the voiced parameter threshold, and when the comparison result belongs to the voiced category, determining that the current signal frame is voiced; otherwise, determining that the current signal frame is not voiced; and determining whether the current signal frame is music according to the music parameter threshold and the characteristic parameter of the current signal frame: comparing the characteristic parameter of the current signal frame with the music parameter threshold, and when the comparison result belongs to the music category, determining that the current signal frame is music; otherwise, determining that the current signal frame is not music.
20、 如权利要求 19所述的检测器, 其特征在于, 所述第二信号类别判定子模块还用于:  The detector according to claim 19, wherein the second signal category determination submodule is further configured to:

当所述当前信号帧既属于浊音又属于音乐, 或, 当所述当前信号帧既不属于浊音又不属于音乐时, 利用概率模型分别计算所述当前信号帧属于浊音和音乐的概率, 并选择大的概率值对应的声音类别作为当前信号帧的归属类别。  when the current signal frame belongs to both voiced sound and music, or when the current signal frame belongs to neither voiced sound nor music, calculate, using a probability model, the probabilities that the current signal frame belongs to voiced sound and to music respectively, and select the sound category corresponding to the larger probability value as the category of the current signal frame.
21. The detector of claim 20, wherein the second signal class determination sub-module is further configured to:
compare the larger probability value with a probability threshold, and when the larger probability value is greater than the probability threshold, perform hangover processing on a certain number of signal frames following the current signal frame according to the sound category to which the current signal frame belongs.
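The decision cascade of claim 19 (test unvoiced first; if that fails, test voiced and music independently) can be sketched as follows. This is a minimal illustrative sketch, not the patented implementation: the feature names (`zcr`, `energy`) and the idea of per-category (low, high) ranges standing in for the "parameter thresholds" are assumptions for the example.

```python
# Hypothetical sketch of the claim-19 cascade: a non-noise frame is first
# tested against the unvoiced parameter thresholds; only if it is not
# unvoiced are the voiced and music thresholds checked independently.
# Feature names and threshold ranges below are illustrative assumptions.

def classify_non_noise_frame(features, thresholds):
    """Classify a non-noise frame against per-category parameter ranges.

    features:   dict of per-frame characteristic parameters,
                e.g. {"zcr": 0.45, "energy": 0.02}
    thresholds: dict with "unvoiced", "voiced", "music" sub-dicts mapping
                a feature name to a (low, high) range.
    """
    def in_range(kind):
        # A frame matches a category when every compared feature falls
        # inside that category's parameter range.
        return all(lo <= features[name] <= hi
                   for name, (lo, hi) in thresholds[kind].items())

    if in_range("unvoiced"):
        return {"unvoiced": True, "voiced": False, "music": False}

    # Not unvoiced: voiced and music are tested separately, so a frame may
    # match both or neither -- the ambiguous cases handled by claim 20.
    return {"unvoiced": False,
            "voiced": in_range("voiced"),
            "music": in_range("music")}
```

A frame that matches neither voiced nor music (or both) falls through to the probability-model tie-break described in claims 20 and 21.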
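Claims 20 and 21 describe resolving an ambiguous frame with probability models and then gating hangover on the winning probability. The sketch below uses one-dimensional Gaussian class models as the "probability model"; the claims do not specify the model form, so the Gaussian choice, the model parameters, the probability threshold, and the hangover length are all illustrative assumptions.

```python
import math

# Hypothetical sketch of claims 20-21: score an ambiguous frame under a
# per-class probability model (here, assumed 1-D Gaussians), pick the class
# with the larger probability, and apply hangover to the following frames
# only when that probability exceeds a threshold. All numbers are invented.

VOICED_MODEL = {"mean": 0.1, "std": 0.05}   # assumed class model, not from the patent
MUSIC_MODEL = {"mean": 0.4, "std": 0.1}

def gaussian_pdf(x, model):
    z = (x - model["mean"]) / model["std"]
    return math.exp(-0.5 * z * z) / (model["std"] * math.sqrt(2 * math.pi))

def resolve_ambiguous(feature, prob_threshold=1.0, hangover_frames=5):
    """Return (category, number of subsequent frames to tag via hangover)."""
    p_voiced = gaussian_pdf(feature, VOICED_MODEL)
    p_music = gaussian_pdf(feature, MUSIC_MODEL)
    category = "voiced" if p_voiced >= p_music else "music"
    p_max = max(p_voiced, p_music)
    # Claim 21: hangover applies only when the winning probability is
    # larger than the probability threshold.
    hangover = hangover_frames if p_max > prob_threshold else 0
    return category, hangover
```

Gating hangover on the winning probability, as the claim does, keeps low-confidence decisions from being smeared over the frames that follow.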
PCT/CN2007/003364 2006-12-07 2007-11-28 Sound activity detecting method and sound activity detecting device WO2008067719A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN 200610161143 CN101197130B (en) 2006-12-07 2006-12-07 Sound activity detecting method and detector thereof
CN200610161143.6 2006-12-07

Publications (1)

Publication Number Publication Date
WO2008067719A1 (en)

Family

ID=39491655

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2007/003364 WO2008067719A1 (en) 2006-12-07 2007-11-28 Sound activity detecting method and sound activity detecting device

Country Status (2)

Country Link
CN (1) CN101197130B (en)
WO (1) WO2008067719A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8447601B2 (en) 2009-10-15 2013-05-21 Huawei Technologies Co., Ltd. Method and device for tracking background noise in communication system
CN111768801A (en) * 2020-06-12 2020-10-13 瑞声科技(新加坡)有限公司 Airflow noise eliminating method and device, computer equipment and storage medium
CN112992188A (en) * 2012-12-25 2021-06-18 中兴通讯股份有限公司 Method and device for adjusting signal-to-noise ratio threshold in VAD (voice over active) judgment

Families Citing this family (45)

Publication number Priority date Publication date Assignee Title
CN101625862B (en) * 2008-07-10 2012-07-18 新奥特(北京)视频技术有限公司 Method for detecting voice interval in automatic caption generating system
CN101625859B (en) * 2008-07-10 2012-06-06 新奥特(北京)视频技术有限公司 Method for determining waveform slope threshold of short-time energy frequency values in voice endpoint detection
US8380497B2 (en) * 2008-10-15 2013-02-19 Qualcomm Incorporated Methods and apparatus for noise estimation
CN101458943B (en) * 2008-12-31 2013-01-30 无锡中星微电子有限公司 Sound recording control method and sound recording device
CN102044242B (en) 2009-10-15 2012-01-25 华为技术有限公司 Method, device and electronic equipment for voice activation detection
CN102714034B (en) * 2009-10-15 2014-06-04 华为技术有限公司 Signal processing method, device and system
CN102044246B (en) * 2009-10-15 2012-05-23 华为技术有限公司 Method and device for detecting audio signal
CN101895373B (en) * 2010-07-21 2014-05-07 华为技术有限公司 Channel decoding method, system and device
CN101968957B (en) * 2010-10-28 2012-02-01 哈尔滨工程大学 Voice detection method under noise condition
US9111531B2 (en) * 2012-01-13 2015-08-18 Qualcomm Incorporated Multiple coding mode signal classification
US9099098B2 (en) * 2012-01-20 2015-08-04 Qualcomm Incorporated Voice activity detection in presence of background noise
CN103578477B (en) * 2012-07-30 2017-04-12 中兴通讯股份有限公司 Denoising method and device based on noise estimation
TWI612518B (en) 2012-11-13 2018-01-21 三星電子股份有限公司 Encoding mode determination method , audio encoding method , and audio decoding method
CN103065631B (en) 2013-01-24 2015-07-29 华为终端有限公司 A kind of method of speech recognition, device
CN103971680B (en) 2013-01-24 2018-06-05 华为终端(东莞)有限公司 A kind of method, apparatus of speech recognition
CN103646649B (en) * 2013-12-30 2016-04-13 中国科学院自动化研究所 A kind of speech detection method efficiently
CN111312277B (en) 2014-03-03 2023-08-15 三星电子株式会社 Method and apparatus for high frequency decoding of bandwidth extension
CN107086043B (en) * 2014-03-12 2020-09-08 华为技术有限公司 Method and apparatus for detecting audio signal
EP3913628A1 (en) 2014-03-24 2021-11-24 Samsung Electronics Co., Ltd. High-band encoding method
US9697843B2 (en) * 2014-04-30 2017-07-04 Qualcomm Incorporated High band excitation signal generation
CN104409080B (en) * 2014-12-15 2018-09-18 北京国双科技有限公司 Sound end detecting method and device
CN105810201B (en) * 2014-12-31 2019-07-02 展讯通信(上海)有限公司 Voice activity detection method and its system
US9886963B2 (en) * 2015-04-05 2018-02-06 Qualcomm Incorporated Encoder selection
US10186276B2 (en) * 2015-09-25 2019-01-22 Qualcomm Incorporated Adaptive noise suppression for super wideband music
CN106571146B (en) 2015-10-13 2019-10-15 阿里巴巴集团控股有限公司 Noise signal determines method, speech de-noising method and device
CN105609118B (en) * 2015-12-30 2020-02-07 生迪智慧科技有限公司 Voice detection method and device
CN107305774B (en) 2016-04-22 2020-11-03 腾讯科技(深圳)有限公司 Voice detection method and device
CN106354277A (en) * 2016-09-21 2017-01-25 成都创慧科达科技有限公司 Method and system for rapidly inputting phrases and sentences
CN106653047A (en) * 2016-12-16 2017-05-10 广州视源电子科技股份有限公司 Automatic gain control method and device for audio data
CN108242241B (en) * 2016-12-23 2021-10-26 中国农业大学 Pure voice rapid screening method and device thereof
CN107425906B (en) * 2017-07-25 2019-09-27 电子科技大学 Distributing optical fiber sensing signal processing method towards underground pipe network safety monitoring
CN107436451B (en) * 2017-07-26 2019-10-11 西安交通大学 A kind of amplitude spectral method of automatic calculating seismic data optical cable coupled noise degree of strength
CN107657961B (en) * 2017-09-25 2020-09-25 四川长虹电器股份有限公司 Noise elimination method based on VAD and ANN
CN107833579B (en) * 2017-10-30 2021-06-11 广州酷狗计算机科技有限公司 Noise elimination method, device and computer readable storage medium
CN107993071A (en) * 2017-11-21 2018-05-04 平安科技(深圳)有限公司 Electronic device, auth method and storage medium based on vocal print
CN109994129B (en) * 2017-12-29 2023-10-20 阿里巴巴集团控股有限公司 Speech processing system, method and device
CN108831508A (en) * 2018-06-13 2018-11-16 百度在线网络技术(北京)有限公司 Voice activity detection method, device and equipment
CN110085264B (en) * 2019-04-30 2021-10-15 北京如布科技有限公司 Voice signal detection method, device, equipment and storage medium
WO2021041568A1 (en) * 2019-08-27 2021-03-04 Dolby Laboratories Licensing Corporation Dialog enhancement using adaptive smoothing
CN110689905B (en) * 2019-09-06 2021-12-21 西安合谱声学科技有限公司 Voice activity detection system for video conference system
CN110890104B (en) * 2019-11-26 2022-05-03 思必驰科技股份有限公司 Voice endpoint detection method and system
CN111105815B (en) * 2020-01-20 2022-04-19 深圳震有科技股份有限公司 Auxiliary detection method and device based on voice activity detection and storage medium
CN111369982A (en) * 2020-03-13 2020-07-03 北京远鉴信息技术有限公司 Training method of audio classification model, audio classification method, device and equipment
CN112397086A (en) * 2020-11-05 2021-02-23 深圳大学 Voice keyword detection method and device, terminal equipment and storage medium
CN115334349B (en) * 2022-07-15 2024-01-02 北京达佳互联信息技术有限公司 Audio processing method, device, electronic equipment and storage medium

Citations (5)

Publication number Priority date Publication date Assignee Title
US4720862A (en) * 1982-02-19 1988-01-19 Hitachi, Ltd. Method and apparatus for speech signal detection and classification of the detected signal into a voiced sound, an unvoiced sound and silence
US5101434A (en) * 1987-09-01 1992-03-31 King Reginald A Voice recognition using segmented time encoded speech
CN1204766A (en) * 1997-03-25 1999-01-13 皇家菲利浦电子有限公司 Method and device for detecting voice activity
CN1354455A (en) * 2000-11-18 2002-06-19 深圳市中兴通讯股份有限公司 Sound activation detection method for identifying speech and music from noise environment
CN1447963A (en) * 2000-08-21 2003-10-08 康奈克森特系统公司 Method for noise robust classification in speech coding


Non-Patent Citations (1)

Title
BAI L. ET AL.: "Feature Analysis and Extraction for Audio Automatic Classification", MINI-MICRO SYSTEMS, vol. 26, no. 11, November 2005 (2005-11-01), pages 2029 - 2034 *


Also Published As

Publication number Publication date
CN101197130B (en) 2011-05-18
CN101197130A (en) 2008-06-11

Similar Documents

Publication Publication Date Title
WO2008067719A1 (en) Sound activity detecting method and sound activity detecting device
CA2690433C (en) Method and device for sound activity detection and sound signal classification
JP5425682B2 (en) Method and apparatus for robust speech classification
TWI591621B (en) Method of quantizing linear predictive coding coefficients, sound encoding method, method of de-quantizing linear predictive coding coefficients, sound decoding method, and recording medium
KR100964402B1 (en) Method and Apparatus for determining encoding mode of audio signal, and method and appartus for encoding/decoding audio signal using it
JP5325292B2 (en) Method and identifier for classifying different segments of a signal
US8396707B2 (en) Method and device for efficient quantization of transform information in an embedded speech and audio codec
JP6470857B2 (en) Unvoiced / voiced judgment for speech processing
US10141001B2 (en) Systems, methods, apparatus, and computer-readable media for adaptive formant sharpening in linear prediction coding
US7269561B2 (en) Bandwidth efficient digital voice communication system and method
CN107342094B (en) Very short pitch detection and coding
JPH09503874A (en) Method and apparatus for performing reduced rate, variable rate speech analysis and synthesis
CN105103229A (en) Decoder for generating frequency enhanced audio signal, method of decoding, encoder for generating an encoded signal and method of encoding using compact selection side information
KR20080097684A (en) A method for discriminating speech and music on real-time
US10672411B2 (en) Method for adaptively encoding an audio signal in dependence on noise information for higher encoding accuracy
JP4696418B2 (en) Information detection apparatus and method
Bäckström et al. Voice activity detection
Arun Sankar et al. Speech sound classification and estimation of optimal order of LPC using neural network
Srivastava et al. Performance evaluation of Speex audio codec for wireless communication networks
Liu et al. Efficient voice activity detection algorithm based on sub-band temporal envelope and sub-band long-term signal variability
KR100984094B1 (en) A voiced/unvoiced decision method for the smv of 3gpp2 using gaussian mixture model
Park Signal Enhancement of a Variable Rate Vocoder with a Hybrid domain SNR Estimator
Nyshadham et al. Enhanced Voice Post Processing Using Voice Decoder Guidance Indicators
Van Pham et al. Voice activity detection algorithms using subband power distance feature for noisy environments.
Han et al. On A Reduction of Pitch Searching Time by Preprocessing in the CELP Vocoder

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 07816895

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 07816895

Country of ref document: EP

Kind code of ref document: A1