WO2009046658A1 - Method and apparatus for determining the type of a non-noise audio signal - Google Patents


Info

Publication number
WO2009046658A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio signal
noise audio
signal
music
determining
Prior art date
Application number
PCT/CN2008/072455
Other languages
English (en)
French (fr)
Inventor
Jun Wang
Zhe Wang
Original Assignee
Huawei Technologies Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd. filed Critical Huawei Technologies Co., Ltd.
Publication of WO2009046658A1

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/02Feature extraction for speech recognition; Selection of recognition unit
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use

Definitions

  • Embodiments of the present invention relate to the field of wireless communication technologies, and in particular to a method and apparatus for determining the category of a non-noise audio signal. Background Art
  • With the rapid development of wireless communication technologies, VAD (Voice Activity Detection) has been widely applied. Every VAD method uses multiple feature parameters, most of which come from, or are derived from, parameters produced during the encoder's coding process.
  • For example, GSM (Global System for Mobile communication) defines four speech coder specifications: GSM full rate, GSM enhanced full rate, GSM half rate, and the adaptive multi-rate speech coder. The coding algorithms they are based on all differ, but each contains a VAD module that detects the speech signal within the communication signal.
  • The full-rate, enhanced-full-rate, and half-rate VAD algorithms have relatively low computational complexity; the parameters they use include signal energy, spectral stability information, and pitch information. Signal energy is the main decision basis, but its sensitivity to noise is relatively high; the latter two feature parameters only affect the decision threshold, yet they depend heavily on the coding algorithm, i.e., they have a certain degree of coupling with it.
  • The ITU (International Telecommunications Union) defined the G.723.1 and G.729 series of coding standards. G.723.1 embeds the VAD module directly in the coding algorithm; the algorithm is relatively simple and its performance is average. G.729 incorporates the VAD function in its Annex B (G.729B for short).
  • The VAD module of G.729B uses a 14-boundary decision technique in a four-dimensional space and smooths the multi-boundary decision results to preserve the long-term stationarity of natural speech, i.e., a decision region in the (four-dimensional) multidimensional space determined by 14 inequalities.
  • The VAD algorithm of G.729B uses full-band energy, low-band energy, zero-crossing rate, and line spectral pair parameters together with their running statistics, and has considerable coupling with the coding algorithm.
  • 3GPP (the 3rd Generation Partnership Project) defined the AMR, AMR-WB, and AMR-WB+ coding standards, which also contain VAD modules.
  • Their basic principle is to split the signal into multiple sub-bands, compute sub-band parameters within each sub-band, combine these sub-band parameters over the full band, and finally make the decision over the full band.
  • One difference is that AMR computes 9 sub-band energies of the input signal, while AMR-WB and AMR-WB+ use 12 sub-bands.
  • AMR contains two VAD algorithms with different complexity and performance.
  • The main characteristic of AMR's VAD module is that the signal-to-noise ratio is the core of its background-noise parameter estimation and decision logic; its complexity is low. Its pitch detection, tone detection, and complex-signal analysis modules all use parameters from the encoder's own open-loop pitch analysis module, so the coupling with the encoder algorithm is tight.
  • In the course of implementing the present invention, the inventors found at least the following problem in the prior art: the feature parameters used by the VAD modules contained in the algorithms of existing speech coding standards are tightly coupled to the encoder algorithms, which harms the independence and portability of the algorithm.
  • Embodiments of the present invention provide a method and apparatus for determining the category of a non-noise audio signal, so that the feature parameters used do not depend on an encoder algorithm, enhancing the independence and portability of the algorithm.
  • A technical solution of an embodiment of the present invention provides a method for determining the category of a non-noise audio signal, including: acquiring feature parameters of the non-noise audio signal; preliminarily deciding the category of the non-noise audio signal with a decision tree according to the feature parameters; and determining the category of the non-noise audio signal according to the context of the non-noise audio signal and the result of the preliminary decision.
  • The technical solution of an embodiment of the present invention further provides an apparatus for determining the category of a non-noise audio signal, including: a feature parameter acquiring unit, configured to acquire feature parameters of a non-noise audio signal; a first decision unit, configured to preliminarily decide the category of the non-noise audio signal with a decision tree according to the feature parameters acquired by the feature parameter acquiring unit; and a second decision unit, configured to determine the category of the non-noise audio signal according to the context of the non-noise audio signal and the result of the preliminary decision of the first decision unit.
  • the embodiment of the present invention determines the category of the non-noise audio signal by using the characteristic parameter of the non-noise audio signal that does not depend on the encoder algorithm, thereby enhancing the independence and portability of the algorithm.
  • FIG. 1 is a structural diagram of an apparatus for determining a category of a non-noise audio signal according to an embodiment of the present invention
  • FIG. 2 is a flow chart of a method for determining a category of a non-noise audio signal according to an embodiment of the present invention
  • FIG. 3 is a schematic diagram of state transition of a non-noise audio signal according to an embodiment of the present invention
  • FIG. 4 is a structural diagram of a multivariate decision tree according to an embodiment of the present invention
  • FIG. 5 is a flow chart of a preliminary decision method for a non-noise audio signal according to an embodiment of the present invention
  • FIG. 6 is a schematic structural diagram of a short-term decision tree according to an embodiment of the present invention.
  • FIG. 7 is a schematic structural diagram of a long-term decision tree according to an embodiment of the present invention. Detailed Description
  • a device for determining a non-noise audio signal class includes a feature parameter acquisition unit 11, a first decision unit 12, a second decision unit 13, and a state transition unit 14.
  • the first decision unit 12 is connected to the feature parameter obtaining unit 11 and the second decision unit 13, respectively;
  • the second decision unit 13 is connected to the state transition unit 14.
  • the feature parameter obtaining unit 11 is configured to acquire a feature parameter of the non-noise audio signal;
  • the first determining unit 12 is configured to initially determine the type of the non-noise audio signal by using a decision tree according to the feature parameter acquired by the feature parameter acquiring unit 11;
  • the second decision unit 13 is configured to determine the category of the non-noise audio signal according to the context of the non-noise audio signal and the result of the preliminary decision of the first decision unit 12;
  • the state transition unit 14 is configured to insert a transition state between the speech state and the music state of the non-noise audio signal.
  • The state transition unit 14 includes a state transition judging subunit 141, a duration judging subunit 142, and a converting subunit 143; the converting subunit 143 is connected to the state transition judging subunit 141 and the duration judging subunit 142.
  • The state transition judging subunit 141 is configured to judge, from the category of the non-noise audio signal determined by the second decision unit 13 and the signal's previous category, whether the state of the non-noise audio signal changes; the duration judging subunit 142 is configured to judge whether the signal has been continuously classified as the same category for a preset duration threshold; the converting subunit 143 is configured to switch between the state of the non-noise audio signal and the transition states according to the judgment of the state transition judging subunit 141 or of the duration judging subunit 142.
  • The feature parameters of the non-noise audio signal acquired by the feature parameter acquiring unit 11 include at least one of the following: normalized inter-frame spectral fluctuation flux; variance of the normalized inter-frame spectral fluctuation, varflux; sliding average of that variance, varmovflux; normalized band spectral fluctuation fflux; variance of the normalized band spectral fluctuation, varfflux; sliding average of that variance, varmovfflux; normalized sub-band energy standard deviation stdave; energy ratio ratio1; long-term average of the energy ratio, mov_ratio1; variance of the energy ratio, var_ratio1; time-domain zero-crossing rate zcr; harmonic structure stability feature hss.
  • The normalized inter-frame spectral fluctuation flux describes the change in spectrum between consecutive frames of the non-noise audio signal. The flux of a music signal is relatively low and stable; the flux of a speech signal is usually high and varies strongly.
  • flux is computed from SigFpw, the spectral amplitude signal obtained by an FFT of the time-domain non-noise audio signal, over the band bounded by FLUX_F1 and FLUX_F2, and normalized by norm. One example in the 16 kHz sampling mode is FLUX_F1 = 3, FLUX_F2 = 95; one example in the 8 kHz sampling mode is FLUX_F1 = 1, FLUX_F2 = 470.
  • norm is a normalization function; one special case is norm = max(ave_amp, AVE_E_FLUX), where ave_amp is the average spectral amplitude of the current frame and the preceding consecutive frames, and AVE_E_FLUX is used to avoid a very small denominator. One example is AVE_E_FLUX = 1000.
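As a concrete illustration, the flux computation described above can be sketched in Python. The patent gives the band boundaries and the norm = max(ave_amp, AVE_E_FLUX) normalization, but the exact summation appears only as an image in the source, so the summed absolute frame-to-frame amplitude difference below is an assumed form; the function and argument names are likewise illustrative.

```python
import numpy as np

def normalized_flux(frame_prev, frame_cur, f1=3, f2=95, ave_e_flux=1000.0):
    """Sketch of the normalized inter-frame spectral fluctuation (flux).

    f1, f2 stand in for FLUX_F1, FLUX_F2 (3 and 95 in the patent's
    16 kHz example). The band-summed absolute difference between the
    two frames' spectra is an assumed form of the omitted formula.
    """
    # SigFpw: spectral amplitudes of the previous and current frames.
    spec_prev = np.abs(np.fft.rfft(frame_prev))
    spec_cur = np.abs(np.fft.rfft(frame_cur))
    diff = np.sum(np.abs(spec_cur[f1:f2 + 1] - spec_prev[f1:f2 + 1]))
    # norm = max(ave_amp, AVE_E_FLUX); for brevity ave_amp is taken from
    # the current frame only rather than several recent frames.
    ave_amp = float(np.mean(spec_cur))
    return float(diff / max(ave_amp, ave_e_flux))
```

With the AVE_E_FLUX floor in the denominator, silent input yields flux = 0 rather than a division by zero.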
  • The normalized band spectral fluctuation fflux describes the spectral variation between sub-bands within the same frame of the non-noise audio signal. The fflux of a music signal is relatively low and stable; the fflux of a speech signal is usually high and varies strongly.
  • fflux is computed from SigFpw, the spectral amplitude signal obtained by an FFT of the time-domain non-noise audio signal, over the band bounded by FFLUX_F1, and normalized by norm. One example in the 16 kHz sampling mode is FFLUX_F1 = 63; one example in the 8 kHz sampling mode is FFLUX_F1 = 32.
  • norm is a normalization function; one special case is norm = max(ave_amp, AVE_E_FLUX), where ave_amp is the average spectral amplitude of the current frame and the preceding consecutive frames, and AVE_E_FLUX is used to avoid a very small denominator. One example is AVE_E_FLUX = 1000.
  • The normalized sub-band energy standard deviation stdave averages the normalized standard deviations of the sub-band energies over several consecutive frames.
  • In its formula, i is the sub-band index; j is the frame index; Tlen is the number of consecutive frames (in the example, Tlen = 4 consecutive frames extracts a short-term feature and Tlen = 16 consecutive frames a long-term feature); Bcnt is the number of sub-bands in the frequency domain; lev(i, j) is the energy of sub-band i in frame j, where Bi denotes the band boundary of the i-th sub-band.
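The averaging of per-band deviations can be sketched as follows. The patent's exact normalization appears only as an image, so dividing each sub-band's energy track by its mean before taking the standard deviation is an assumption, as are the function and argument names.

```python
from statistics import pstdev

def stdave(subband_energies, tlen=4):
    """Sketch: normalized sub-band energy standard deviation (stdave).

    subband_energies: list of frames, each a list of per-sub-band
    energies lev(i, j); tlen: number of consecutive frames (Tlen = 4
    short-term, Tlen = 16 long-term in the source).
    """
    recent = subband_energies[-tlen:]
    stds = []
    for i in range(len(recent[0])):
        track = [frame[i] for frame in recent]  # energy of band i per frame
        mean = sum(track) / len(track)
        if mean > 0:
            # Assumed normalization: divide the track by its own mean.
            stds.append(pstdev(t / mean for t in track))
    return sum(stds) / len(stds) if stds else 0.0
```

A perfectly steady signal gives stdave = 0; the more the sub-band energies fluctuate from frame to frame, the larger the value.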
  • The energy ratio ratio1 is the ratio of low-band energy to full-band energy. The ratio1 of a speech signal is usually large and varies strongly; the ratio1 of most music signals is usually small, with large variation.
  • ratio1 is computed over the low band [R1_F1, R1_F2], where R1_F1 and R1_F2 are band boundaries satisfying 0 ≤ R1_F1 < R1_F2.
  • The time-domain zero-crossing rate zcr is higher for speech than for music, because unvoiced segments appear at intervals in speech. zcr is computed as zcr = Σ_i I{x(i) · x(i−1) < 0}, where the indicator I{A} is 1 when A is true and 0 when A is false.
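The zero-crossing count just described is an indicator-function sum over consecutive sample pairs and can be written directly:

```python
def zero_crossing_rate(x):
    """zcr = sum_i I{x[i] * x[i-1] < 0}: count sign changes between
    consecutive time-domain samples (I{A} is 1 when A is true, else 0)."""
    return sum(1 for i in range(1, len(x)) if x[i] * x[i - 1] < 0)

print(zero_crossing_rate([1, -1, 1, -1]))  # 3
print(zero_crossing_rate([1, 2, 3]))       # 0
```

In practice the count is taken per frame, so noisy unvoiced speech frames score markedly higher than tonal music frames.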
  • One method of obtaining the harmonic structure stability feature hss is as follows. First, the local peaks of the signal are obtained from the monotonically increasing and monotonically decreasing intervals of the FFT spectrum.
  • The FFT spectrum is treated as a discrete multi-peak function, and its monotonically increasing and decreasing intervals are searched to obtain the local peaks and the global peak. The algorithm scans each frequency bin once and needs no iteration.
  • Then, several normalized log peaks of the signal are obtained from the largest local peaks: the A largest local peaks [P1, P2, ..., PA] and their positions are found, normalized, and passed through a logarithm to obtain the normalized log peaks [L1, L2, ..., LA].
  • A normalized log peak reflects an estimate of the signal's harmonic structure; it is given by Lj = log(Pj) − log(Σk Pk), j = 1, 2, ..., A.
  • Finally, the average variance VLP of each normalized log peak about its mean over the last N frames is computed, where ALP is the mean of the A normalized log peaks within N frames; to reduce complexity, ALP can also be replaced by a moving average.
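The single-pass peak picking and log normalization above can be sketched as follows. The choice A = 5 is illustrative, not from the source, and the simple three-point peak test is one way to realize the monotonic-interval search.

```python
import math

def normalized_log_peaks(spectrum, a=5):
    """Sketch of the first hss steps: find local peaks in one pass over
    the FFT amplitude spectrum, keep the A largest, then compute
    L_j = log(P_j) - log(sum_k P_k). A = 5 is an illustrative choice."""
    peaks = []
    for i in range(1, len(spectrum) - 1):
        # A local peak ends a monotonically increasing interval and
        # begins a monotonically decreasing one.
        if spectrum[i - 1] < spectrum[i] >= spectrum[i + 1]:
            peaks.append((spectrum[i], i))
    largest = sorted(peaks, reverse=True)[:a]   # A largest local peaks
    total = sum(p for p, _ in largest)
    return [math.log(p) - math.log(total) for p, _ in largest]
```

Across frames, stable harmonics keep these log peaks nearly constant, so their variance about ALP (the per-peak mean over N frames) stays small for music and grows for speech.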
  • Because the feature parameters are not obtained in the course of running a coding algorithm, their acquisition does not depend on any encoder; nor does it depend on the bandwidth, so the GSAD does not depend on the signal sampling rate, which enhances the portability of the system.
  • The present embodiment decides the category of the non-noise audio signal from feature parameters that do not depend on an encoder algorithm, which enhances the independence and portability of the algorithm.
  • A method for determining the category of a non-noise audio signal in an embodiment of the present invention is shown in FIG. 2 and includes the following steps:
  • Step s201: acquire the feature parameters of the non-noise audio signal.
  • The feature parameters of the non-noise audio signal include at least one of the following: normalized inter-frame spectral fluctuation flux; variance of the normalized inter-frame spectral fluctuation, varflux; sliding average of that variance, varmovflux; normalized band spectral fluctuation fflux; variance of the normalized band spectral fluctuation, varfflux; sliding average of that variance, varmovfflux; normalized sub-band energy standard deviation stdave; energy ratio ratio1; long-term average of the energy ratio, mov_ratio1; variance of the energy ratio, var_ratio1; time-domain zero-crossing rate zcr; harmonic structure stability feature hss.
  • Step s202: preliminarily decide the category of the non-noise audio signal with a decision tree according to the acquired feature parameters.
  • the decision tree in the embodiment of the present invention may be a multivariate decision tree or a univariate decision tree.
  • When the decision tree is a univariate decision tree, multiple univariate decision trees may be used; these may include a short-term decision tree and a long-term decision tree.
  • Step s203: determine the category of the non-noise audio signal according to the context of the non-noise audio signal and the result of the preliminary decision.
  • The process of determining the category of the non-noise audio signal in this embodiment is as follows: a hangover (trailing) protection value H0 is set for the acquired feature parameters of the non-noise audio signal. H0 is a fixed value (50 in this embodiment); it is initialized when the decision on the category begins and is decremented by 1 while greater than 0. If the hangover protection value H0 of any protected parameter is greater than 0, the music feature feature_mu or the speech feature feature_sp is set to 1.
  • The non-noise audio signal is then judged to be a speech signal, a music signal, or an uncertain signal from the hangover protection values and the result of the preliminary decision.
  • If the result of the decision in step s202 is a music signal while feature_mu is 0 and feature_sp is 1, or the result is speech while feature_sp is 0 and feature_mu is 1, the uncertainty flag uncertainflg is set to 3.
  • After hangover protection, intermediate parameters can also be updated. For example, when updating the two intermediate parameters music count music_Cnt and speech count speech_Cnt: if the result after hangover protection is a speech signal or an uncertain signal, speech_Cnt is incremented by 1, music_Cnt is set to 0, and speech_music_flg1 is set to 1; if the result is a music signal, music_Cnt is incremented by 1, speech_Cnt is set to 0, and speech_music_flg1 is set to 0.
  • speech_music_flg1 is used to decide the uncertain frames of the hangover decision (frames whose uncertainflg is not 0); the previous frame's speech_music_flg1 is saved in the variable speech_music_flg2.
  • When the result of the hangover decision is an uncertain signal, the uncertain frame is judged to be a speech signal if speech_music_flg2 is 1, and a music signal otherwise.
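The contradiction check and the fallback to the previous frame's class can be sketched as follows; variable names follow the source, but the full hangover counter bookkeeping is omitted, and treating the contradictory cases as the only uncertain ones is an assumption.

```python
def decide_with_hangover(prelim, feature_mu, feature_sp, speech_music_flg2):
    """Sketch of the context check: prelim is the decision-tree result
    ('music' or 'speech'); feature_mu/feature_sp are the hangover
    features; speech_music_flg2 holds the previous frame's class flag."""
    # A preliminary decision contradicting the hangover features is
    # marked uncertain (uncertainflg = 3 in the source).
    uncertain = ((prelim == 'music' and feature_mu == 0 and feature_sp == 1)
                 or (prelim == 'speech' and feature_sp == 0 and feature_mu == 1))
    if not uncertain:
        return prelim
    # Uncertain frames inherit the previous frame's class:
    # speech_music_flg2 == 1 -> speech, otherwise music.
    return 'speech' if speech_music_flg2 == 1 else 'music'
```

This keeps isolated contradictory frames from flipping the running classification, since they are resolved toward the recent history instead.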
  • Step s204: insert a transition state between the speech state and the music state of the non-noise audio signal.
  • a schematic diagram of state transition of a non-noise audio signal according to an embodiment of the present invention is shown in FIG. 3.
  • The non-noise audio signal has four states: a speech state, a music state, a speech-to-music state, and a music-to-speech state; the speech-to-music and music-to-speech states are transition states.
  • When the non-noise audio signal needs to change from the music state to the speech state, i.e., when its category is determined to be a speech signal while its previous category was a music signal, the signal enters the music-to-speech state from the music state. If the signal is then continuously classified as speech for the preset duration threshold, it enters the speech state from the music-to-speech state; if the threshold is not reached, it returns to the music state.
  • Likewise, when the non-noise audio signal needs to change from the speech state to the music state, i.e., when its category is determined to be a music signal while its previous category was a speech signal, the signal enters the speech-to-music state from the speech state. If the signal is then continuously classified as music for the preset duration threshold, it enters the music state from the speech-to-music state; if the threshold is not reached, it returns to the speech state.
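The four-state logic above can be sketched as a small state machine. The duration threshold value below is illustrative (the source only says it is preset), and the `pending` attribute stands in for the two transition states.

```python
class ClassStateMachine:
    """Sketch of the speech/music state machine with transition states.

    `pending` being set marks the music-to-speech or speech-to-music
    transition state; `threshold` is the preset number of consecutive
    frames the new class must persist before the state switches."""

    def __init__(self, threshold=3, state='speech'):
        self.threshold = threshold
        self.state = state        # 'speech' or 'music'
        self.pending = None       # transition target, if any
        self.count = 0            # frames the new class has persisted

    def update(self, decided):
        if decided == self.state:          # decision matches: leave transition
            self.pending, self.count = None, 0
        elif self.pending == decided:      # still inside a transition state
            self.count += 1
            if self.count >= self.threshold:
                self.state, self.pending, self.count = decided, None, 0
        else:                              # enter a transition state
            self.pending, self.count = decided, 1
        return self.state
```

A single stray "speech" frame in a music stream therefore never switches the output; only a sustained run of the new class does.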
  • In step s202, when the category of the non-noise audio signal is preliminarily decided using the multivariate decision tree, a hyperplane decision tree node that combines multiple feature parameters may be used to preliminarily determine whether the non-noise audio signal is a speech signal or a music signal.
  • the structure of a multivariate decision tree according to an embodiment of the present invention is shown in FIG. 4.
  • This embodiment uses one hyperplane decision tree node: judge whether −0.1032·varflux + 0.4603·varmovflux + 0.1662·varfflux + 0.0973·varmovfflux + 0.9109·stdave + 0.2181·stdaveshort + 0.2824·mov_ratio1 + 0.2688·ratio1 − 0.2851·var_ratio1 − 0.0053·zcr is less than or equal to 1.3641 to complete the preliminary decision. If so, the non-noise audio signal is determined to be a music signal and music_flag is set to 1; otherwise it is determined to be a speech signal and speech_flag is set to 1.
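The hyperplane node quoted above maps directly to a weighted sum and a single comparison; the coefficients and threshold are from the text, while the function name and dictionary interface are illustrative.

```python
def hyperplane_node(p):
    """Sketch of the multivariate (hyperplane) decision node; p maps
    feature names to values. Coefficients and the 1.3641 threshold are
    the ones quoted in the embodiment."""
    score = (-0.1032 * p['varflux'] + 0.4603 * p['varmovflux']
             + 0.1662 * p['varfflux'] + 0.0973 * p['varmovfflux']
             + 0.9109 * p['stdave'] + 0.2181 * p['stdaveshort']
             + 0.2824 * p['mov_ratio1'] + 0.2688 * p['ratio1']
             - 0.2851 * p['var_ratio1'] - 0.0053 * p['zcr'])
    # score <= 1.3641 -> music_flag = 1; otherwise speech_flag = 1.
    return 'music' if score <= 1.3641 else 'speech'
```

Geometrically, the ten features form a point in a ten-dimensional space and the node tests on which side of a fixed hyperplane it lies.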
  • The following describes step s202 when the category of a non-noise audio signal is preliminarily decided using univariate decision trees.
  • A preliminary decision method flow for a non-noise audio signal is shown in FIG. 5. Referring to FIG. 5, the embodiment includes the following steps:
  • Step s501: acquire the music/speech probability of the current non-noise audio signal by using one or more decision tree nodes, each consisting of a feature parameter and a preset parameter threshold corresponding to that feature parameter.
  • In this embodiment two decision trees are used. One is a long-term decision tree that uses a parameter group reflecting long-term features (such as {varmovflux, varmovfflux, stdave, mov_ratio1}); the other is a short-term decision tree that uses a parameter group reflecting short-term features.
  • The structure of the short-term decision tree in this embodiment is shown in FIG. 6, and the structure of the long-term decision tree is shown in FIG. 7.
  • In the short-term decision tree shown in FIG. 6, the first-level tree node is evaluated first: if varflux is less than 1.02311, the left child node is entered, and vice versa. Assuming varflux is less than 1.02311, the next step is to judge whether var_ratio1 is less than 29.1444. If so, a leaf node is reached, i.e., the output music probability is 95.7% and the speech probability is 4.3%; otherwise the right child node is evaluated, and so on.
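The traversal just walked through can be sketched generically. Only the two thresholds and one leaf quoted from FIG. 6 are from the source; the remaining leaves of the miniature tree below are invented placeholders so the sketch is runnable.

```python
# A node is (param, threshold, left, right); a leaf is a
# (music_probability, speech_probability) pair. The 1.02311 and
# 29.1444 thresholds and the (0.957, 0.043) leaf come from the text;
# the other two leaves are illustrative placeholders.
SHORT_TREE = ('varflux', 1.02311,
              ('var_ratio1', 29.1444, (0.957, 0.043), (0.30, 0.70)),
              (0.20, 0.80))

def traverse(node, params):
    """Walk a univariate decision tree: go left when the parameter is
    below the node threshold, right otherwise, until reaching a leaf."""
    while isinstance(node[0], str):
        param, thr, left, right = node
        node = left if params[param] < thr else right
    return node  # (music_probability, speech_probability)
```

Each frame thus costs at most one comparison per tree level, which keeps the preliminary decision cheap.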
  • Step s502: select the larger of the music/speech probabilities obtained by the short-term decision tree and the long-term decision tree as the music/speech probability of the current non-noise audio signal.
  • Step s503: determine, according to the music/speech probability of the current non-noise audio signal and preset probability thresholds, whether the current non-noise audio signal is a speech signal, a music signal, or an uncertain signal.
  • If the music probability (or speech probability) output by the decision tree is greater than a preset first probability threshold (0.8 in this embodiment) and the speech probability (or music probability) is less than or equal to a preset second probability threshold (0.6 in this embodiment), the current non-noise audio signal is determined to be a music signal (or a speech signal);
  • otherwise the current non-noise audio signal is determined to be an uncertain signal, and the flag uncertain is set to 1.
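Steps s502 and s503 combine into a few lines. The thresholds 0.8 and 0.6 are from the embodiment; the function name and the probability-pair interface are assumptions.

```python
def classify_from_probs(short_probs, long_probs, p1=0.8, p2=0.6):
    """Sketch of steps s502-s503: take the larger music/speech
    probability of the two trees, then apply the first (0.8) and
    second (0.6) preset probability thresholds."""
    music_p = max(short_probs[0], long_probs[0])
    speech_p = max(short_probs[1], long_probs[1])
    if music_p > p1 and speech_p <= p2:
        return 'music'
    if speech_p > p1 and music_p <= p2:
        return 'speech'
    return 'uncertain'   # flag uncertain = 1 in the source
```

Requiring one probability to be high and the other low means ambiguous frames are deferred to the frame-count logic of step s504 rather than decided outright.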
  • Step s504: determine whether the current non-noise audio signal is a speech signal or a music signal according to the determination result for the current signal and the number of frames of adjacent consecutive speech signal frames or adjacent consecutive music signal frames.
  • Two global intermediate parameters are preset for each decision: music_Cnt and speech_Cnt.
  • music_Cnt is the number of frames continuously determined to be music in the frames preceding the current frame of the non-noise audio signal; speech_Cnt is the number of frames continuously determined to be speech in the frames preceding the current frame.
  • the present embodiment determines the type of the non-noise audio signal by the characteristic parameter of the non-noise audio signal that does not depend on the encoder algorithm, which enhances the independence and portability of the algorithm.
  • The present invention can be implemented by hardware, or by software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solution of the present invention can be embodied in the form of a software product that can be stored in a non-volatile storage medium and that instructs a computer device (which may be a personal computer, a server, a network device, etc.) to perform the methods described in the various embodiments of the present invention.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Description

Method and apparatus for determining the category of a non-noise audio signal
Technical Field
Embodiments of the present invention relate to the field of wireless communication technologies, and in particular to a method and apparatus for determining the category of a non-noise audio signal. Background Art
With the rapid development of wireless communication technologies, VAD (Voice Activity Detection) has been widely applied. Every VAD method uses multiple feature parameters, most of which come from, or are derived from, parameters produced during the encoder's coding process. For example, GSM (Global System for Mobile communication) defines four speech coder specifications: GSM full rate, GSM enhanced full rate, GSM half rate, and the adaptive multi-rate speech coder. The coding algorithms they are based on all differ, but each contains a VAD module that detects the speech signal within the communication signal. The full-rate, enhanced-full-rate, and half-rate VAD algorithms have relatively low computational complexity; the parameters they use include signal energy, spectral stability information, and pitch information. Signal energy is the main decision basis, but its sensitivity to noise is relatively high; the latter two feature parameters only affect the decision threshold, yet they depend heavily on the coding algorithm, i.e., they have a certain degree of coupling with it.
The ITU (International Telecommunications Union) defined the G.723.1 and G.729 series of coding standards. G.723.1 embeds the VAD module directly in the coding algorithm; the algorithm is relatively simple and its performance is average. G.729 incorporates the VAD function in its Annex B (G.729B for short). The VAD module of G.729B uses a 14-boundary decision technique in a four-dimensional space and smooths the multi-boundary decision results to preserve the long-term stationarity of natural speech, i.e., a decision region in the (four-dimensional) multidimensional space determined by 14 inequalities. The G.729B VAD algorithm uses full-band energy, low-band energy, zero-crossing rate, and line spectral pair parameters together with their running statistics, and has considerable coupling with the coding algorithm.
3GPP (the 3rd Generation Partnership Project) defined the AMR, AMR-WB, and AMR-WB+ coding standards, which also contain VAD modules. Their basic principle is to split the signal into multiple sub-bands, compute sub-band parameters within each sub-band, combine these sub-band parameters over the full band, and finally make the decision over the full band. One difference is that AMR computes 9 sub-band energies of the input signal, while AMR-WB and AMR-WB+ use 12 sub-bands. AMR contains two VAD algorithms with different complexity and performance. The main characteristic of the AMR VAD module is that the signal-to-noise ratio is the core of its background-noise parameter estimation and decision logic; its complexity is low. Its pitch detection, tone detection, and complex-signal analysis modules all use parameters from the encoder's own open-loop pitch analysis module, so the coupling with the encoder algorithm is tight.
In the course of implementing the present invention, the inventors found at least the following problem in the prior art: the feature parameters used by the VAD modules contained in the algorithms of existing speech coding standards are tightly coupled to the encoder algorithms, which harms the independence and portability of the algorithm.
Summary of the Invention
Embodiments of the present invention provide a method and apparatus for determining the category of a non-noise audio signal, so that the feature parameters used do not depend on an encoder algorithm, enhancing the independence and portability of the algorithm.
To achieve the above objective, a technical solution of an embodiment of the present invention provides a method for determining the category of a non-noise audio signal, including: acquiring feature parameters of the non-noise audio signal; preliminarily deciding the category of the non-noise audio signal with a decision tree according to the feature parameters; and determining the category of the non-noise audio signal according to the context of the non-noise audio signal and the result of the preliminary decision.
A technical solution of an embodiment of the present invention further provides an apparatus for determining the category of a non-noise audio signal, including: a feature parameter acquiring unit, configured to acquire feature parameters of a non-noise audio signal; a first decision unit, configured to preliminarily decide the category of the non-noise audio signal with a decision tree according to the feature parameters acquired by the feature parameter acquiring unit; and a second decision unit, configured to determine the category of the non-noise audio signal according to the context of the non-noise audio signal and the result of the preliminary decision of the first decision unit. Embodiments of the present invention decide the category of the non-noise audio signal from feature parameters that do not depend on an encoder algorithm, which enhances the independence and portability of the algorithm. Brief Description of the Drawings
FIG. 1 is a structural diagram of an apparatus for determining the category of a non-noise audio signal according to an embodiment of the present invention;
FIG. 2 is a flowchart of a method for determining the category of a non-noise audio signal according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the state transitions of a non-noise audio signal according to an embodiment of the present invention; FIG. 4 is a structural diagram of a multivariate decision tree according to an embodiment of the present invention;
FIG. 5 is a flowchart of a preliminary decision method for a non-noise audio signal according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of a short-term decision tree according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of a long-term decision tree according to an embodiment of the present invention. Detailed Description
Specific embodiments of the present invention are described in further detail below with reference to the drawings and embodiments.
As shown in FIG. 1, an apparatus for determining the category of a non-noise audio signal according to an embodiment of the present invention includes a feature parameter acquiring unit 11, a first decision unit 12, a second decision unit 13, and a state transition unit 14. The first decision unit 12 is connected to the feature parameter acquiring unit 11 and to the second decision unit 13; the second decision unit 13 is connected to the state transition unit 14.
The feature parameter acquiring unit 11 acquires the feature parameters of the non-noise audio signal; the first decision unit 12 preliminarily decides the category of the non-noise audio signal with a decision tree according to the feature parameters acquired by the feature parameter acquiring unit 11; the second decision unit 13 determines the category of the non-noise audio signal according to the signal's context and the result of the preliminary decision of the first decision unit 12; the state transition unit 14 inserts a transition state between the speech state and the music state of the non-noise audio signal.
The state transition unit 14 includes a state transition judging subunit 141, a duration judging subunit 142, and a converting subunit 143; the converting subunit 143 is connected to the state transition judging subunit 141 and to the duration judging subunit 142.
The state transition judging subunit 141 judges, from the category determined by the second decision unit 13 and the signal's previous category, whether the state of the non-noise audio signal changes; the duration judging subunit 142 judges whether the signal has been continuously classified as the same category for a preset duration threshold; the converting subunit 143 switches between the signal's state and the transition states according to the judgments of the subunits 141 and 142.
The feature parameters of the non-noise audio signal acquired by the feature parameter acquiring unit 11 include at least one of the following: normalized inter-frame spectral fluctuation flux; variance of the normalized inter-frame spectral fluctuation, varflux; sliding average of that variance, varmovflux; normalized band spectral fluctuation fflux; variance of the normalized band spectral fluctuation, varfflux; sliding average of that variance, varmovfflux; normalized sub-band energy standard deviation stdave; energy ratio ratio1; long-term average of the energy ratio, mov_ratio1; variance of the energy ratio, var_ratio1; time-domain zero-crossing rate zcr; harmonic structure stability feature hss.
The feature parameters of the non-noise audio signal are described below:
1. Normalized inter-frame spectral fluctuation flux and its derived parameters: the variance varflux and the sliding average of the variance, varmovflux.
flux describes the spectral change between consecutive frames of the non-noise audio signal. The flux of a music signal is relatively low and stable; the flux of a speech signal is usually high and varies strongly. flux is computed from SigFpw, the spectral amplitude signal obtained by an FFT of the time-domain non-noise audio signal, over the band bounded by FLUX_F1 and FLUX_F2, normalized by norm. One example in the 16 kHz sampling mode is FLUX_F1 = 3, FLUX_F2 = 95; one example in the 8 kHz sampling mode is FLUX_F1 = 1, FLUX_F2 = 470.
norm is a normalization function; one special case is norm = max(ave_amp, AVE_E_FLUX), where ave_amp is the average spectral amplitude of the current frame and the preceding consecutive frames, and AVE_E_FLUX guards against a very small denominator; one example is AVE_E_FLUX = 1000.
2. Normalized band spectral fluctuation fflux and its derived parameters: the variance varfflux and the sliding average of the variance, varmovfflux.
fflux describes the spectral variation between sub-bands within the same frame of the non-noise audio signal. The fflux of a music signal is relatively low and stable; the fflux of a speech signal is usually high and varies strongly. fflux is computed from the FFT spectral amplitude signal SigFpw over the band bounded by FFLUX_F1, normalized by norm. One example in the 16 kHz sampling mode is FFLUX_F1 = 63; one example in the 8 kHz sampling mode is FFLUX_F1 = 32.
norm is a normalization function; one special case is norm = max(ave_amp, AVE_E_FLUX), where ave_amp is the average spectral amplitude of the current frame and the preceding consecutive frames, and AVE_E_FLUX guards against a very small denominator; one example is AVE_E_FLUX = 1000.
3. Normalized sub-band energy standard deviation stdave.
stdave averages the normalized standard deviations of the sub-band energies over several consecutive frames. In its formula, i is the sub-band index; j is the frame index; Tlen is the number of consecutive frames (in the example, Tlen = 4 consecutive frames extracts a short-term feature and Tlen = 16 consecutive frames a long-term feature); Bcnt is the number of sub-bands in the frequency domain; lev(i, j) is the energy of sub-band i in frame j, computed from the spectrum, where Bi denotes the band boundary of the i-th sub-band.
4. Energy ratio ratio1, the long-term average of the energy ratio, mov_ratio1, and the variance of the energy ratio, var_ratio1.
ratio1 is the ratio of low-band energy to full-band energy. The ratio1 of a speech signal is usually large and varies strongly; the ratio1 of most music signals is usually small, with large variation. ratio1 is computed from the spectrum over the low band [R1_F1, R1_F2], where R1_F1 and R1_F2 are band boundaries satisfying 0 ≤ R1_F1 < R1_F2. 5. Time-domain zero-crossing rate zcr.
Because unvoiced segments appear at intervals in the speech portions of a non-noise audio signal, speech shows a higher zcr than music. zcr is computed as zcr = Σ_i I{x(i) · x(i−1) < 0}, where the indicator I{A} is 1 when A is true and 0 when A is false.
6. The harmonic structure stability feature hss.

For most music signals, the stability of the harmonic structure is significantly higher than for speech. In the prior art, computing this feature parameter requires estimating the harmonic structure of the signal, which is highly complex. One method of obtaining the harmonic structure stability feature hss according to an embodiment of the present invention is as follows:

First, the local peaks of the signal are obtained from the monotonically increasing intervals and monotonically decreasing intervals of the FFT spectral signal. The embodiment of the present invention treats the FFT spectral signal as a discrete multi-peak function and searches for the monotonically increasing and decreasing intervals of this function, thereby obtaining the local peaks and the global peak; this algorithm needs only a single pass over the frequency bins and requires no iteration.

Then, several normalized log peaks of the signal are obtained from the largest local peaks. Taking the A largest local peaks as an example, the embodiment finds the A largest local peaks [P_1, P_2, …, P_A] and the positions at which each of them occurs, then normalizes them and takes the logarithm to obtain the normalized log peaks [L_1, L_2, …, L_A] of the signal. The normalized log peaks reflect an estimate of the harmonic structure of the signal; this parameter is computed by the formula

L_j = log(P_j) − log( Σ_{k=1}^{A} P_k )    (j = 1, 2, …, A).

Finally, the average variance of the normalized log peaks of the signal is obtained from these normalized log peaks. In this embodiment, the average variance VLP of the normalized log peaks of the signal is computed by the formula
VLP_i = (1 / (A·N)) Σ_{j=1}^{A} Σ_{k=i−N+1}^{i} ( L_j^{(k)} − ALP_j )²

where L_j^{(k)} denotes the j-th normalized log peak in frame k, and ALP_j is the mean of the j-th normalized log peak over the N frames, computed by the formula

ALP_j = (1/N) Σ_{k=i−N+1}^{i} L_j^{(k)}.

To reduce complexity, ALP may also be replaced by a moving average.
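The single-pass peak search and the normalized log peaks can be sketched as follows. This is an illustrative reading only: the plateau handling and the default A = 5 are assumptions not fixed by the text.

```python
import math

def local_peaks(spectrum):
    """One left-to-right pass over the FFT magnitudes: a bin that ends a
    monotonically increasing run and starts a decreasing one is a local
    peak. Returns (value, position) pairs; no iteration is needed."""
    peaks = []
    for i in range(1, len(spectrum) - 1):
        if spectrum[i - 1] < spectrum[i] >= spectrum[i + 1]:
            peaks.append((spectrum[i], i))
    return peaks

def normalized_log_peaks(spectrum, a=5):
    """Take the a largest local peaks P_1..P_a and return
    L_j = log(P_j) - log(sum_k P_k), as in the formula above."""
    top = sorted(local_peaks(spectrum), reverse=True)[:a]
    total = sum(p for p, _ in top)
    return [math.log(p) - math.log(total) for p, _ in top]
```

Tracking the variance of these values over N frames (the VLP quantity above) then measures how stable the harmonic structure is: for music the peaks stay put, so the variance is small.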
Since the feature parameters of the non-noise audio signal are not obtained in the course of an encoding algorithm, their extraction does not depend on any encoder; moreover, the extraction of the feature parameters does not depend on the bandwidth, so that the GSAD does not depend on the sampling rate of the signal, which enhances the portability of the system.
By determining the class of the non-noise audio signal from feature parameters that do not depend on an encoder algorithm, this embodiment enhances the independence and portability of the algorithm.
The flow of a method for determining the class of a non-noise audio signal according to an embodiment of the present invention is shown in Fig. 2 and includes the following steps:
Step s201: obtain the feature parameters of the non-noise audio signal. In the embodiment of the present invention, the feature parameters of the non-noise audio signal include at least one of the following: the normalized inter-frame spectral fluctuation flux; the variance of the normalized inter-frame spectral fluctuation, varflux; the moving average of the variance of the normalized inter-frame spectral fluctuation, varmovflux; the normalized inter-band spectral fluctuation fflux; the variance of the normalized inter-band spectral fluctuation, varfflux; the moving average of the variance of the normalized inter-band spectral fluctuation, varmovfflux; the normalized sub-band energy standard deviation stdave; the energy ratio ratio1; the long-term average of the energy ratio, mov_ratio1; the variance of the energy ratio, var_ratio1; the time-domain zero-crossing rate zcr; the harmonic structure stability feature hss.
Step s202: according to the obtained feature parameters, make a preliminary decision on the class of the non-noise audio signal by means of a decision tree.
The decision tree in the embodiment of the present invention may be a multivariate decision tree or a univariate decision tree; when univariate decision trees are used, several univariate decision trees may be employed, and these may include a short-term decision tree and a long-term decision tree.
Step s203: determine the class of the non-noise audio signal according to the context of the non-noise audio signal and the result of the preliminary decision.
In this embodiment, the class of the non-noise audio signal is determined as follows. A hangover protection value Ho is set for each obtained feature parameter of the non-noise audio signal; Ho is a fixed value (50 in this embodiment) which is initialized when the decision on the class of the non-noise audio signal begins and is decremented by 1 whenever it is greater than 0. If the hangover protection value Ho of any protected parameter is greater than 0, the music feature feature_mu or the speech feature feature_sp is set to 1. According to the hangover protection values and the result of the preliminary decision, the non-noise audio signal is judged to be a speech signal, a music signal or an uncertain signal. If the result of the decision in step s102 is a music signal while feature_mu is 0 and feature_sp is 1, or the result of the decision in step s102 is speech while feature_sp is 0 and feature_mu is 1, the uncertainty flag uncertainflg is set to 3.
After the hangover protection, intermediate parameters may also be updated. For example, when the two intermediate parameters, the music count music_Cnt and the speech count speech_Cnt, are updated: if, after hangover protection, the decision is a speech signal or an uncertain signal, speech_Cnt is incremented by 1, music_Cnt is set to 0, and speech_music_flg1 is set to 1; if, after hangover protection, the decision is a music signal, music_Cnt is incremented by 1, speech_Cnt is set to 0, and speech_music_flg1 is set to 0. Here speech_music_flg1 is used to decide the frames left uncertain by the hangover-protection decision (frames whose uncertainflg is not 0); the variable speech_music_flg2 stores the speech_music_flg1 of the previous frame. When the result of the hangover-protection decision is an uncertain signal, the uncertain frame is decided to be a speech signal if speech_music_flg2 is 1, and a music signal otherwise.
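The counter update and uncertain-frame resolution described above can be sketched as follows; the class names and the string labels are illustrative assumptions, not names from the embodiment:

```python
HO = 50  # hangover protection value from the text

class HangoverState:
    """Intermediate parameters carried between frames."""
    def __init__(self):
        self.music_cnt = 0
        self.speech_cnt = 0
        self.speech_music_flg1 = 0
        self.speech_music_flg2 = 0  # flg1 of the previous frame

def update_counters(state, decision):
    """Update music_Cnt / speech_Cnt after the hangover decision:
    'speech' or 'uncertain' bumps speech_Cnt and resets music_Cnt,
    'music' does the opposite, mirroring the text."""
    if decision in ("speech", "uncertain"):
        state.speech_cnt += 1
        state.music_cnt = 0
        state.speech_music_flg1 = 1
    else:  # music
        state.music_cnt += 1
        state.speech_cnt = 0
        state.speech_music_flg1 = 0

def resolve_uncertain(state):
    """An uncertain frame is decided by the previous frame's flag:
    speech_music_flg2 == 1 -> speech, otherwise music."""
    return "speech" if state.speech_music_flg2 == 1 else "music"
```

The flag pair effectively carries the previous frame's decision forward so that an isolated uncertain frame inherits its neighbour's class.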
Step s204: insert transition states between the speech-state and music-state switches of the non-noise audio signal. A state-transition diagram of a non-noise audio signal according to an embodiment of the present invention is shown in Fig. 3; the non-noise audio signal has four states: a speech state, a music state, a speech-to-music state and a music-to-speech state, where the speech-to-music state and the music-to-speech state are transition states.
When the non-noise audio signal needs to switch from the music state to the speech state, i.e. when the class of the non-noise audio signal is determined to be a speech signal while the previous class of the non-noise audio signal was a music signal, the non-noise audio signal passes from the music state into the music-to-speech state; when the time for which the class of the non-noise audio signal is continuously determined to be a speech signal reaches the preset duration threshold, the non-noise audio signal passes from the music-to-speech state into the speech state.

When the non-noise audio signal needs to switch from the speech state to the music state, i.e. when the class of the non-noise audio signal is determined to be a music signal while the previous class of the non-noise audio signal was a speech signal, the non-noise audio signal passes from the speech state into the speech-to-music state; when the time for which the class of the non-noise audio signal is continuously determined to be a music signal reaches the preset duration threshold, the non-noise audio signal passes from the speech-to-music state into the music state.

When the non-noise audio signal needs to switch from the music state to the speech state, i.e. when the class of the non-noise audio signal is determined to be a speech signal while the previous class of the non-noise audio signal was a music signal, the non-noise audio signal passes from the music state into the music-to-speech state; when the time for which the class of the non-noise audio signal is continuously determined to be a speech signal does not reach the preset duration threshold, the non-noise audio signal passes from the music-to-speech state back into the music state.

When the non-noise audio signal needs to switch from the speech state to the music state, i.e. when the class of the non-noise audio signal is determined to be a music signal while the previous class of the non-noise audio signal was a speech signal, the non-noise audio signal passes from the speech state into the speech-to-music state; when the time for which the class of the non-noise audio signal is continuously determined to be a music signal does not reach the preset duration threshold, the non-noise audio signal passes from the speech-to-music state back into the speech state.
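The four-state transition logic described above can be sketched as a small finite-state machine. The duration threshold is not given a value in this passage, so it is left as a parameter, and the per-frame boolean input is an assumption about how the classifier output is fed in:

```python
SPEECH, MUSIC, SPEECH_TO_MUSIC, MUSIC_TO_SPEECH = range(4)

class TransitionFSM:
    """Sketch of the speech/music state machine with transition states."""

    def __init__(self, duration_threshold):
        self.state = SPEECH
        self.thresh = duration_threshold
        self.count = 0  # consecutive frames of the opposite class

    def step(self, is_music):
        if self.state == SPEECH:
            if is_music:
                self.state, self.count = SPEECH_TO_MUSIC, 1
        elif self.state == MUSIC:
            if not is_music:
                self.state, self.count = MUSIC_TO_SPEECH, 1
        elif self.state == SPEECH_TO_MUSIC:
            if is_music:
                self.count += 1
                if self.count >= self.thresh:
                    self.state = MUSIC       # held long enough: commit
            else:
                self.state = SPEECH          # fell short: return to speech
        else:  # MUSIC_TO_SPEECH
            if not is_music:
                self.count += 1
                if self.count >= self.thresh:
                    self.state = SPEECH
            else:
                self.state = MUSIC
        return self.state
```

The transition states act as a debounce: a brief misclassification never flips the committed state, which avoids rapid switching between speech-mode and music-mode processing.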
In step s202, when a multivariate decision tree is used for the preliminary decision on the class of the non-noise audio signal, a hyperplane decision-tree node involving several feature parameters may be used to preliminarily decide whether the non-noise audio signal is a speech signal or a music signal. The structure of a multivariate decision tree according to an embodiment of the present invention is shown in Fig. 4. This embodiment uses a single hyperplane decision-tree node: it checks whether −0.1032·varflux + 0.4603·varmovflux + 0.1662·varfflux + 0.0973·varmovfflux + 0.9109·stdave + 0.2181·stdaveshort + 0.2824·mov_ratio1 + 0.2688·ratio1 − 0.2851·var_ratio1 − 0.0053·zcr is less than or equal to 1.3641 to complete the preliminary decision. If so, the non-noise audio signal is judged to be a music signal and music_flag is set to 1; otherwise it is judged to be a speech signal and speech_flag is set to 1.
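The hyperplane node quoted above is a single weighted-sum test; a minimal sketch, using exactly the coefficients and threshold given in the text (the dictionary layout and function name are illustrative):

```python
# Coefficients and threshold of the hyperplane node quoted in the text.
WEIGHTS = {
    "varflux": -0.1032, "varmovflux": 0.4603, "varfflux": 0.1662,
    "varmovfflux": 0.0973, "stdave": 0.9109, "stdaveshort": 0.2181,
    "mov_ratio1": 0.2824, "ratio1": 0.2688, "var_ratio1": -0.2851,
    "zcr": -0.0053,
}
THRESHOLD = 1.3641

def hyperplane_decision(features):
    """Preliminary decision at the multivariate (hyperplane) node:
    weighted sum <= threshold -> music, otherwise speech."""
    score = sum(WEIGHTS[name] * features[name] for name in WEIGHTS)
    return "music" if score <= THRESHOLD else "speech"
```

A multivariate node of this kind splits the feature space with one hyperplane, where a univariate tree would need a cascade of axis-aligned tests.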
In step s202, when a univariate decision tree is used for the preliminary decision on the class of the non-noise audio signal, the flow of a preliminary decision method for a non-noise audio signal according to an embodiment of the present invention is as shown in Fig. 5. Referring to Fig. 5, this embodiment includes the following steps:
Step s501: obtain the music/speech probability of the current non-noise audio signal by means of one or more decision-tree nodes, each involving one feature parameter and a preset parameter threshold corresponding to that feature parameter. This embodiment uses two decision trees: one is a long-term decision tree using a group of parameters reflecting long-term characteristics (such as {varmovflux, varmovfflux, stdAve, mov_ratio1}), and the other is a short-term decision tree using a group of parameters reflecting short-term characteristics (such as {varflux, varfflux, stdAveshort, ratio1, var_ratio1, zcr}). The structure of the short-term decision tree of this embodiment is shown in Fig. 6, and that of the long-term decision tree in Fig. 7. Taking the short-term decision tree of Fig. 6 as an example, the first-level tree node is evaluated first: if varflux is less than 1.02311, the left child node is entered; otherwise the right child node is entered. Assuming varflux is less than 1.02311, the next test is whether var_ratio1 is less than 29.1444; if so, a leaf node is reached which outputs a music probability of 95.7% and a speech probability of 4.3%; otherwise its right child node is tested, and so on, until the music/speech probability of the current non-noise audio signal in the short-term decision tree is obtained. The process of obtaining the music/speech probability of the current non-noise audio signal in the long-term decision tree of Fig. 7 is similar to that of the short-term decision tree of Fig. 6.
Step s502: select the largest of the music/speech probabilities obtained by the short-term decision tree and the long-term decision tree as the music/speech probability of the current non-noise audio signal.
Step s503: judge the current non-noise audio signal to be a speech signal, a music signal or an uncertain signal according to its music/speech probability and preset probability thresholds. In this embodiment, if the music probability (or speech probability) output by the decision tree is greater than a preset first probability threshold (0.8 in this embodiment) and the speech probability (or music probability) is less than or equal to a preset second probability threshold (0.6 in this embodiment), the current non-noise audio signal is judged to be a music signal (or speech signal); otherwise it is judged to be an uncertain signal and the uncertainty flag uncertain is set to 1.
Step s504: judge the current non-noise audio signal to be a speech signal or a music signal according to the result of the judgment on the current non-noise audio signal and the number of adjacent consecutive speech-signal frames or adjacent consecutive music-signal frames. In this embodiment, two global intermediate parameters are preset for each decision: music_Cnt and speech_Cnt. music_Cnt is the number of frames, among the frames preceding the current frame of the non-noise audio signal, that were consecutively decided to be music signals; speech_Cnt is the number of frames, among the frames preceding the current frame, that were consecutively decided to be speech signals. For a frame whose uncertainty flag uncertain is 1: if speech_Cnt is greater than 1, the current frame is decided to be a speech-signal frame; if music_Cnt is greater than 10, the current frame is decided to be a music-signal frame. Decisions already made in step s503 are not changed.
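Steps s502 to s504 can be sketched as one fusion function. The probability gates (0.8 and 0.6) and the counter gates (speech_Cnt > 1, music_Cnt > 10) come from the text; the way "largest probability" is resolved across trees and the string labels are illustrative assumptions:

```python
def fuse_tree_outputs(tree_probs, music_cnt, speech_cnt, p1=0.8, p2=0.6):
    """Pick the tree output with the largest class probability, apply the
    probability gates (first threshold 0.8, second 0.6), and fall back to
    the context counters for uncertain frames.

    tree_probs: list of (music_prob, speech_prob) pairs, one per tree.
    """
    music_p, speech_p = max(tree_probs, key=lambda mp: max(mp))
    if music_p > p1 and speech_p <= p2:
        return "music"
    if speech_p > p1 and music_p <= p2:
        return "speech"
    # uncertain frame: decide from the recent context
    if speech_cnt > 1:
        return "speech"
    if music_cnt > 10:
        return "music"
    return "uncertain"
```

The asymmetric counter gates reflect the text's bias: a couple of preceding speech frames already pull an uncertain frame to speech, while music requires a longer consistent run.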
By determining the class of the non-noise audio signal from feature parameters that do not depend on an encoder algorithm, this embodiment enhances the independence and portability of the algorithm.
From the description of the above embodiments, those skilled in the art will clearly understand that the present invention may be implemented in hardware, or by software together with a necessary general-purpose hardware platform. Based on this understanding, the technical solution of the present invention may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (such as a CD-ROM, a USB flash drive or a removable hard disk) and includes a number of instructions for causing a computer device (such as a personal computer, a server or a network device) to perform the methods described in the embodiments of the present invention.
In summary, the above are merely preferred embodiments of the present invention and are not intended to limit the scope of protection of the present invention. Any modification, equivalent replacement, improvement and the like made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims

1. A method for determining the class of a non-noise audio signal, characterized by comprising:

obtaining feature parameters of the non-noise audio signal;

making a preliminary decision on the class of the non-noise audio signal by means of a decision tree according to the feature parameters; and

determining the class of the non-noise audio signal according to the context of the non-noise audio signal and the result of the preliminary decision.
2. The method for determining the class of a non-noise audio signal according to claim 1, characterized in that the feature parameters include at least one of the following:

the normalized inter-frame spectral fluctuation flux; the variance of the normalized inter-frame spectral fluctuation, varflux; the moving average of the variance of the normalized inter-frame spectral fluctuation, varmovflux; the normalized inter-band spectral fluctuation fflux; the variance of the normalized inter-band spectral fluctuation, varfflux; the moving average of the variance of the normalized inter-band spectral fluctuation, varmovfflux; the normalized sub-band energy standard deviation stdave; the energy ratio ratio1; the long-term average of the energy ratio, mov_ratio1; the variance of the energy ratio, var_ratio1; the time-domain zero-crossing rate zcr; the harmonic structure stability feature hss.
3. The method for determining the class of a non-noise audio signal according to claim 1, characterized in that, when the decision tree is a multivariate decision tree, making the preliminary decision on the class of the non-noise audio signal by means of the decision tree specifically comprises: preliminarily deciding the non-noise audio signal to be a speech signal or a music signal by means of a hyperplane decision-tree node involving several feature parameters.
4. The method for determining the class of a non-noise audio signal according to claim 1, characterized in that, when the decision tree is a univariate decision tree, making the preliminary decision on the class of the non-noise audio signal by means of the decision tree specifically comprises:

obtaining the music/speech probability of the current non-noise audio signal by means of one or more decision-tree nodes, each involving one feature parameter and a preset parameter threshold corresponding to that feature parameter;

judging the current non-noise audio signal to be a speech signal, a music signal or an uncertain signal according to the music/speech probability of the current non-noise audio signal and preset probability thresholds; and

judging the current non-noise audio signal to be a speech signal or a music signal according to the result of the judgment on the current non-noise audio signal and the number of adjacent consecutive speech-signal frames or adjacent consecutive music-signal frames.
5. The method for determining the class of a non-noise audio signal according to claim 4, characterized in that, when several univariate decision trees are used to obtain the music/speech probability of the current non-noise audio signal, the method further comprises, after each univariate decision tree has obtained a music/speech probability: selecting the largest of the music/speech probabilities obtained by the several univariate decision trees as the music/speech probability of the current non-noise audio signal.
6. The method for determining the class of a non-noise audio signal according to claim 5, characterized in that the several univariate decision trees include a short-term decision tree and a long-term decision tree.
7. The method for determining the class of a non-noise audio signal according to claim 1, characterized in that determining the class of the non-noise audio signal according to the context of the non-noise audio signal and the result of the preliminary decision specifically comprises:

judging the non-noise audio signal to be a speech signal or a music signal according to hangover protection values and the result of the preliminary decision.
8. The method for determining the class of a non-noise audio signal according to claim 1, characterized by further comprising, after determining the class of the non-noise audio signal: inserting transition states between the speech-state and music-state switches of the non-noise audio signal.
9. The method for determining the class of a non-noise audio signal according to claim 8, characterized in that inserting transition states between the speech-state and music-state switches of the non-noise audio signal specifically comprises:

when the class of the non-noise audio signal is determined to be a speech signal and the previous class of the non-noise audio signal was a music signal, the non-noise audio signal passes from the music state into a transition state;

when the time for which the class of the non-noise audio signal is continuously determined to be a speech signal reaches a preset duration threshold, the non-noise audio signal passes from the transition state into the speech state; and

when the class of the non-noise audio signal is determined to be a music signal and the previous class of the non-noise audio signal was a speech signal, the non-noise audio signal passes from the speech state into a transition state;

when the time for which the class of the non-noise audio signal is continuously determined to be a music signal reaches the preset duration threshold, the non-noise audio signal passes from the transition state into the music state.
10. The method for determining the class of a non-noise audio signal according to claim 9, characterized in that inserting transition states between the speech-state and music-state switches of the non-noise audio signal specifically comprises:

when the class of the non-noise audio signal is determined to be a speech signal and the previous class of the non-noise audio signal was a music signal, the non-noise audio signal passes from the music state into a transition state;

when the time for which the class of the non-noise audio signal is continuously determined to be a speech signal does not reach the preset duration threshold, the non-noise audio signal passes from the transition state into the music state; and

when the class of the non-noise audio signal is determined to be a music signal and the previous class of the non-noise audio signal was a speech signal, the non-noise audio signal passes from the speech state into a transition state;

when the time for which the class of the non-noise audio signal is continuously determined to be a music signal does not reach the preset duration threshold, the non-noise audio signal passes from the transition state into the speech state.
11. The method for determining the class of a non-noise audio signal according to claim 2, characterized in that obtaining the harmonic structure stability feature hss comprises the following steps:

obtaining the local peaks of the signal according to the monotonically increasing intervals and monotonically decreasing intervals of the FFT spectral signal;

obtaining several normalized log peaks of the signal according to the several largest local peaks; and

obtaining the average variance of the normalized log peaks of the signal according to the several normalized log peaks of the signal.
12. An apparatus for determining the class of a non-noise audio signal, characterized by comprising:

a feature parameter obtaining unit, configured to obtain feature parameters of the non-noise audio signal;

a first decision unit, configured to make a preliminary decision on the class of the non-noise audio signal by means of a decision tree according to the feature parameters obtained by the feature parameter obtaining unit; and

a second decision unit, configured to determine the class of the non-noise audio signal according to the context of the non-noise audio signal and the result of the preliminary decision of the first decision unit.
13. The apparatus for determining the class of a non-noise audio signal according to claim 12, characterized by further comprising a state transition unit, configured to insert transition states between the speech-state and music-state switches of the non-noise audio signal.
14. The apparatus for determining the class of a non-noise audio signal according to claim 13, characterized in that the state transition unit comprises:

a state-transition judging sub-unit, configured to judge whether the state of the non-noise audio signal transitions, according to the class of the non-noise audio signal determined by the second decision unit and the previous class of the non-noise audio signal;

a duration judging sub-unit, configured to judge whether the time for which the class of the non-noise audio signal is continuously determined to be the same type reaches a preset duration threshold; and

a switching sub-unit, configured to switch between the state of the non-noise audio signal and a transition state according to the judgment result of the state-transition judging sub-unit or the judgment result of the duration judging sub-unit.
15. The apparatus for determining the class of a non-noise audio signal according to any one of claims 12 to 14, characterized in that the feature parameters include at least one of the following:

the normalized inter-frame spectral fluctuation flux; the variance of the normalized inter-frame spectral fluctuation, varflux; the moving average of the variance of the normalized inter-frame spectral fluctuation, varmovflux; the normalized inter-band spectral fluctuation fflux; the variance of the normalized inter-band spectral fluctuation, varfflux; the moving average of the variance of the normalized inter-band spectral fluctuation, varmovfflux; the normalized sub-band energy standard deviation stdave; the energy ratio ratio1; the long-term average of the energy ratio, mov_ratio1; the variance of the energy ratio, var_ratio1; the time-domain zero-crossing rate zcr; the harmonic structure stability feature hss.
PCT/CN2008/072455 2007-09-30 2008-09-23 Procédé et appareil de détermination du type d'un signal audio non-bruit WO2009046658A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN200710149984.X 2007-09-30
CN 200710149984 CN101399039B (zh) 2007-09-30 2007-09-30 一种确定非噪声音频信号类别的方法及装置

Publications (1)

Publication Number Publication Date
WO2009046658A1 true WO2009046658A1 (fr) 2009-04-16

Family

ID=40517544

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2008/072455 WO2009046658A1 (fr) 2007-09-30 2008-09-23 Procédé et appareil de détermination du type d'un signal audio non-bruit

Country Status (2)

Country Link
CN (1) CN101399039B (zh)
WO (1) WO2009046658A1 (zh)


Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9111531B2 (en) * 2012-01-13 2015-08-18 Qualcomm Incorporated Multiple coding mode signal classification
EP3933836A1 (en) * 2012-11-13 2022-01-05 Samsung Electronics Co., Ltd. Method and apparatus for determining encoding mode, method and apparatus for encoding audio signals, and method and apparatus for decoding audio signals
CN104091599B (zh) * 2013-07-18 2016-06-29 腾讯科技(深圳)有限公司 一种音频文件的处理方法及装置
DK3379535T3 (da) * 2014-05-08 2019-12-16 Ericsson Telefon Ab L M Audiosignalklassifikator
CN104464722B (zh) * 2014-11-13 2018-05-25 北京云知声信息技术有限公司 基于时域和频域的语音活性检测方法和设备
CN106328169B (zh) 2015-06-26 2018-12-11 中兴通讯股份有限公司 一种激活音修正帧数的获取方法、激活音检测方法和装置
CN107564512B (zh) * 2016-06-30 2020-12-25 展讯通信(上海)有限公司 语音活动侦测方法及装置
EP3803861B1 (en) * 2019-08-27 2022-01-19 Dolby Laboratories Licensing Corporation Dialog enhancement using adaptive smoothing
CN110970050B (zh) * 2019-12-20 2022-07-15 北京声智科技有限公司 语音降噪方法、装置、设备及介质
CN112017639B (zh) * 2020-09-10 2023-11-07 歌尔科技有限公司 语音信号的检测方法、终端设备及存储介质
CN113238206B (zh) * 2021-04-21 2022-02-22 中国科学院声学研究所 一种基于判决统计量设计的信号检测方法及系统

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1920947A (zh) * 2006-09-15 2007-02-28 清华大学 用于低比特率音频编码的语音/音乐检测器
CN101256772A (zh) * 2007-03-02 2008-09-03 华为技术有限公司 确定非噪声音频信号归属类别的方法和装置

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1175398C (zh) * 2000-11-18 2004-11-10 中兴通讯股份有限公司 一种从噪声环境中识别出语音和音乐的声音活动检测方法
CN100505040C (zh) * 2005-07-26 2009-06-24 浙江大学 基于决策树和说话人改变检测的音频分割方法

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1920947A (zh) * 2006-09-15 2007-02-28 清华大学 用于低比特率音频编码的语音/音乐检测器
CN101256772A (zh) * 2007-03-02 2008-09-03 华为技术有限公司 确定非噪声音频信号归属类别的方法和装置

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"ITU-T G.729: Coding of speech at 8 kbit/s using conjugate-structure algebraic-code-excited linear prediction(CS-ACELP)", TELECOMMUNICATION STANDARDIZATION SECTOR OF ITU, January 2007 (2007-01-01) *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9548713B2 (en) 2013-03-26 2017-01-17 Dolby Laboratories Licensing Corporation Volume leveler controller and controlling method
US9923536B2 (en) 2013-03-26 2018-03-20 Dolby Laboratories Licensing Corporation Volume leveler controller and controlling method
US10411669B2 (en) 2013-03-26 2019-09-10 Dolby Laboratories Licensing Corporation Volume leveler controller and controlling method
US10707824B2 (en) 2013-03-26 2020-07-07 Dolby Laboratories Licensing Corporation Volume leveler controller and controlling method
US11218126B2 (en) 2013-03-26 2022-01-04 Dolby Laboratories Licensing Corporation Volume leveler controller and controlling method
US11711062B2 (en) 2013-03-26 2023-07-25 Dolby Laboratories Licensing Corporation Volume leveler controller and controlling method
CN113539262A (zh) * 2021-07-09 2021-10-22 广东金鸿星智能科技有限公司 一种用于电动门语音控制的声音增强及收录方法和系统
CN113539262B (zh) * 2021-07-09 2023-08-22 广东金鸿星智能科技有限公司 一种用于电动门语音控制的声音增强及收录方法和系统

Also Published As

Publication number Publication date
CN101399039B (zh) 2011-05-11
CN101399039A (zh) 2009-04-01


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 08800944

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 08800944

Country of ref document: EP

Kind code of ref document: A1