CN1623186A - Voice activity detector and validator for noisy environments - Google Patents

Voice activity detector and validator for noisy environments

Publication number
CN1623186A
Authority
CN
China
Prior art keywords
speech
input
frame
energy
communication unit
Prior art date
Application number
CN 03802682
Other languages
Chinese (zh)
Other versions
CN1307613C (en
Inventor
Douglas Ralph Ealey
Holly Louise Kelleher
David John Benjamin Pearce
Original Assignee
Motorola Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority to GB0201585A (GB2384670B)
Application filed by Motorola Inc.
Publication of CN1623186A
Application granted
Publication of CN1307613C

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals
    • G10L2025/783 Detection of presence or absence of voice signals based on threshold decision
    • G10L2025/786 Adaptive threshold

Abstract

A communication unit (100) includes an audio processing unit (109) having a voice activity detection mechanism (130, 135). The voice activity detection mechanism (130, 135) measures the energy acceleration of a signal input to the communication unit (100) and determines, based on that measurement, whether the input signal is speech or noise. A method of detecting voice and a method of deciding whether an input signal is speech or noise are also described. Using a voice activity detector and validator based on energy acceleration, particularly in noisy environments, provides the advantages of noise robustness, fast response and independence from the input speech level.

Description

Voice activity detector and validator for noisy environments

Technical Field

The present invention relates to the detection of speech in noisy environments, commonly referred to as voice activity detection (VAD). The invention is applicable to, but not limited to, energy acceleration measurement of a voice signal in a speech detection system.

Background

Many voice communication systems, such as the Global System for Mobile communications (GSM) cellular telephony standard and the TErrestrial Trunked RAdio (TETRA) system for private mobile radio users, use speech processing units to encode and decode speech patterns. In such voice communication systems, a speech encoder converts the analog speech pattern into a digital format suitable for transmission. A speech decoder converts a received digital speech signal into an audible analog speech pattern.

Methods and apparatus for detecting voice activity are well known in the art. A voice activity detector (VAD) operates under the assumption that speech is present in only part of the audio signal. This assumption is usually correct, since many intervals of an audio signal contain only silence or background noise.

Voice activity detectors can be used for many purposes. These include suppressing overall transmission activity in a transmission system when there is no speech, thereby potentially saving power and channel bandwidth. When the VAD detects that voice activity has resumed, transmission activity can be restarted.

A voice activity detector may also be used in conjunction with a speech storage device, to distinguish the audio portions that contain speech from the "speechless" portions. The portions containing speech are then stored in the storage device and the "speechless" portions are discarded.

Existing methods for detecting voice are based, at least in part, on methods for detecting and estimating the power of the speech signal. The estimated power is compared with either a constant or an adaptive threshold in order to decide whether the signal is speech. The main advantage of these methods is their low complexity, which makes them suitable for implementations with low processing resources. The main disadvantage is that background noise can inadvertently cause "speech" to be detected when no speech is actually present. Conversely, speech that is actually present may go undetected because it is obscured by background noise.

Some methods for detecting voice activity are directed at noisy mobile environments and are based on adaptive filtering of the speech signal. This reduces the noise content of the signal before the final decision. Because the method is used with different speakers and in different environments, the frequency spectrum and the noise level may vary. The input filter and thresholds are therefore usually adaptive, so as to track these variations.

Examples of these methods are provided in the GSM specification 06.42 Voice Activity Detector (VAD) for half-rate, full-rate and enhanced full-rate speech traffic channels respectively. Another such method is the "Multi-Boundary Voice Activity Detection Algorithm" proposed in ITU G.729 Annex B. These methods are accurate in noisy environments but are complex to implement.

All of these methods require the speech signal as input. Some applications that employ speech decompression schemes need to perform speech detection during the speech decompression process.

European patent application No. EP-A-0785419 of Benyassine et al. relates to a method for voice activity detection that includes the steps of: (i) extracting a predetermined set of parameters from the incoming speech signal for each frame, and (ii) making a frame voicing decision on the incoming speech signal for each frame according to a set of difference measures extracted from the predetermined set of parameters.

VADs in cellular systems are biased to ensure that, when a party is talking, the radio equipment, including the speech codec, the RF circuitry and so on, is activated to convey that speech to the other party despite background noise and other impairments. However, this results in data being transmitted when a party is not talking. The cost of this approach is slightly reduced battery life and slightly increased interference to co-channel users in other cells of the system. These are essentially second (or higher) order effects.

In these systems there is no concept of a limited resource being available for a duplex call. The uplink and downlink, which are usually on different carriers, may both use the full bandwidth at the same time.

It is known in the field of the invention that some voice activity or voice onset detectors (VAD/VOD) attempt to use characteristics of speech such as harmonic structure (for example, via autocorrelation) to identify voiced speech. In noise, however, these structural indicators can fail, either because the speech structure is corrupted or because the noise itself has structure. Such noise may be, for example, engine, tire or air-conditioning noise in a car. Finally, these methods are weak at detecting unvoiced speech.

The alternative is simply to use the frame energy level to detect speech. This is satisfactory for speech in high signal-to-noise ratio (SNR) conditions, where an arbitrary threshold above the noise level can be set to indicate speech. However, this approach fails in many practical noise conditions.

For a non-normalized database, or in real-world applications, the noise level in one set of examples may well be higher than the speech level in another set, which makes it impossible to set a threshold. The existing way of overcoming this problem is to average approximately the first 100 milliseconds of an utterance, assume that this represents noise, and so create a specific threshold for that utterance. Again, however, this is inadequate for non-stationary noise, where the noise may rapidly depart from the initial estimate, where the noise has a high variance, or where the first few frames actually contain speech rather than the assumed noise.

A need therefore exists for an improved voice activity detector and validator for noisy environments that alleviates the above disadvantages.

Summary of the Invention

In accordance with a first aspect of the present invention, there is provided a communication unit as claimed in claim 1.

In accordance with a second aspect of the present invention, there is provided a method of detecting a speech signal input to a communication unit, as claimed in claim 11.

In accordance with a third aspect of the present invention, there is provided a method of determining whether a signal input to a communication unit is speech or noise, as claimed in claim 14.

Further aspects of the invention are as set out in the dependent claims.

In summary, the present invention aims to address the case of non-stationary noise of arbitrary amplitude by using an energy acceleration measure (preferably a measure of energy magnitude) to indicate the presence or absence of speech.

Brief Description of the Drawings

Exemplary embodiments of the present invention will now be described with reference to the accompanying drawings, in which:

FIG. 1 shows a block diagram of a communication unit adapted to perform the voice activity detection and validation of the preferred embodiment of the present invention;

FIG. 2 shows a flowchart of an energy-acceleration-based voice activity detector for noisy environments in accordance with the preferred embodiment of the present invention;

FIG. 3 shows a flowchart of energy-acceleration-based voice activity validation for noisy environments in accordance with the preferred embodiment of the present invention; and

FIG. 4 illustrates buffer operation in accordance with the preferred embodiment of the present invention.

Detailed Description

Voiced speech has relatively high energy acceleration values, because the onset of voiced speech depends on the action of the vocal cords, which are either vibrating or at rest. Similarly, the onset of unvoiced sounds (for example plosives) also has high energy acceleration.

The inventors have recognized that, in domains that typically exhibit distinct speech features, such as the narrowband power spectrum or the Mel spectrum, the resulting energy acceleration of speech is much higher than that of non-stationary noise. The only major exception is impulsive noise (for example, clapping).

Therefore, in accordance with the preferred embodiment of the present invention, the inventors have found that further discrimination from such noise is possible by concentrating on the energy in the frequency region likely to contain the fundamental pitch of the voice signal. In particular, the inventors propose using a non-structural feature of speech, namely energy acceleration (or the acceleration of some measure that reflects the speech energy or a component of it).

In particular, a preferred application of the inventive concepts described herein is the Distributed Speech Recognition (DSR) standard currently being defined by the European Telecommunications Standards Institute (ETSI): "Speech Processing, Transmission and Quality aspects (STQ); Distributed speech recognition; Front-end feature extraction algorithm; Compression algorithm", ETSI ES 201 108 v1.1.2 (2000-04), April 2000.

Referring now to FIG. 1, there is shown a block diagram of an audio subscriber unit 100 adapted to support the inventive concepts of the preferred embodiment of the present invention.

The preferred embodiment of the present invention is described with respect to a wireless audio communication unit, for example one capable of operating under the 3rd Generation Partnership Project (3GPP) standards for future cellular wireless communication systems and offering DSR capability. However, the inventive concepts described herein regarding voice activity detection and validation are equally applicable to any electronic device that responds to voice signals and can benefit from improved voice activity detection circuitry; such devices are also within the scope of the invention.

As is known in the art, the audio subscriber unit 100 contains an antenna 102, preferably coupled to a duplex filter, antenna switch or circulator 104 that provides isolation between the receive chain and the transmit chain within the audio subscriber unit 100.

The receiver chain includes receiver front-end circuitry 106 (effectively providing reception, filtering and intermediate or baseband frequency conversion). The front-end circuitry 106 is serially coupled to a signal processing function 108, generally realized by a digital signal processor (DSP). The signal processing function 108 performs signal demodulation, error correction and formatting. Data recovered by the signal processing function 108 is passed serially to an audio processing function 109, which formats the received signal in a suitable manner for delivery to an audio sounder/display 111.

In different embodiments of the invention, the signal processing function 108 and the audio processing function 109 may be provided within the same physical device. A controller 114 is arranged to control the information flow and operational state of the elements of the subscriber unit 100.

As regards the transmit chain, this essentially includes an audio input device 120 coupled in series through the audio processing function 109, the signal processing function 108, transmitter/modulation circuitry 122 and a power amplifier 124. The processor 108, transmitter/modulation circuitry 122 and power amplifier 124 are operationally responsive to the controller. The power amplifier output is coupled to the duplex filter, antenna switch or circulator 104 and the antenna 102, in order to radiate the final radio frequency signal.

In particular, the audio processing function 109 includes a voice activity (or voice onset) detection (VAD) function 130, operably coupled to a voice activity decision function 135. In accordance with the preferred embodiment of the present invention, the VAD function 130 and the voice activity decision function 135 have been adapted to provide an improved voice detection and decision mechanism, the operation of which is further described with respect to FIG. 2 and FIG. 3. Note that the voice activity detector function 130 comprises a frame-by-frame detection stage consisting of three measurements. The three frequency-range measurements are: (i) whole spectrum; (ii) spectral sub-band; and (iii) spectral variance.

The voice activity decision function 135 then makes its decision from a buffer of these measurements, analyzing them for speech likelihood. The final decision of the decision stage is applied retrospectively to the earliest frame in the buffer.

In the preferred embodiment of the present invention, a timer/counter 118 is also adapted to perform the timing functions in the detection and decision processes of FIG. 2 and FIG. 3.

The signal processor function 108, audio processing function 109, VAD function 130 and voice activity decision function 135 may be implemented as distinct, operably coupled processing elements. Alternatively, one or more processors may be used to implement one or more of the corresponding processing operations. In a further alternative embodiment, the aforementioned functions may be implemented as a mixture of hardware, software or firmware elements, using application-specific integrated circuits (ASICs) and/or processors, for example digital signal processors (DSPs).

Of course, the various components within the audio subscriber unit 100 can be realized in discrete or integrated component form, so the ultimate structure is merely an arbitrary selection.

To this end, there is more than one way of obtaining the energy acceleration indication used in the preferred embodiment of the present invention.

(i) The theoretically ideal approach is to double-differentiate the energy level precisely over successive frames of the utterance, as shown in the earlier published application US 6009391. The drawback of this approach is that it may introduce delay, because analyzing a frame requires several frames on either side of it.

(ii) A zero-delay estimate of the energy acceleration can be obtained by comparing a short-term average with the instantaneous value, for example using a frame average:

$$\tilde{A} = \frac{x_t}{(x_t + x_{t-1} + \cdots + x_{t-n})/(n+1)} \qquad [1]$$

or using a rolling average:

$$\tilde{A} = \frac{x_t}{a\,x_t + b\,x_{t-1} + \cdots + k\,x_{t-n}} \qquad [2]$$

In each case, the method returns a value that can be interpreted as 'deceleration' < 1 < 'acceleration'. Empirical values of $\tilde{A}$, and the denominator lengths that best separate speech from noise, can then be found.

The inventors have recognized that the preferred solution is to find a denominator that can rapidly track non-stationary noise but is too long to follow a voice onset. The suggested sequence of values for the rolling average is a = 0.2, b = 0.8 × a, c = 0.8 × b and so on, which can be expressed simply as the recursion:

$$d_t = 0.2\,x_t + 0.8\,d_{t-1} \qquad [3]$$

so that:

$$A = x_t / d_t \qquad [4]$$

The preferred VAD and parameter initialization system within the detection stage is summarized in the flowchart of FIG. 2. In non-stationary noise, long-term energy thresholds are not a reliable indicator of speech. Similarly, in high noise conditions the structure of the speech (for example, harmonics) cannot be wholly relied upon as an indicator, because it may be corrupted by noise, or structured noise may confuse the detector. The preferred voice activity detector therefore uses a noise-robust characteristic of speech, namely the energy acceleration associated with voice onset.
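The two ways of estimating energy acceleration described above can be contrasted in a short sketch. This is an illustration under stated assumptions, not the patented implementation: the function names are hypothetical, the central second difference stands in for "double differentiation" in its simplest form (it is not taken from US 6009391), and seeding the rolling average with the first frame is an assumption because the text does not give an initial value for d.

```python
def second_difference(energies):
    """Central second difference of per-frame energy: x[t+1] - 2*x[t] + x[t-1].

    The value for frame t needs frame t+1, so this estimate lags the input,
    which is the delay drawback mentioned for approach (i).
    """
    return [
        energies[t + 1] - 2 * energies[t] + energies[t - 1]
        for t in range(1, len(energies) - 1)
    ]


def rolling_acceleration(energies, alpha=0.2):
    """Zero-delay estimate A = x_t / d_t, with d_t = alpha*x_t + (1 - alpha)*d_{t-1}.

    Values above 1 suggest acceleration, values below 1 deceleration.
    """
    ratios = []
    d = None
    for x in energies:
        # Assumption: the rolling average is seeded with the first frame.
        d = x if d is None else alpha * x + (1 - alpha) * d
        # Guard against division by zero for silent input (hypothetical choice).
        ratios.append(x / d if d > 0 else 1.0)
    return ratios


if __name__ == "__main__":
    frame_energy = [1.0, 1.0, 1.0, 10.0, 10.0, 10.0]
    print(second_difference(frame_energy))
    print(rolling_acceleration(frame_energy))
```

With a steady input the ratio stays near 1; a sudden rise in energy pushes it well above 1 in the same frame, whereas the second difference can only report the rise once the following frame is available.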

Referring now to FIG. 2, a flowchart 200 of the preferred detection process is shown. As indicated above, the process comprises a frame-by-frame analysis. The preferred VAD mechanism involves a 'whole spectrum' measurement process. The frame counter is first assessed to determine whether it is less than 'N', which defines the number of buffered frames, as shown in step 205. As an example for the preferred embodiment, 'N' is set to '15', assuming that each frame is an increment of, say, 10 milliseconds. If the frame counter is less than 'N' in step 205, the rolling average is updated with the initial acceleration test, as in step 210. If the frame counter is not less than 'N' in step 205, step 210 is skipped.

A determination is then made as to whether the energy acceleration measure is within one or more specified limits, as shown in step 235. If the measure is within the specified limits in step 235, the rolling average is updated with the result of the further energy acceleration test, as in step 240. If the measure is not within the specified limits in step 235, step 240 is skipped.

A determination is then made as to whether the energy acceleration measure is greater than a specified threshold, as shown in step 260. If the measure is greater than the specified threshold in step 260, the frame is deemed to be a speech frame, as in step 265. If the measure is not greater than the specified threshold in step 260, the frame is deemed to be a noise frame, as in step 270.

The frame counter is then incremented, as in step 275, and the process repeats from step 205.

As a refinement of this process, a sub-region measurement process, shown as optional steps 215 and 245, may be performed instead of, or in addition to, the whole-spectrum measurement process. The particular sub-region of the spectrum is chosen as the one most likely to contain the fundamental pitch.

In the sub-region process, when the rolling average of the initial acceleration test is updated in the whole-spectrum measurement in step 210, a determination is made as to whether the energy acceleration measure is greater than a threshold, as shown in step 220. If the measure is greater than the threshold in step 220, the initialization of the other parameters is suspended, as shown in step 225. If the measure is not greater than the threshold in step 220, the initialization of the other parameters is updated, as in step 230. The process then returns to step 235, as shown.

A further preferred determination is made after the determination in step 235 of whether the energy acceleration measure is within one or more specified limits. The deceleration value is assessed in step 250 to determine whether it is 'high' and, if so, the rolling average of the energy acceleration test is updated slowly, as shown in step 255. The process then returns to the whole-spectrum method at step 260.

In this manner, the higher signal-to-noise ratio (SNR) of the sub-region detector makes it more robust to noise. However, it is vulnerable to unfavorable variations in microphone and speaker, and to band-limited noise. The measure should therefore not be relied upon in all circumstances. Accordingly, the preferred embodiment of the present invention incorporates the sub-region detector as a reinforcement of the whole-spectrum measure.

A further measurement process is preferably performed using, for example, the 'acceleration' of the variance of the values within the lower half of the spectrum of each frame. The variance measure detects structure within the lower half of the spectrum, making it highly sensitive to voiced speech. The variance measure follows the method of the sub-region process, with the lower half of the spectrum being the particular sub-region selected. This variance measure further complements the whole-spectrum measure, which is better able to detect unvoiced and plosive speech.

All three measures take their raw input from the spectral representation of the filter gain generated by the first stage of a double Wiener filter, as described in US patent application 09/427497, whose applicant is Motorola Inc. and whose inventor is Yan-Ming Chen. As noted above, each measure uses a different aspect of this data.

In particular, the whole-spectrum detector uses the known Mel-filtered spectral representation of the filter gain generated by the first stage of the double Wiener filter. A single input value is obtained by squaring the sum of the Mel filter banks.

In the preferred embodiment of the present invention, the whole-spectrum detector applies the following process to every frame:

Step one initializes the noise estimate tracking value (Tracker) as follows:
If Frame < 15 and Acceleration < 2.5, then Tracker = MAX(Tracker, Input).
If speech occurs within the 15-frame lead-in time, the energy acceleration measure prevents Tracker from being updated.

Step two updates Tracker if the current input is similar to the noise estimate:
If Input < Tracker × UpperBound and Input > Tracker × LowerBound, then Tracker = a × Tracker + (1 - a) × Input.

Step three provides a failsafe for instances where speech or uncharacteristically large noise is present within the first few frames. It causes the resulting erroneously high noise estimate to decay:
If Input < Tracker × Floor, then Tracker = b × Tracker + (1 - b) × Input.

Step four returns a 'true' speech determination if the current input exceeds 165% of Tracker:
If Input > Tracker × Threshold, then output 'true', otherwise output 'false'.

The ratio of the instantaneous input to the short-term mean Tracker is a function of the energy acceleration of successive inputs.

In the above: a = 0.8 and b = 0.97; UpperBound is 150% and LowerBound is 75%; Floor is 50%; and Threshold is 165%.
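The four steps of the whole-spectrum detector can be sketched as a small per-frame state machine. The following is a minimal illustration, assuming the steps are applied in order to each frame; the class and attribute names are hypothetical, the initial tracker value of 0.0 and the 1-based frame count are assumptions (the text states neither), and the acceleration value is passed in by the caller because the description allows more than one way of computing it.

```python
class WholeSpectrumDetector:
    """Per-frame noise tracker and speech decision (illustrative sketch only)."""

    def __init__(self, threshold=1.65, a=0.8, b=0.97,
                 upper=1.50, lower=0.75, floor=0.50,
                 lead_in=15, max_accel=2.5):
        self.threshold = threshold
        self.a, self.b = a, b
        self.upper, self.lower, self.floor = upper, lower, floor
        self.lead_in, self.max_accel = lead_in, max_accel
        self.tracker = 0.0   # assumption: noise estimate starts at zero
        self.frame = 0       # assumption: frames are counted from 1

    def step(self, x, acceleration):
        """Process one input value; return True if the frame looks like speech."""
        self.frame += 1
        # Step one: initialize the tracker during the lead-in, unless the
        # acceleration suggests that speech is already present.
        if self.frame < self.lead_in and acceleration < self.max_accel:
            self.tracker = max(self.tracker, x)
        # Step two: follow inputs that are close to the current noise estimate.
        if self.tracker * self.lower < x < self.tracker * self.upper:
            self.tracker = self.a * self.tracker + (1 - self.a) * x
        # Step three: failsafe, slowly decay an estimate that is far too high.
        if x < self.tracker * self.floor:
            self.tracker = self.b * self.tracker + (1 - self.b) * x
        # Step four: speech if the input is well above the noise estimate.
        return x > self.tracker * self.threshold


if __name__ == "__main__":
    det = WholeSpectrumDetector()
    for value in [1.0] * 20 + [5.0, 5.0, 1.0]:
        print(value, det.step(value, acceleration=1.0))
```

Note that an input above UpperBound, or between LowerBound and Floor, leaves the tracker untouched, so a burst of speech does not drag the noise estimate upward.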

Note that no update takes place if the value is greater than UpperBound, or lies between LowerBound and Floor. Furthermore, as noted above, the energy acceleration input may either be computed by double differentiation over successive inputs or be estimated by tracking the ratio of two rolling averages of the inputs.

It should be noted that the ratio of the fast-adapting to the slow-adapting rolling average reflects the energy acceleration of successive inputs.

For example, the contribution rates to the averages used above are: (i) 0 × mean + 1 × input, and (ii) ((frame count - 1) × mean + 1 × input) / frame count, which makes the energy acceleration measure increasingly sensitive over the first fifteen frames.
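The ratio of the two rolling averages can be sketched as below. The sketch assumes that rate (i) is the fast average and rate (ii) the slow one, and that inputs are positive energies; the function name `acceleration_ratios` is hypothetical.

```python
def acceleration_ratios(inputs):
    """Return, per frame, the ratio of a fast to a slow rolling average.

    Assumed reading of the text: rate (i) makes the fast average equal to
    the current input, and rate (ii) makes the slow average the running
    mean of all frames seen so far.
    """
    ratios = []
    slow = 0.0
    for frame, value in enumerate(inputs, start=1):
        fast = float(value)                           # (i) 0 x mean + 1 x input
        slow = ((frame - 1) * slow + value) / frame   # (ii) running mean
        # Guard against division by zero for silent input (an assumption).
        ratios.append(fast / slow if slow > 0 else 0.0)
    return ratios
```

A steady input gives a ratio of 1, while a sudden rise in energy pushes the ratio above 1. Since the slow average weights each new frame by 1 / frame count, it reacts less to each successive frame, so a given jump stands out more clearly as the frame count grows.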

The sub-band detector preferably uses the average of the second, third and fourth Mel filter bank outputs derived from the 'whole spectrum' measure. The detector then applies the following processing to all frames:

(i) Input = p × CurrentInput + (1-p) × PreviousInput;
(ii) if Frame < 15, then Tracker = MAX(Tracker, Input);
(iii) if Input < Tracker × UpperBound and Input > Tracker × LowerBound, then Tracker = a × Tracker + (1-a) × Input;
(iv) if Input < Tracker × Floor, then Tracker = b × Tracker + (1-b) × Input;
(v) if Input > Tracker × Threshold, output 'true', otherwise output 'false'.

In the sub-band measure: p = 0.75.

All other parameters are the same as for the whole-spectrum measure, except the threshold, which equals 3.25.
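Rules (i) to (v) can be sketched as a single pass over the sub-band inputs. Several details are assumptions rather than statements from the text: the frame count starts at zero, PreviousInput is taken to be the previous unsmoothed input, and the tracker is initialised from the first smoothed input. The function name `subband_vad` is hypothetical.

```python
def subband_vad(inputs, p=0.75, a=0.8, b=0.97, upper=1.5, lower=0.75,
                floor=0.5, threshold=3.25, lead_in=15):
    """Return one true/false decision per frame for the sub-band measure."""
    decisions = []
    tracker = None
    previous = None
    for frame, current in enumerate(inputs):
        if previous is None:
            previous = current          # assumption: no history on frame 0
        value = p * current + (1 - p) * previous   # (i) smooth the input
        previous = current
        if tracker is None:
            tracker = value             # assumption: initial tracker value
        if frame < lead_in:
            tracker = max(tracker, value)          # (ii) lead-in period
        if tracker * lower < value < tracker * upper:
            tracker = a * tracker + (1 - a) * value    # (iii) normal update
        elif value < tracker * floor:
            tracker = b * tracker + (1 - b) * value    # (iv) failsafe decay
        decisions.append(value > tracker * threshold)  # (v) decision
    return decisions
```

For a flat input of 1.0 followed by a jump to 10.0 after the lead-in period, only the frame containing the jump is flagged.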

For the spectral variance measure, the variance of the values comprising the lower half of the frequencies of the narrowband spectral representation of the gain for each frame is used as the input. The detector then applies the same processing as for the whole-spectrum measure.

The variance is calculated as:

$$\frac{1}{N}\sum_{i=0}^{N-1} w_i^{2} - \left(\sum_{i=0}^{N-1} w_i\right)^{2} \Big/ N^{2} \qquad [5]$$

where N = FFT length / 4, and $w_i$ are the values of the narrowband spectral representation of the gain.
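Equation [5] is the mean of the squares minus the square of the mean. A minimal sketch, assuming the caller has already extracted the N = FFT length / 4 gain values for the frame (the function name `spectral_variance` is hypothetical):

```python
def spectral_variance(w):
    """Variance of the gain values w, following equation [5]."""
    n = len(w)
    sum_of_squares = sum(x * x for x in w)
    square_of_sum = sum(w) ** 2
    return sum_of_squares / n - square_of_sum / (n * n)
```

For the values 1, 2, 3 and 4 this gives 30/4 - 100/16 = 1.25, and a flat spectrum gives zero.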

According to the preferred embodiment of the invention, the three measures detailed above are supplied to the VAD decision algorithm, as shown in the flowchart of FIG. 3. Successive inputs are fed into a buffer, which provides contextual analysis. This introduces a frame delay equal to the buffer length minus one frame.

Referring now to FIG. 3, a flowchart 300 of an acceleration-based voice activity validation process for noisy environments is shown, in accordance with the preferred embodiment of the invention.

For an N = 7 frame buffer, the most recent true/false speech input is stored at position N of the data buffer, as shown in step 305. The decision logic applies a number of the following steps, and preferably each of them:

Step 1: V_N = Measure 1 OR Measure 2 OR Measure 3. If any one of the three measures returns a true speech indication, the input V_N is defined as 'true' (T).

Step 2: The algorithm searches the buffer for the longest contiguous sequence of 'true' values, of length M, as in step 310. Thus, for example, for the sequence 'TTFTTTF', M equals 3.

Step 3: If M ≥ SP and T < LS, then T = LS; where SP equates to the first threshold in step 315. If, in step 315, the longest sequence of true (T) speech values equals or exceeds the first threshold, i.e. there are SP = 3 or more contiguous 'true' values, the buffer is judged to contain 'possible' speech. If step 320 determines that a timer of at least this length is not already running, a short timer T (time_1) of, for example, LS = 5 frames is started in step 325.

Step 4: If M ≥ SL, then: if F > FS, T = LM, otherwise T = LL; where SL equates to the second threshold in step 330. If there are SL = 4 or more contiguous 'true' values, the buffer is judged to contain 'likely' speech. If the current frame F is outside the initial lead-in safety period FS, as determined in step 335, a medium timer T of, for example, LM = 22 frames is started in step 340. Otherwise, a failsafe long timer T of, for example, LL = 40 frames is used in step 345. This arrangement is used in case speech occurring early in the utterance has made the VAD's initial noise estimate too high.

Step 5: If M < SP and T > 0, then T--. If the process determines in step 350 that there are fewer than SP = 3 contiguous 'true' values, and the timer is greater than zero in step 355, the timer is decremented in step 360.

Step 6: If T > 0, output 'true', otherwise output 'false'. If the timer is greater than zero in step 365, the process outputs a 'true' speech decision, as shown in step 370. Otherwise, if the timer is not greater than zero in step 365, the process outputs a 'noise' decision, as shown in step 375.

Step 7: Frame++, shift the buffer to the left and return to step 1.

In preparing for the next frame in step 380, the buffer is shifted to the left to accommodate the next input, as illustrated in FIG. 4. The output speech decision is applied to the frame leaving the buffer. The process is then repeated in step 305 for the next true/false input entered into the data buffer.
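Steps 1 to 7 can be sketched as a small loop. This is a literal reading under stated assumptions, not the patented implementation: the buffer starts filled with 'false', each element of `inputs` is already the OR of the three measures (step 1), FS = 15 frames is assumed for the lead-in safety period, and the names `longest_true_run` and `vad_decisions` are hypothetical.

```python
def longest_true_run(flags):
    """Step 2: length M of the longest contiguous run of true values."""
    best = run = 0
    for flag in flags:
        run = run + 1 if flag else 0
        best = max(best, run)
    return best


def vad_decisions(inputs, n=7, sp=3, sl=4, ls=5, lm=22, ll=40, fs=15):
    """Return (frame_number, decision) pairs for frames leaving the buffer."""
    buffer = [False] * n          # assumption: buffer starts as all 'false'
    timer = 0
    out = []
    for frame, v_n in enumerate(inputs, start=1):
        # Step 7 then step 1: shift left and store the newest input at N.
        buffer = buffer[1:] + [bool(v_n)]
        m = longest_true_run(buffer)              # step 2
        if m >= sp and timer < ls:                # step 3: 'possible' speech
            timer = ls
        if m >= sl:                               # step 4: 'likely' speech
            timer = lm if frame > fs else ll
        if m < sp and timer > 0:                  # step 5: hangover countdown
            timer -= 1
        leaving = frame - (n - 1)                 # delay of N - 1 frames
        if leaving >= 1:
            out.append((leaving, timer > 0))      # step 6
    return out
```

Decisions lag the input by N - 1 frames, so a caller would pad the end of an utterance with N - 1 'false' inputs to flush the buffer. The hangover this sketch produces depends on how the timer steps are read, so its output need not match the illustrative sequence of FIG. 4.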

Alternative mechanisms for making the speech or noise decision from the energy acceleration processing described above are also within the contemplation of the invention. For example, the decision mechanism need not be based on one or more timers, and may instead be made purely on whether one or more energy acceleration thresholds are exceeded.

Referring now to FIG. 4, an example of buffer operation 400 in accordance with the preferred embodiment of the invention is shown in more detail. Assume that the first threshold is set to three contiguous 'true' values. At time 't' 410, assume that only the current input (frame #7) 425 and the previous input (frame #6) 420 are 'true'. Thus, when the buffer is shifted, the first frame (frame #1) 415 will be labelled 'false'.

At time 't+1' 430, a third 'true' input (frame #8) 450 has been received, supplementing the previous two 'true' inputs 440 and 445. Thus, when the buffer is shifted, the next output frame (frame #2) 435 will be labelled 'true'.

It should be noted that, in the above decision process, the only constraints are: (i) time_1 < time_2 < time_3, and (ii) threshold_1 < threshold_2.

Assuming that only these three inputs (frame #6, frame #7 and frame #8) are 'true', the entire output sequence is:

Frame:   1  2  3  4  5  6  7  8  9 10 11
Output:  F  T  T  T  T  T  T  T  T  T  T

Frame:  12 13 14 15 16 17 18 19 20 21 22
Output:  T  T  T  T  T  T  F  F  F  F  F

Frames #2-#5 are indicated as 'true' owing to the buffer lead-in function. Frames #6-#8 are indicated as 'true' because they are the location of the actual initial 'true' speech inputs. Frames #9-#12 are indicated as 'true' owing to the buffer lead-out function. Frames #13-#18 are indicated as 'true' in response to the delay timer used. Once all frames of the utterance have been input, the buffer shifts out 'false' entries (frames #19-#LM) until it is empty.

It is also within the scope of the invention that the buffer length and the delay timers may be dynamically adjusted to suit the needs of the audio communication unit. Likewise, the preferred embodiment's use of a buffer length 'N' of 8 and a delay timer of 5 frames is for explanatory purposes only. It should be noted, however, that the buffer length 'N' should always be chosen such that N ≥ SL.

Besides serving as a VAD in its own right, the energy acceleration measure performed in the method steps of FIG. 2 may be used to validate the initialisation of other parameters, which is also within the contemplation of the invention. For example, spectral subtraction schemes require an initial estimate of the noise taken from the first ten frames (typically 100 milliseconds) of the signal. Even in stationary noise, a number of events may occur that invalidate the initial estimate. Examples of such events include:

(a) Ramp-up of the signal: for various possible reasons, the start of a recording may 'ramp up' to full value within the period over which the estimate is made. Reasons for ramp-up include buffer filling in digital systems, and capacitance or lead connection in analogue systems. The effect of these events is to invalidate the estimate. The energy acceleration measure can therefore be used to detect such ramp-up and prevent this error.

(b) A glitch in the initial signal: a common 'glitch' accompanies the full actuation of the push-to-talk (PTT) button on a user's wireless unit, where electrical contact occurs slightly before the button hits the back of the switch. As noted above, when such an event occurs, the energy acceleration measure can be used to suspend the estimation process, as shown in step 225 of FIG. 2.

(c) Speech in the initial signal: another commonly occurring event, particularly in PTT systems, is that the user begins to speak immediately on pressing the PTT button. In this way, electrical contact is made after speech has started. The energy acceleration measure can recognise this and either suspend the noise-based initialisation, as shown in step 225 of FIG. 2, or force the use of a default estimate.

In summary, a communication unit comprising an audio processing unit having a voice activity detection mechanism has been described. The voice activity detection mechanism provides an indication of the energy acceleration of a signal input to the communication unit and determines, based on that indication, whether the input signal is speech or noise.

In addition, a method of detecting a speech signal input to a communication unit has been described. The method comprises the steps of: indicating the acceleration of an input signal input to the communication unit; and determining, based on the indicating step, whether the input signal is speech or noise.

In addition, a method of deciding whether a signal input to a communication unit is speech or noise has been described. The method comprises the step of deciding, based on energy acceleration, whether the input signal is speech or noise, for example using a frame average or rolling average of a number of input signals.

It will therefore be understood that the energy-acceleration-based voice activity detector and validator for noisy environments described above provides the advantages of noise robustness and fast response. Because the preferred embodiment uses measures that depend on energy acceleration rather than on absolute values, the inventive concepts described herein can be applied to speech at any input level.

Whilst specific and preferred implementations of embodiments of the invention have been described above, it is clear that a person skilled in the art could readily apply variations and modifications of these inventive concepts that fall within the scope of the invention.

Thus, an improved voice activity detector and validator for noisy environments has been described, in which the aforementioned disadvantages associated with prior art arrangements have been substantially alleviated.

Claims (15)

1. A communication unit (100) comprising an audio processing unit (109) having a voice activity detection mechanism (130, 135), the communication unit (100) characterised in that the voice activity detection mechanism (130, 135) measures the energy acceleration of a signal input to the communication unit (100) and determines, based on the measurement, whether the input signal is speech or noise.
2. The communication unit (100) of claim 1, wherein the voice activity detection mechanism comprises a voice activity detector function (130) that performs frame-by-frame detection of speech on a signal input to the voice activity detection mechanism (130, 135).
3. The communication unit (100) of claim 2, wherein the frame-by-frame detection comprises performing energy acceleration measurements on the signal input to the voice activity detection mechanism (130, 135) over one or more of the following frequency ranges: (i) the whole spectrum; (ii) a spectral sub-band; and (iii) the spectral variance.
4. The communication unit (100) of claim 3, wherein the voice activity detection mechanism comprises a voice activity decision function (135), operably coupled to the voice activity detector function (130), to decide whether the input signal is speech based on a buffering operation on one or more of the measurements.
5. The communication unit (100) of claim 4, wherein the voice activity decision function (135) uses a frame average or rolling average of a plurality of the input signals to decide whether the input signal is speech.
6. The communication unit (100) of any one of claims 2 to 5, wherein an input frame is deemed to be a speech frame (265) if the energy acceleration measurement yields an energy acceleration value greater than an energy acceleration threshold.
7. The communication unit (100) of claim 6, wherein the decision (265) that an input frame is a speech frame is applied retrospectively to preceding frames in a buffer of the input signal.
8. The communication unit (100) of claim 6 or claim 7, wherein an input frame is deemed to be a speech frame (370) if, for a plurality of consecutive frames, the energy acceleration measurement yields an energy acceleration value greater than the energy acceleration threshold.
9. The communication unit (100) of any one of claims 3 to 8 when dependent upon claim 3, wherein, if a sub-region of the input signal spectrum is selected, the selection is made on the basis that the sub-region is most likely to contain the fundamental pitch of the speech signal.
10. The communication unit (100) of any preceding claim, wherein the voice activity detection mechanism (130, 135) uses the acceleration of an energy-related feature of speech to validate the parameter initialisation of other speech-related or noise-related measures, for example a spectral subtraction scheme.
11. A method of detecting a speech signal input to a communication unit, characterised by the steps of: measuring the acceleration of, or change in, the energy of an input signal input to the communication unit; and determining (315, 330, 350), based on the measuring step, whether the input signal is speech (370) or noise (375).
12. The method of detecting a speech signal of claim 11, further characterised by the step of: performing frame-by-frame detection of speech on the signal input to the communication unit.
13. The method of detecting a speech signal of claim 12, wherein the frame-by-frame detection comprises the step of: performing energy acceleration measurements on the input signal over one or more of the following frequency ranges: (i) the whole spectrum; (ii) a spectral sub-band; and (iii) the spectral variance.
14. A method of deciding whether a signal input to a communication unit is speech or noise, preferably in accordance with any one of preceding claims 11 to 13, the method characterised by the further step of: deciding (315, 330, 350) whether the input signal is speech (370) or noise (375) based on the energy acceleration of, or change in, an energy measurement of the input signal, for example using a frame average or rolling average of a plurality of input signals.
15. The method of deciding whether a signal input to a communication unit is speech or noise of claim 14, wherein the deciding step comprises: determining that an input frame is a speech frame (265) if the energy acceleration measurement yields an energy acceleration value greater than an energy acceleration threshold; and applying the determination retrospectively to preceding frames in a buffer of the input signal.
CNB038026821A 2002-01-24 2003-01-10 Voice activity detector and validator for noisy environments CN1307613C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB0201585A GB2384670B (en) 2002-01-24 2002-01-24 Voice activity detector and validator for noisy environments

Publications (2)

Publication Number Publication Date
CN1623186A true CN1623186A (en) 2005-06-01
CN1307613C CN1307613C (en) 2007-03-28

Family

ID=9929648

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB038026821A CN1307613C (en) 2002-01-24 2003-01-10 Voice activity detector and validator for noisy environments

Country Status (6)

Country Link
JP (2) JP2005516247A (en)
KR (2) KR100976082B1 (en)
CN (1) CN1307613C (en)
FI (1) FI124869B (en)
GB (1) GB2384670B (en)
WO (1) WO2003063138A1 (en)

Also Published As

Publication number Publication date
WO2003063138A1 (en) 2003-07-31
KR20090127182A (en) 2009-12-09
JP2010061151A (en) 2010-03-18
GB0201585D0 (en) 2002-03-13
KR100976082B1 (en) 2010-08-16
KR20040075959A (en) 2004-08-30
FI20041013B1 (en)
FI20041013A (en) 2004-09-22
JP2005516247A (en) 2005-06-02
GB2384670B (en) 2004-02-18
FI124869B (en) 2015-02-27
CN1307613C (en) 2007-03-28
GB2384670A (en) 2003-07-30

Legal Events

Date Code Title Description
C06 Publication
C10 Entry into substantive examination
C14 Grant of patent or utility model
C41 Transfer of patent application or patent right or utility model
ASS Succession or assignment of patent right

Owner name: MOTOROLA MOBILE CO., LTD.

Free format text: FORMER OWNER: MOTOROLA INC.

Effective date: 20110113

C56 Change in the name or address of the patentee
C41 Transfer of patent application or patent right or utility model