KR20040075959A - Voice activity detector and validator for noisy environments

Info

Publication number: KR20040075959A
Application number: KR10-2004-7011459A
Authority: KR
Inventors: Douglas Ralph Ealey; Holly Louise Kelleher; David John Benjamin Pearce
Original assignee: Motorola, Inc.
Priority date: 2002-01-24
Filing date: 2003-01-10
Publication date: 2004-08-30
Also published as: JP2010061151A; KR100976082B1; FI20041013A; GB0201585D0; JP2005516247A; WO2003063138A1; CN1623186A; GB2384670B; CN1307613C; GB2384670A; FI124869B; KR20090127182A

Abstract

A communication unit (100) includes an audio processing unit (109) having a voice activity detection mechanism (130, 135). The voice activity detection mechanism (130, 135) measures the energy acceleration of a signal input to the communication unit (100) and, based on that measurement, determines whether the input signal is speech or noise. A method of detecting speech and a method of determining whether an input signal is speech or noise are also described. Using an energy-acceleration-based voice activity detector and validator, particularly in noisy environments, provides the advantages of independence from the input speech level, fast response, and noise robustness.

Description

Voice activity detector and validator for noisy environments

Many voice communication systems, such as the Global System for Mobile Communications (GSM) cellular telephony standard and the TErrestrial Trunked RAdio (TETRA) system for private mobile radio users, use speech-processing units to encode and decode speech patterns. In such systems, a speech encoder converts the analog speech pattern into a suitable digital format for transmission. A speech decoder converts a received digital speech signal into an audible analog speech pattern.

Methods and apparatus for detecting voice activity are known in the art. A voice activity detector (VAD) operates on the assumption that speech is present in only part of an audio signal. This assumption is usually correct, since many intervals of an audio signal contain only silence or background noise.

A voice activity detector can be used for many purposes. These include suppressing overall transmission activity in a transmission system when no speech is present, which can save considerable power and channel bandwidth. When the VAD detects that speech activity has resumed, it re-initiates transmission activity.

A voice activity detector can also be used in conjunction with speech storage devices, by differentiating audio portions that contain speech from those that are "speechless". The portions containing speech are then stored in the storage device and the "speechless" portions are discarded.

Conventional methods for detecting voice are based, at least in part, on detecting and assessing the power of the speech signal. The estimated power is compared with either a constant or an adaptive threshold to decide whether the signal is speech. The main advantage of these methods is their low complexity, which makes them suitable for implementations with limited processing resources. Their main disadvantage is that background noise can cause "speech" to be detected when none is actually present. Conversely, speech that is present may go undetected because it is obscured by background noise and hard to distinguish from it.

Some methods for detecting speech activity are aimed at noisy mobile environments and are based on adaptive filtering of the speech signal. This reduces the noise content of the signal before the final decision. Because such a method is used with different speakers and in different environments, the frequency spectrum and noise level may vary. The input filter and thresholds are therefore often adaptive, so as to track these variations.

Examples of these methods are given in GSM Specification 06.42, Voice Activity Detector (VAD), for half-rate, full-rate and enhanced full-rate speech traffic channels respectively. Another such method is the "Multi-Boundary Voice Activity Detection Algorithm" proposed in ITU G.729 Annex B. These methods are quite accurate in noisy environments but are considerably complex to implement.

All of these methods require the speech signal as input. Some applications that use speech decompression schemes need speech detection to be performed during the decompression process.

European patent application EP-A-0785419, by Benyassine et al., is directed to a method for voice activity detection that includes the following steps:

(i) extracting a predetermined set of parameters from the incoming speech signal for each frame; and

(ii) making a frame voicing decision on the incoming speech signal for each frame according to a set of difference measures extracted from the predetermined set of parameters.

In cellular systems, the VAD is biased toward activating the radio, including the speech codec, RF circuitry and so on, so that speech is conveyed to the other party whenever a party speaks, even in the presence of background noise and other impairments. As a result, however, data is sometimes transmitted when the party is not speaking. This slightly reduces battery life and slightly increases interference to co-channel users in other cells of the system. These are essentially second-order (or higher) effects.

In these systems there is no concept of a finite resource being made available for a duplex call. The resource is available, and consistent, for both the uplink and the downlink, which are typically on different carriers so that the full bandwidth can be used simultaneously.

In the field of this invention, it is known that some voice activity or voice onset detectors (VADs/VODs) attempt to distinguish voiced speech using characteristics of speech such as harmonic structure (for example, via autocorrelation). Under noise, however, these structural indicators can fail, either because the speech structure is destroyed or because the noise itself is structured. Such noise may be, for example, engine, tire or air-conditioning noise in a vehicle. Finally, these methods are poor at detecting unvoiced speech.

An alternative approach is to use only the frame energy level to detect speech. This is satisfactory for speech in high signal-to-noise ratio (SNR) conditions, where an arbitrary threshold above the noise level can be set to indicate speech. However, the approach fails in more realistic noise conditions.

For non-normalized databases, or in real applications, the noise levels in one set of examples may be greater than the speech levels in another, making it impossible to set a fixed threshold. A common way to overcome this is to average the first 100 msec or so of an utterance, on the assumption that it represents noise, to produce a threshold specific to that utterance. Again, however, this is inadequate for non-stationary noise, where the noise may diverge rapidly from the initial estimate or have a high variance, or where the first few frames actually contain speech rather than the presumed noise.
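The conventional per-utterance threshold described above can be sketched as follows. This is a minimal illustration, not a method taken from the cited prior art: the function name, the `margin` factor and the choice of 10 lead-in frames (100 msec at an assumed 10 msec frame step) are all hypothetical.

```python
def baseline_energy_vad(frame_energies, lead_in_frames=10, margin=2.0):
    """Flag frames whose energy exceeds a threshold derived from the lead-in.

    Assumption: the first `lead_in_frames` frames contain only noise.
    `margin` is a hypothetical multiplier, not a value from the text.
    """
    if lead_in_frames <= 0 or len(frame_energies) < lead_in_frames:
        raise ValueError("need at least lead_in_frames frames")
    # Average the lead-in frames to obtain an utterance-specific noise estimate.
    noise_estimate = sum(frame_energies[:lead_in_frames]) / lead_in_frames
    threshold = margin * noise_estimate
    # The threshold is fixed for the rest of the utterance.
    return [energy > threshold for energy in frame_energies]
```

Because the threshold never changes after the lead-in, a later rise in the noise floor is flagged as speech, which is the weakness with non-stationary noise that the paragraph above points out.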

Thus there is a need for an improved voice activity detector and validator for noisy environments, in which the above-mentioned disadvantages may be alleviated.

This invention relates to speech detection (commonly known as voice activity detection (VAD)) in noisy environments. The invention is applicable to, but not limited to, energy acceleration measurement of voice signals in a speech detection system.

FIG. 1 is a block diagram of a communication unit adapted to perform voice activity detection and validation in accordance with a preferred embodiment of the present invention.

FIG. 2 is a flowchart of an energy-acceleration-based voice activity detector for noisy environments in accordance with a preferred embodiment of the present invention.

FIG. 3 is a flowchart of energy-acceleration-based voice activity validation for noisy environments in accordance with a preferred embodiment of the present invention.

FIG. 4 illustrates a buffer operation in accordance with a preferred embodiment of the present invention.

According to a first aspect of the present invention, there is provided a communication unit as claimed in claim 1.

According to a second aspect of the present invention, there is provided a method of detecting a speech signal input to a communication unit, as claimed in claim 11.

According to a third aspect of the present invention, there is provided a method of determining whether a signal input to a communication unit is speech or noise, as claimed in claim 14.

Further aspects of the invention are as claimed in the dependent claims.

In summary, the present invention seeks to handle non-stationary noise of arbitrary amplitude by using an energy acceleration measurement, rather than an energy amplitude measurement, to indicate the presence or absence of speech.

Exemplary embodiments of the present invention will now be described with reference to the accompanying drawings.

Because the onset of voiced speech depends on the vocal cords being activated from rest into vibration, voiced speech has a relatively high energy acceleration value. Similarly, unvoiced onsets (for example, plosives) also have high energy accelerations.

The inventors have recognized that, in a representational domain emphasizing voicing, such as the narrowband power spectrum or the Mel spectrum, the resultant energy acceleration is significantly higher than that of non-stationary noise. The only significant exceptions are impulsive noises (for example, a clap).

Hence, in accordance with the preferred embodiments of the present invention, the inventors have recognized that these noises can be further discriminated against by concentrating on energy in the frequency region most likely to contain the fundamental pitch of the voice signal. In particular, the inventors propose using a non-structural characteristic of speech, namely an energy characteristic (or the acceleration of any metric that reflects speech energy or its components).

In particular, a preferred application of the inventive concepts described herein is distributed speech recognition (DSR), as currently defined in European Telecommunications Standards Institute (ETSI), "Speech Processing, Transmission and Quality aspects (STQ); Distributed speech recognition; Front-end feature extraction algorithm; Compression algorithm", ETSI ES 201 108 v1.1.2 (2000-04), April 2000.

Referring now to FIG. 1, there is shown a block diagram of an audio subscriber unit (100) adapted to support the inventive concepts of the preferred embodiments of the present invention.

The preferred embodiment is described with respect to a wireless audio communication unit capable of providing DSR functionality, for example one operating under the 3rd Generation Partnership Project (3GPP) standard for future cellular wireless communication systems. However, it is within the contemplation of the invention that the inventive concepts described herein regarding voice activity detection and its validation apply equally to any electronic device that responds to voice signals and would benefit from improved voice activity detection circuitry.

As known in the art, the audio subscriber unit (100) contains an antenna (102), preferably coupled to a duplex filter, antenna switch or circulator (104) that provides isolation between the receive and transmit chains within the audio subscriber unit (100).

The receiver chain includes receiver front-end circuitry (106), effectively providing reception, filtering and intermediate or baseband frequency conversion. The front-end circuitry (106) is serially coupled to a signal processing function (108), generally realized by a digital signal processor (DSP). The signal processing function (108) performs signal demodulation, error correction and formatting. Recovered data from the signal processing function (108) is serially coupled to an audio processing function (109), which formats the received signal in a suitable manner for sending to an audio enunciator/display (111).

In different embodiments of the invention, the signal processing function (108) and the audio processing function (109) may be provided within the same physical device. A controller (114) is configured to control the information flow and operational state of the elements of the subscriber unit (100).

The transmit chain essentially includes an audio input device (120) coupled in series through the audio processing function (109), the signal processing function (108), transmitter/modulation circuitry (122) and a power amplifier (124). The processor (108), transmitter/modulation circuitry (122) and power amplifier (124) are operationally responsive to the controller. The power amplifier output is coupled to the duplex filter, antenna switch or circulator (104) and to the antenna (102), which radiates the final radio frequency signal.

In particular, the audio processing function (109) includes a voice activity (or voice onset) detection (VAD) function (130) operably coupled to a voice activity decision function (135). In accordance with the preferred embodiments of the present invention, the VAD function (130) and the voice activity decision function (135) are adapted to provide an improved voice detection and decision mechanism, whose operation is further described with respect to FIG. 2 and FIG. 3. The voice activity detector function (130) comprises a frame-by-frame detection stage consisting of three measurements. The three measurements cover the following frequency ranges:

(i) the whole spectrum;

(ii) spectral sub-bands; and

(iii) spectral variance.

The voice activity decision function (135) then makes its decision based on a buffer of measurements, which are analyzed for their speech likelihood. The final decision from the decision stage is applied retrospectively to an earlier frame in the buffer.

In the preferred embodiment of the present invention, a timer/counter (118) is also adapted to perform the timing functions in the detection and decision processes of FIG. 2 and FIG. 3.

The signal processor function (108), audio processing function (109), VAD function (130) and voice activity decision function (135) may be implemented as separate, operably coupled processing elements. Alternatively, one or more processors may be used to perform one or more of the corresponding processing operations. In a further alternative, the functions may be implemented as a combination of hardware, software or firmware elements, using application-specific integrated circuits (ASICs) and/or processors, for example digital signal processors (DSPs).

Of course, the various components within the audio subscriber unit (100) can be realized in discrete or integrated component form, with the ultimate structure being merely an arbitrary selection.

There are several ways of obtaining an indication of energy acceleration for use in the preferred embodiment of the present invention.

(i) The theoretically ideal approach is to double-differentiate the energy level across successive frames of an utterance, as can be seen in previously published US 6009391. The drawback of this approach is that it introduces delay, because a number of frames on either side of the frame under analysis need to be analyzed.

(ii) A zero-delay estimate of energy acceleration can be obtained by comparing the ratio of an instantaneous value to a short-term average, using for example a frame average or a rolling average as follows.

Frame average:

Rolling average:

In each case, the method returns values that can be interpreted as 'deceleration' < 1 < 'acceleration'. Experimental values can then be found for the denominator length and the related averaging parameters that best discriminate speech from noise.
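The two zero-delay estimates named above can be sketched in code as follows. The source text does not give the exact expressions, so this is only one plausible reading: both functions divide the current frame energy by an average of earlier frames only, the window length `n` is a hypothetical value, and `alpha=0.2` is borrowed from the rolling-average weights suggested elsewhere in the description. The function names are illustrative.

```python
from collections import deque


def frame_average_ratio(energies, n=4):
    """Ratio of each frame energy to the mean of the previous n frames.

    Assumption: the current frame is excluded from the denominator, and
    frames without a full history are reported as 1.0 (no acceleration).
    """
    history = deque(maxlen=n)
    ratios = []
    for energy in energies:
        if len(history) == n:
            mean = sum(history) / n
            ratios.append(energy / mean if mean > 0 else float("inf"))
        else:
            ratios.append(1.0)
        history.append(energy)
    return ratios


def rolling_average_ratio(energies, alpha=0.2):
    """Ratio of each frame energy to an exponentially weighted rolling mean."""
    ratios = []
    average = None
    for energy in energies:
        if average is None:
            # Seed the rolling mean with the first frame.
            average = energy
            ratios.append(1.0)
            continue
        ratios.append(energy / average if average > 0 else float("inf"))
        # Update after the comparison so the estimate stays zero-delay.
        average = (1 - alpha) * average + alpha * energy
    return ratios
```

Values above 1 indicate acceleration and values below 1 indicate deceleration, matching the interpretation given in the text.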

The inventors have recognized that the preferred, optimal solution is to find a denominator that can track non-stationary noise rapidly but is too long to track a voice onset. A suggested sequence of weights for the rolling average is a = 0.2, b = 0.8*a, c = 0.8*b, and so on, which can be expressed simply in recursive form.
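The recursive expression itself is not reproduced in the source text. One way to write a rolling average with the geometric weights 0.2, 0.8 × 0.2, 0.8² × 0.2, ... is shown below; the symbols X_t (frame energy), \bar{X}_t (rolling average) and A_t (acceleration estimate), and the choice to compare against the previous average, are assumptions made for illustration.

```latex
\bar{X}_t \;=\; \sum_{k=0}^{\infty} 0.2\,(0.8)^{k}\,X_{t-k}
          \;=\; 0.8\,\bar{X}_{t-1} + 0.2\,X_t ,
\qquad
\sum_{k=0}^{\infty} 0.2\,(0.8)^{k} = 1 ,
\qquad
A_t \;=\; \frac{X_t}{\bar{X}_{t-1}} .
```

The weights sum to one, so the rolling average stays on the same scale as the frame energies and A_t can be read directly against 1.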

The preferred VAD and parameter initialization schemes within the detection stage are summarized in the flowchart of FIG. 2. In non-stationary noise, long-term energy thresholds are an unreliable indicator of speech. Similarly, in high-noise conditions the structure of speech (for example, harmonics) cannot be wholly relied upon as an indicator, since it may be corrupted by noise, or structured noise may confuse the detector. The preferred voice activity detector therefore uses a noise-robust characteristic of speech, namely the energy acceleration associated with voice onset.

Referring now to FIG. 2, a flowchart (200) of the preferred detection process is shown. As mentioned above, the process involves frame-by-frame analysis. The preferred VAD mechanism involves a 'whole spectrum' measurement process. The frame counter is first evaluated to determine whether it is less than 'N', which defines the number of buffered frames, as shown in step 205. As an example in the preferred embodiment, with each frame set to a 10 msec increment, 'N' is set to 15. If the frame counter is less than 'N' in step 205, the rolling average for the initial acceleration test is updated, as in step 210. If the frame counter is not less than 'N' in step 205, step 210 is skipped.

A determination is then made as to whether the energy acceleration measure is within one or more defined margins, as shown in step 235. If it is, the rolling average is updated with the results of a further energy acceleration test, as in step 240. If it is not, step 240 is skipped.

A determination is then made as to whether the energy acceleration measure is greater than a defined threshold, as shown in step 260. If it is, the frame is assumed to be a speech frame, as in step 265. If it is not, the frame is assumed to be a noise frame, as in step 270.

The frame counter is then incremented, as in step 275, and the process repeats from step 205.

Instead of, or in addition to, the whole-spectrum measurement process, a sub-region measurement process, shown in optional steps 215 and 245, may be performed as an enhancement. A particular sub-region of the spectrum is selected as being the most likely to contain the fundamental pitch.

In the sub-region process, once the rolling average for the initial acceleration test has been updated in step 210 of the whole-spectrum measurement, a determination is made as to whether the energy acceleration measure exceeds a threshold, as shown in step 220. If it does, initialization of the other parameters is suspended, as shown in step 225. If it does not, the initialization of the other parameters is updated, as in step 230. The process then returns to step 235, as shown.

A further preferred determination is made after the determination in step 235 of whether the energy acceleration measure is within one or more defined margins. The deceleration value is evaluated in step 250 to determine whether it is 'high' and, if so, the rolling average for the energy deceleration test is updated slowly, as shown in step 255. The process then returns to the whole-spectrum method at step 260.

In this manner, the generally higher signal-to-noise ratios (SNRs) seen by the sub-region detector make it more noise robust. However, it is vulnerable to changes of microphone and speaker, as well as to band-limited noise. These measurements therefore cannot be relied upon in all circumstances. Consequently, the preferred embodiment of the present invention uses the sub-region detector to augment the whole-spectrum measure.

A further measurement process is preferably performed using, for example, the 'acceleration' of the variance of the values within the lower half of the spectrum of each frame. This variance measure detects structure within the lower half of the spectrum, making it highly sensitive to voiced speech. The variance measure follows the sub-region processing scheme, with the lower half of the spectrum as the selected sub-region. It further complements the whole-spectrum measure, which is better able to detect unvoiced and plosive speech.
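The per-frame quantity behind the variance measure can be sketched as below. This is an illustration under assumptions: the spectrum is taken to be a plain list of non-negative band values, "lower half" is read as the first half of that list, and the population variance is used. The function name is hypothetical. The 'acceleration' of this value would then be obtained by comparing it with its own short-term average.

```python
def lower_half_variance(spectrum):
    """Population variance of the values in the lower half of a spectrum.

    Assumption: `spectrum` is ordered from low to high frequency, so the
    first half of the list is the lower half of the spectrum.
    """
    half = spectrum[: len(spectrum) // 2]
    if not half:
        raise ValueError("spectrum must contain at least two values")
    mean = sum(half) / len(half)
    # Peaked, harmonic spectra give a large spread; flat noise gives little.
    return sum((value - mean) ** 2 for value in half) / len(half)
```

A flat lower half yields zero, while alternating peaks and troughs, as harmonics would produce, yield a clearly positive value.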

All three measures take their raw input from a spectral representation of the filter gains generated by the first stage of a double Wiener filter, as described in US patent application US 09/427497 (applicant Motorola, inventor Yan-Ming Chen). As described above, each measure uses a different aspect of this data.

In particular, the whole-spectrum detector uses the known Mel-filtered spectral representation of the filter gains generated by the first stage of the double Wiener filter. A single input value is obtained by squaring the sum of the Mel filter banks.
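The single whole-spectrum input value described above can be sketched in a few lines. The function name and the representation of the Mel filter bank outputs as a list of floats are assumptions for illustration; computing those outputs from the Wiener filter gains is outside this sketch.

```python
def whole_spectrum_input(mel_band_values):
    """Collapse one frame of Mel filter bank outputs into a single value.

    Follows the description in the text: sum the bands, then square the sum.
    """
    total = sum(mel_band_values)
    return total * total
```

For example, three bands with values 1.0, 2.0 and 3.0 sum to 6.0 and give an input value of 36.0.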

본 발명의 바람직한 실시예에서, 전체 스펙트럼 검출기는 후술된 바와 같이 다음의 공정을 모든 프레임들에 적용한다.In a preferred embodiment of the present invention, the full spectrum detector applies the following process to all frames as described below.

Step 1 initializes the noise estimate tracker (Tracker) as follows.

If Frame < 15 and Acceleration < 2.5, then Tracker = MAX(Tracker, Input).

The energy acceleration measurement prevents the tracker from being updated if speech occurs within the lead-in time of 15 frames.

Step 2 updates the tracker value as follows if the current input is similar to the noise estimate.

If Input < Tracker*UpperBound and Input > Tracker*LowerBound, then Tracker = a*Tracker + (1-a)*Input.

Step 3 provides a fail-safe mechanism for cases where speech or uncharacteristically loud noise is present within the first few frames. It allows the resulting erroneously high noise estimate to decay. Step 3 preferably functions as follows.

If Input < Tracker*Floor, then Tracker = b*Tracker + (1-b)*Input.

Step 4 returns a 'true' speech decision as follows if the current input exceeds 165% of the tracker.

If Input > Tracker*Threshold, output TRUE; otherwise output FALSE.

The ratio of the instantaneous input to the short-term average tracker is a function of the energy acceleration of successive inputs.

In the above:

a = 0.8 and b = 0.97;

UpperBound is 150% and LowerBound is 75%;

Floor is 50%; and

Threshold is 165%.
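The four steps and the parameter values above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the class and method names are hypothetical, the percentages are expressed as fractions (150% as 1.5), the frame counter is assumed to start at 1, the tracker is assumed to start at zero, and the Step 1 operation is read as MAX, matching the sub-band procedure.

```python
from dataclasses import dataclass


@dataclass
class FullSpectrumDetector:
    # Parameter values taken from the text; names are illustrative only.
    a: float = 0.8
    b: float = 0.97
    upper_bound: float = 1.50   # 150%
    lower_bound: float = 0.75   # 75%
    floor: float = 0.50         # 50%
    threshold: float = 1.65     # 165%
    lead_in_frames: int = 15
    accel_limit: float = 2.5
    tracker: float = 0.0        # assumed initial value
    frame: int = 0              # assumed: first processed frame is frame 1

    def step(self, value: float, acceleration: float) -> bool:
        """Process one frame; return True for a 'speech' decision."""
        self.frame += 1
        # Step 1: initialise the noise tracker during the lead-in,
        # unless the acceleration suggests speech is already present.
        if self.frame < self.lead_in_frames and acceleration < self.accel_limit:
            self.tracker = max(self.tracker, value)
        # Step 2: adapt the tracker when the input resembles the noise estimate.
        if self.tracker * self.lower_bound < value < self.tracker * self.upper_bound:
            self.tracker = self.a * self.tracker + (1 - self.a) * value
        # Step 3: fail-safe slow decay of an erroneously high estimate.
        if value < self.tracker * self.floor:
            self.tracker = self.b * self.tracker + (1 - self.b) * value
        # Step 4: speech decision.
        return value > self.tracker * self.threshold
```

With a steady input of 1.0 the tracker settles at 1.0 and every decision is False; a later input of 3.0 with high acceleration leaves the tracker untouched and yields True.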

Note that the tracker is not updated if the value is greater than the upper bound, or lies between the lower bound and the floor. In addition, as described above, the energy acceleration input can be calculated as one of the following:

the double-differentiation of successive inputs; or

an estimate formed from the ratio of two rolling averages of the inputs.

Note that the ratio of the fast-adapting and slow-adapting rolling averages reflects the energy acceleration of successive inputs.

For example, the contribution rates for the averages described above are as follows:

(i) 0*mean + 1*input; and

(ii) ((Frame-1)*mean + 1*input)/Frame.

The energy acceleration measurement gives increasing sensitivity over the first 15 frames.
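As a rough illustration of the two ways of obtaining the energy acceleration input, the sketch below computes a discrete second difference and the ratio of a fast to a slow rolling average using the contribution rates (i) and (ii). The function names, the direction of the ratio (fast divided by slow), and the handling of a zero slow average are assumptions made for the example.

```python
def double_difference(inputs):
    """Discrete second difference x[n] - 2*x[n-1] + x[n-2] for n >= 2."""
    return [inputs[n] - 2 * inputs[n - 1] + inputs[n - 2]
            for n in range(2, len(inputs))]


def rolling_average_ratio(inputs):
    """Ratio of a fast to a slow rolling average, one value per frame."""
    ratios = []
    slow = 0.0
    for frame, value in enumerate(inputs, start=1):
        # (i) fast average: 0*mean + 1*input, i.e. the current input itself.
        fast = value
        # (ii) slow average: cumulative mean over all frames so far.
        slow = ((frame - 1) * slow + value) / frame
        # Assumed convention: report 0.0 when the slow average is zero.
        ratios.append(fast / slow if slow > 0 else 0.0)
    return ratios
```

Because the slow average weights each new input by 1/Frame, a sudden rise moves it less and less as frames accumulate, so the ratio responds more strongly later in the lead-in.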

The sub-band detector preferably uses the average of the second, third and fourth Mel filter banks derived for the 'full spectrum' measurement. The detector then applies the following process to every frame:

(i) Input = p*CurrentInput + (1-p)*PreviousInput;

(ii) if Frame < 15, Tracker = MAX(Tracker, Input);

(iii) if Input < Tracker*UpperBound and Input > Tracker*LowerBound, then Tracker = a*Tracker + (1-a)*Input;

(iv) if Input < Tracker*Floor, then Tracker = b*Tracker + (1-b)*Input;

(v) if Input > Tracker*Threshold, output TRUE; otherwise output FALSE.

In the sub-region measurement:

p = 0.75.

All other parameters are the same as for the full-spectrum measurement, except for the threshold, which is equal to 3.25.
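The sub-band procedure (i) to (v) might look like the following Python sketch. The function names are hypothetical, the Mel filter bank list is assumed to be zero-indexed (so the second to fourth banks are indices 1 to 3), PreviousInput is taken to be the previous unsmoothed input, and on the first frame the previous input is assumed equal to the current one.

```python
def sub_band_input(mel_banks):
    """Average of the 2nd, 3rd and 4th Mel filter banks (0-based indices 1..3)."""
    return sum(mel_banks[1:4]) / 3


def sub_band_decisions(band_means, p=0.75, a=0.8, b=0.97,
                       upper=1.5, lower=0.75, floor=0.5,
                       threshold=3.25, lead_in=15):
    """Return one True/False speech decision per frame."""
    tracker = 0.0       # assumed initial value
    previous = None
    decisions = []
    for frame, current in enumerate(band_means, start=1):
        prev = current if previous is None else previous
        # (i) smooth the input across two frames
        value = p * current + (1 - p) * prev
        previous = current
        # (ii) lead-in initialisation of the tracker
        if frame < lead_in:
            tracker = max(tracker, value)
        # (iii) adapt when the input resembles the noise estimate
        if tracker * lower < value < tracker * upper:
            tracker = a * tracker + (1 - a) * value
        # (iv) fail-safe decay
        if value < tracker * floor:
            tracker = b * tracker + (1 - b) * value
        # (v) decision
        decisions.append(value > tracker * threshold)
    return decisions
```

Under these assumptions no frame inside the lead-in can be flagged, because step (ii) keeps the tracker at least as large as the input there.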

For the spectral variance measurement, the variance of the values comprising the lower-frequency half of the narrowband spectral representation of the gain is used as the input for each frame. The detector then applies exactly the same process as for the full-spectrum measurement.

The variance is calculated as follows.

Here, N = FFT Length/4, and

w_i are the values of the narrowband spectral representation of the gain.
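The variance formula itself is not reproduced in this text. As a hedged illustration only, the sketch below assumes the ordinary population variance over the first N = FFT Length/4 values w_i; the normalisation actually used may differ, and the function name is hypothetical.

```python
def lower_half_variance(gains, fft_length):
    """Population variance of the first N = fft_length // 4 gain values.

    Assumes `gains` holds the narrowband spectral representation of the
    gain, ordered from low to high frequency, with at least N entries.
    """
    n = fft_length // 4
    w = gains[:n]
    mean = sum(w) / n
    return sum((v - mean) ** 2 for v in w) / n
```

A flat lower half gives a variance of zero, while harmonic structure such as voiced speech produces alternating high and low gains and a larger value.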

According to a preferred embodiment of the present invention, the three measurements described above are supplied to a VAD decision algorithm, as shown in the flowchart of FIG. 3. Successive inputs are fed into a buffer, which provides contextual analysis. This introduces a frame delay equal to the length of the buffer minus one frame.

Referring now to FIG. 3, a flowchart (300) of an acceleration-based voice activity verification process for noisy environments is illustrated in accordance with a preferred embodiment of the present invention.

For an N = 7 frame buffer, the most recent true/false speech input is stored at position N in the data buffer, as shown in step 305. The decision logic applies a number of, and preferably each of, the following steps.

Step 1:

V_N = Measure 1 OR Measure 2 OR Measure 3

If any one of the three measurements returns a 'true' speech indication, the input (V_N) is defined as 'true' (T).

Step 2:

The algorithm searches the buffer for the longest consecutive sequence of 'true' values, as in step 310; the length of this sequence is denoted M. For example, for the sequence 'T T F T T T F', M equals 3.

Step 3:

If M ≥ S_P and T < L_S, then T = L_S.

Here, S_P is the first threshold used in step 315. If the longest sequence of 'true' speech values equals or exceeds the first threshold in step 315, that is, there are S_P = 3 or more consecutive 'true' values, the buffer is determined to contain 'possible' speech. A short timer (T) of L_S = 5 frames (Time_1) is then activated in step 325, unless the timer is already at (or above) that value, as determined in step 320.

Step 4:

If M ≥ S_L, then: if F > F_S, T = L_M; otherwise T = L_L.

Here, S_L is the second threshold used in step 330. If there are S_L = 4 or more consecutive 'true' values, the buffer is again determined to contain 'possible' speech. If the current frame (F) is outside the initial lead-in safety period (F_S), as determined in step 355, a medium timer (T) of L_M = 22 frames is activated in step 340. Otherwise, a fail-safe long timer of L_L = 40 frames is used in step 345. This arrangement covers the case where speech present early in the utterance makes the VAD's initial noise estimate too high.

Step 5:

If M < S_P and T > 0, then T--.

If the process determines in step 350 that there are fewer than S_P = 3 consecutive 'true' values, and in step 355 that the timer is greater than zero, the timer is decremented in step 360.

Step 6:

If T > 0, output TRUE; otherwise output FALSE. If the timer is greater than zero in step 365, the process outputs a 'true' speech decision, as shown in step 370. Alternatively, if the timer is not greater than zero in step 365, the process outputs a 'noise' decision, as shown in step 375.

Step 7:

Frame++, shift the buffer left and return to Step 1. In preparation for the next frame, in step 380 the buffer is shifted left to accept the next input, as shown in FIG. 4. The output speech decision applies to the frame being ejected from the buffer. The process then repeats from step 305 for the next true/false input to the data buffer.
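Steps 1 to 7 can be sketched as a small Python class. This is an illustration under stated assumptions rather than the patented implementation: the class and method names are hypothetical, the buffer is assumed to start filled with 'false' entries, the frame counter is assumed to start at 1, and the lead-in safety period F_S is given a placeholder default of 15 because its value is not stated in this passage. A `deque` with `maxlen` performs the left shift of Step 7 when a new input is appended.

```python
from collections import deque


def longest_true_run(flags):
    """Length M of the longest consecutive run of True values (Step 2)."""
    best = run = 0
    for flag in flags:
        run = run + 1 if flag else 0
        best = max(best, run)
    return best


class VadDecision:
    def __init__(self, n=7, s_p=3, s_l=4, l_s=5, l_m=22, l_l=40, f_s=15):
        # f_s=15 is an assumed placeholder, not a value from the text.
        self.buffer = deque([False] * n, maxlen=n)
        self.s_p, self.s_l = s_p, s_l
        self.l_s, self.l_m, self.l_l = l_s, l_m, l_l
        self.f_s = f_s
        self.timer = 0
        self.frame = 0

    def push(self, measure1, measure2, measure3):
        """Feed one frame's three measurements; return the speech decision."""
        self.frame += 1
        # Step 1 (and the shift of Step 7): store V_N at the newest position.
        self.buffer.append(bool(measure1 or measure2 or measure3))
        # Step 2: longest run of 'true' values in the buffer.
        m = longest_true_run(self.buffer)
        # Step 3: short hangover timer.
        if m >= self.s_p and self.timer < self.l_s:
            self.timer = self.l_s
        # Step 4: medium timer outside the lead-in, long timer inside it.
        if m >= self.s_l:
            self.timer = self.l_m if self.frame > self.f_s else self.l_l
        # Step 5: count the timer down when no run of S_P remains.
        if m < self.s_p and self.timer > 0:
            self.timer -= 1
        # Step 6: decision.
        return self.timer > 0
```

In this sketch the value returned by `push` is the decision computed when a frame enters the buffer; per the text it is attributed to the frame leaving the buffer, which is where the delay of the buffer length minus one frame comes from.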

It is contemplated that alternative mechanisms for making the speech or noise decision, based on the energy acceleration process described above, may be implemented. For example, the decision mechanism need not be based on one or more timers, and may instead make its decision solely on whether one or more energy acceleration thresholds are exceeded.

Referring now to FIG. 4, an example of the buffer operation (400) according to a preferred embodiment of the present invention is shown in more detail. Assume that the first threshold is set to three consecutive 'true' values. At time 't' (410), assume that only the current input (frame #7) (425) and the previous input (frame #6) (420) are 'true'. Consequently, when the buffer is shifted, the first frame (frame #1) (415) will be marked as false.

At time 't+1' (430), a third 'true' input (frame #8) (450) is received, supplementing the two earlier 'true' inputs (440, 445). Consequently, when the buffer is shifted, the next output frame (frame #2) will be marked as 'true'.

Note that the only constraints in the above decision process are (i) Time_1 < Time_2 < Time_3 and (ii) Threshold_1 < Threshold_2.

If only these three inputs (frame #6, frame #7 and frame #8) are 'true', the entire output sequence will be as follows.

Here, frames #2-#5 are 'true' because of the buffer lead-in function. Frames #6-#8 are 'true' as the positions of the actual original 'true' speech inputs. Frames #9-#12 are 'true' because of the buffer lead-out function. Frames #13-#18 are 'true' in response to the timer hangover being used. Once all the frames in the utterance have been input, the buffer shifts in 'false' entries (frames #19-#L_M) until it is empty.

It is contemplated that the buffer length and the hangover timers may be adjusted dynamically to meet the needs of the audio communication unit. The preferred embodiment, which uses a buffer length 'N' of 8 and a hangover timer of 5 frames, is given for illustration only. The buffer length 'N' must, however, always be chosen such that N ≥ S_L.

In addition to serving as a VAD in its own right, it is contemplated that the energy acceleration measurement performed in the method steps of FIG. 2 can be used to validate the initialization of other parameters. For example, a spectral subtraction scheme requires an initial estimate of the noise based on the first ten frames of the utterance (typically 100 msec). Even in stationary noise, several events can occur that invalidate this initial estimate. Examples of such events include the following.

(a) Ramp-up of the signal:

For a variety of reasons, a recording that has just started may 'ramp up' to full volume within the period being evaluated. Causes of such a ramp-up include buffer-fill in digital systems, and capacitance or tape-head engagement in analog systems. The effect of such events is to invalidate the estimate. The energy acceleration measurement can therefore be used to detect such a ramp-up and prevent the error.

(b) Spikes in the initial signal:

A common 'spike' is caused by fully depressing the Press-To-Talk (PTT) button on a subscriber radio unit, where electrical contact is made before the button strikes the back of the switch. The energy acceleration measurement described above can be used to halt the estimation process, as shown in step 225 of FIG. 2, when such events occur.

(c) Speech in the initial signal:

Another common occurrence with PTT systems is that the user starts talking as soon as the PTT button is pressed. In this case, electrical contact is made after speech has already begun. The energy acceleration measurement can identify this, as shown in step 255 of FIG. 2, so that noise-based initializations can be halted or default estimates used instead.

In summary, a communication unit has been described that includes an audio processing unit having a voice activity detection mechanism. The voice activity detection mechanism provides an indication of the energy acceleration of a signal input to the communication unit and determines, based on that indication, whether the input signal is speech or noise.

In addition, a method of detecting a speech signal input to a communication unit has been described. The method includes the steps of indicating the acceleration of an input signal to the communication unit, and determining, based on the indicating step, whether the input signal is speech or noise.

In addition, a method of determining whether a signal input to a communication unit is speech or noise has been described. The method includes the step of determining whether the input signal is speech or noise based on energy acceleration, using, for example, a frame average or rolling average of a plurality of input signals.

It will therefore be appreciated that the voice activity detector and validator for noisy environments described above provides the advantages of noise robustness and fast response. Because the preferred embodiment uses a measurement based on energy acceleration rather than an absolute measurement, the inventive concept described herein can be applied to speech at any input level.

Although specific and preferred implementations of embodiments of the present invention have been described above, it will be apparent to those skilled in the art that variations and modifications of the inventive concept can be made that fall within the scope of the present invention.

Thus, the improved voice activity detector and validator for noisy environments substantially alleviates the aforementioned disadvantages associated with the prior art.

Claims

1. A communication unit (100) comprising an audio processing unit (109) having a voice activity detection mechanism (130, 135),

wherein the voice activity detection mechanism (130, 135) measures the energy acceleration of a signal input to the communication unit (100) and determines, based on the measurement, whether the input signal is speech or noise.

2. The communication unit of claim 1, wherein the voice activity detection mechanism includes a voice activity detection function (130) that performs frame-by-frame detection on signals input to the voice activity detection mechanism (130, 135).

3. The communication unit of claim 2, wherein the frame-by-frame detection comprises performing energy acceleration measurements on the signal input to the voice activity detection mechanism (130, 135) for one or more of the following ranges:

(i) full spectrum;

(ii) spectral sub-bands; and

(iii) spectral variance.

4. The communication unit of claim 3, wherein the voice activity detection mechanism comprises a voice activity determination function (135), operatively coupled to the voice activity detection function (130), that determines whether the input signal is speech based on a buffering operation on one or more of the measurements.

5. The communication unit according to claim 4, wherein the voice activity determination function (135) determines whether an input signal is speech using a frame average or rolling average of a plurality of said input signals.

6. The communication unit according to claim 2, wherein if the energy acceleration measurement yields an energy acceleration value greater than an energy acceleration threshold, the input frame is estimated (265) to be a speech frame.

7. The communication unit of claim 6, wherein the determination (265) that the input frame is a speech frame is applied retrospectively to a previous frame in a buffer of input signals.

8. The communication unit according to claim 6 or 7, wherein if the energy acceleration measurement yields an energy acceleration value greater than the energy acceleration threshold over a plurality of consecutive frames, the input signal is estimated (370) to be a speech signal.

9. The communication unit according to any one of claims 3 to 8, wherein if a sub-region of the input signal spectrum is selected, the selection is based on the sub-region most likely to contain the fundamental pitch of the speech signal.

10. The communication unit according to any of the preceding claims, wherein the voice activity detection mechanism (130, 135) uses the acceleration of speech-energy related features to validate other speech- or noise-related metrics, for example to validate the parameter initialization of a spectral subtraction scheme.

11. A method of detecting a speech signal input to a communication unit, the method comprising:

measuring the acceleration of, or change in, the energy of the signal input to the communication unit; and

determining (315, 330, 350) whether the input signal is speech (370) or noise (375) based on the measuring step.

12. The method of claim 11, further comprising performing frame-by-frame detection on signals input to the communication unit.

13. The method of claim 12, wherein the frame-by-frame detection comprises performing an energy acceleration measurement on the input signal for one or more of the following ranges:

(i) full spectrum;

(ii) spectral sub-bands; and

(iii) spectral variance.

14. The method according to any of claims 11 to 13, for determining whether a signal input to a communication unit is speech or noise, the method further comprising:

determining (315, 330, 350) whether the input signal is speech (370) or noise (375) based on an energy acceleration or change-in-energy measurement of the input signal, using, for example, a frame average or rolling average of a plurality of input signals.

15. The method of claim 14, wherein the determining step comprises:

if the energy acceleration measurement yields an energy acceleration value greater than an energy acceleration threshold, determining (265) that the input frame is a speech frame; and

retroactively applying the determination to a previous frame in a buffer of input signals.