KR20040030638A - Detecting voiced and unvoiced speech using both acoustic and nonacoustic sensors

Info

Publication number: KR20040030638A
Application number: KR10-2003-7015511A
Authority: KR
Inventors: Gregory C. Burnett
Original assignee: AliphCom
Priority date: 2001-05-30
Filing date: 2002-05-30
Publication date: 2004-04-09
Also published as: KR100992656B1; EP1415505A1; CA2448669A1; CN1513278A; JP2005503579A

Abstract

Systems and methods are provided for detecting voiced and unvoiced speech in acoustic signals having varying levels of background noise. The system (FIG. 3) receives acoustic signals at two microphones, Mic1 and Mic2, and generates a difference parameter between the acoustic signals received at each of the two microphones. The difference parameter represents the relative difference in signal gain between portions of the received acoustic signals. The system classifies the acoustic signal information as unvoiced speech when the difference parameter exceeds a first threshold, and as voiced speech when the difference parameter exceeds a second threshold. Further, embodiments of the system include a non-acoustic sensor 20 that receives physiological information to aid in classifying voiced speech.

Description

System and method for detecting voiced and unvoiced sound using acoustic and non-acoustic sensors {DETECTING VOICED AND UNVOICED SPEECH USING BOTH ACOUSTIC AND NONACOUSTIC SENSORS}

The ability to accurately distinguish voiced from unvoiced speech is important to many speech applications, including speech recognition, speaker verification, and noise suppression. In a typical acoustic application, speech from a human speaker is captured and transmitted to a receiver at a different location. The speaker's environment may contain one or more noise sources that contaminate the speech signal, the signal of interest, with unwanted acoustic noise. This can make it difficult or impossible for the receiver, whether human or machine, to understand the user's speech.

Typical methods for classifying voiced and unvoiced speech rely mainly on the acoustic content of microphone data, which is plagued by noise and the resulting uncertainty in signal content. This is increasingly problematic as portable communication devices such as cellular telephones and personal digital assistants (PDAs) become widespread, because in most cases the quality of service provided by such a device depends on the quality of the voice service it offers. Methods for suppressing the noise present in speech signals are known in the art, but they have performance shortcomings: long computation times, a need for cumbersome hardware to perform the signal processing, and distortion of the signal of interest.

The present invention relates to the processing of speech signals.

FIG. 1 is a block diagram of a NAVSAD system according to one embodiment of the invention.

FIG. 2 is a block diagram of a PSAD system according to one embodiment of the invention.

FIG. 3 is a block diagram of a denoising system, referred to as the Pathfinder system, according to one embodiment of the invention.

FIG. 4 is a flow diagram of a detection algorithm for use in detecting voiced and unvoiced speech, according to one embodiment of the invention.

FIG. 5A is a plot of the received GEMS signal 502 for an utterance, along with the mean correlation 504 between the GEMS signal and the Mic1 signal and the threshold T1 used for voiced speech detection.

FIG. 5B is a plot of the received GEMS signal 502 for an utterance, along with the standard deviation 506 of the GEMS signal and the threshold T2 used for voiced speech detection.

FIG. 6 is a plot of voiced speech 602 detected from an acoustic or audio signal 608, along with the GEMS signal 604 and the acoustic noise 606.

FIG. 7 is a diagram of a microphone array for use in one embodiment of the PSAD system.

FIG. 8 is a plot of ΔM versus d1 for several Δd values, according to one embodiment.

FIG. 9 is a plot of the gain parameter, computed as the sum of the absolute values of H1(z), together with the acoustic data or audio from microphone 1.

FIG. 10 is an alternative plot of the acoustic data presented in FIG. 9.

Systems and methods for discriminating voiced and unvoiced speech from background noise are provided below, including a Non-Acoustic Sensor Voiced Speech Activity Detection (NAVSAD) system and a Pathfinder Speech Activity Detection (PSAD) system. The noise removal and reduction methods provided herein overcome the shortcomings of typical systems known in the art by cleaning acoustic signals of interest without distortion, while still discriminating and classifying unvoiced and voiced speech from background noise.

FIG. 1 is a block diagram of a NAVSAD system 100 according to one embodiment of the invention. The NAVSAD system couples microphones 10 and sensors 20 to at least one processor 30. The sensors 20 of one embodiment include voicing activity detectors or non-acoustic sensors. The processor 30 controls subsystems including a detection subsystem 50, referred to herein as the detection algorithm, and a denoising subsystem 40. Operation of the denoising subsystem 40 is described in detail in the Related Applications section. The NAVSAD system works very well in any background acoustic noise environment.

FIG. 2 is a block diagram of a PSAD system 200 according to one embodiment of the invention. The PSAD system couples microphones 10 to at least one processor 30. The processor 30 includes a detection subsystem 50, referred to herein as the detection algorithm, and a denoising subsystem 40. The PSAD system is highly sensitive in low acoustic noise environments and relatively insensitive in high acoustic noise environments. The PSAD can operate independently or as a backup to the NAVSAD, detecting voiced speech if the NAVSAD fails.

The detection subsystem 50 and the denoising subsystem 40 of both the NAVSAD and PSAD systems of one embodiment are algorithms controlled by the processor 30, but are not so limited. Alternative embodiments of the NAVSAD and PSAD systems can include detection subsystems 50 and/or denoising subsystems 40 that comprise additional hardware, firmware, software, or combinations of these. Furthermore, the functions of the detection subsystem 50 and the denoising subsystem 40 may be distributed across numerous components of the NAVSAD and PSAD systems.

FIG. 3 is a block diagram of a denoising subsystem 300, referred to herein as the Pathfinder system, according to one embodiment of the invention. The Pathfinder system is briefly described below and is described in detail in the Related Applications section. Two microphones, Mic1 and Mic2, are used in the Pathfinder system, and Mic1 is considered the "signal" microphone. With reference to FIG. 1, the Pathfinder system 300 is equivalent to the NAVSAD system 100 when the voice activity detector (VAD) 320 is a non-acoustic voicing sensor 20 and the noise removal subsystem 340 includes the detection subsystem 50 and the denoising subsystem 40. With reference to FIG. 2, the Pathfinder system 300 is equivalent to the PSAD system 200 when the VAD 320 is absent and the noise removal subsystem 340 includes the detection subsystem 50 and the denoising subsystem 40.

The NAVSAD and PSAD systems support a two-level commercial approach: the relatively inexpensive PSAD system supports an acoustic approach that functions in low-noise to medium-noise environments, and the NAVSAD system adds a non-acoustic sensor to enable detection of voiced speech in any environment. Unvoiced speech is normally not detected using the sensor, because it does not sufficiently vibrate human tissue. In high-noise situations, however, detecting unvoiced speech is less important: its energy is usually very low and it is easily washed out by the noise, so it has little effect on the denoising of voiced speech. Unvoiced speech information is therefore most valuable when there is little or no noise, which means unvoiced detection should be highly sensitive in low-noise situations and insensitive in high-noise situations. This is not easily achieved, and comparable acoustic unvoiced detectors known in the art are incapable of operating under these environmental constraints.

The NAVSAD and PSAD systems include an array algorithm for speech detection that uses the difference in frequency content between two microphones to calculate a relationship between the signals of the two microphones. This contrasts with conventional arrays, which attempt to use the time/phase difference of each microphone to remove noise outside of an "area of sensitivity". The methods described herein provide a significant advantage, as they do not require a specific orientation of the array with respect to the signal.

Furthermore, the systems described herein are sensitive to noise of every type and every orientation, unlike conventional arrays that depend on specific noise orientations. Consequently, the frequency-based arrays presented herein are unique in that they depend only on the relative orientation of the two microphones themselves, with no dependence on the orientation of the noise and signal with respect to the microphones. This results in a signal processing system that is robust with respect to noise type, microphones, and the orientation between the noise/signal sources and the microphones.

The systems described herein use information derived from the Pathfinder noise suppression system and/or a non-acoustic sensor, both described in the Related Applications section, to determine the voicing state of an input signal. The voicing state includes silent, voiced, and unvoiced states. The NAVSAD system, for example, includes a non-acoustic sensor to detect the vibration of human tissue associated with speech. The non-acoustic sensor of one embodiment is a General Electromagnetic Movement Sensor (GEMS), but is not so limited. Alternative embodiments may use any sensor that can detect human tissue motion associated with speech and is unaffected by environmental acoustic noise.

The GEMS is a radio frequency (RF) device operating at 2.4 GHz that can detect the motion of dielectric interfaces in human tissue. It is an RF interferometer that uses homodyne mixing to detect small phase shifts associated with target motion. In essence, the sensor emits weak electromagnetic waves (less than 1 milliwatt) that reflect off whatever is near the sensor. The reflected waves are mixed with the original transmitted waves, and the result is analyzed for any change in position of the targets. Anything that moves near the sensor changes the phase of the reflected wave, and this change is amplified and presented as a change in the voltage output from the sensor. A similar sensor is described in Gregory C. Burnett's 1999 Ph.D. thesis at the University of California, Davis, "The Physiological Basis of Glottal Electromagnetic Micropower Sensors (GEMS) and Their Use in Defining an Excitation Function for the Human Vocal Tract".

FIG. 4 is a flow diagram of a detection algorithm 50 for use in detecting voiced and unvoiced speech, according to one embodiment of the invention. With reference to FIGS. 1 and 2, both the NAVSAD and PSAD systems include the detection algorithm 50 as the detection subsystem 50. The detection algorithm 50 operates in real time and, in one embodiment, operates on 20-millisecond windows and steps 10 milliseconds at a time, but is not so limited. The voice activity determination is recorded for the first 10 milliseconds, and the second 10 milliseconds functions as a "look-ahead" buffer. While one embodiment uses the 20/10 windows, alternative embodiments may use numerous other combinations of window values.
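The 20/10 windowing described above can be sketched as follows. This is only an illustration: the function name frame_bounds is hypothetical, and the 8000 Hz sample rate is an assumption borrowed from the rate quoted later for the cross-correlation.

```python
def frame_bounds(num_samples, sample_rate_hz=8000, window_ms=20, step_ms=10):
    """Return (start, end) sample indices of overlapping analysis windows."""
    window = sample_rate_hz * window_ms // 1000  # 160 samples at the assumed 8 kHz
    step = sample_rate_hz * step_ms // 1000      # 80 samples at the assumed 8 kHz
    return [(start, start + window)
            for start in range(0, num_samples - window + 1, step)]

# 100 ms of audio at 8 kHz. In each window the first half (one step) is the part
# the voicing decision is recorded for; the second half is the look-ahead buffer.
frames = frame_bounds(num_samples=800)
```

With these values consecutive windows overlap by half, so every 10-millisecond stretch of signal except the first is seen once as look-ahead and once as the decision region.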

A number of multidimensional factors were considered in developing the detection algorithm 50. The greatest concern was maintaining the effectiveness of the Pathfinder denoising technique described in the Related Applications section. Pathfinder performance can be compromised if adaptive filter training is conducted on speech rather than on noise. It is therefore important not to exclude any significant amount of speech from the voice activity determination (VAD), so that such disturbances are kept to a minimum.

Consideration was also given to the accuracy of the characterization between voiced and unvoiced speech signals, and to distinguishing each of these speech signals from noise signals. This type of characterization can be useful in applications such as speech recognition and speaker verification.

Furthermore, systems using the detection algorithm of one embodiment function in environments containing varying amounts of background acoustic noise. If the non-acoustic sensor is available, this external noise is not a problem for voiced speech. For unvoiced speech, however (and for voiced speech if the non-acoustic sensor is unavailable or has malfunctioned), the system relies on the acoustic data alone to separate noise from speech. One embodiment of the Pathfinder noise suppression system gains an advantage from its use of two microphones, and the spatial relationship between the microphones is exploited to assist in the detection of unvoiced speech. There may nevertheless be times when the noise level is high enough that the speech is nearly undetectable and the acoustic-only method fails. In these situations, the non-acoustic sensor (hereafter simply "the sensor") is required to ensure good performance.

In the two-microphone system, the speech source should be relatively louder in one designated microphone than in the other. Tests have shown that this requirement is easily met with conventional microphones when the microphones are placed on the head, since any noise should result in an H1 with a gain near unity.

With reference to FIGS. 1 and 3, the NAVSAD system relies on two parameters to detect voiced speech. These are the energy of the sensor in the window of interest, determined in one embodiment by the standard deviation (SD), and optionally the cross-correlation (XCORR) between the acoustic signal from microphone 1 and the sensor data. The energy of the sensor can be determined in any one of a number of ways, and the SD is just one convenient way to determine it.

For the sensor, the SD approximates the energy of the signal, which normally corresponds quite accurately to the voicing state but may be susceptible to movement noise (relative motion of the sensor with respect to the user) and electromagnetic noise. To further differentiate sensor noise from tissue motion, the XCORR can be used. The XCORR is calculated only to 15 delays, which corresponds to just under 2 milliseconds at 8000 Hz.

The XCORR can also be useful when the sensor signal is distorted or modulated in some fashion. For example, there are sensor locations (such as the jaw or the back of the neck) where speech production can be detected but where the signal may have incorrect or distorted time-based information. That is, the signal may not have well-defined features in time that match the acoustic waveform. However, the XCORR is more susceptible to errors from acoustic noise, and in high-noise environments (below 0 dB SNR) it is almost useless. It should therefore not be the sole source of voicing information.

The sensor detects human tissue motion associated with the closure of the vocal folds, so the acoustic signal produced by the closure of the folds is highly correlated with the closures. Sensor data that correlates highly with the acoustic signal is therefore declared speech, and sensor data that does not correlate well is termed noise. Because the speed of sound is relatively low (approximately 330 m/s), the acoustic data is expected to lag behind the sensor data by about 0.1 to 0.8 milliseconds (or about 1 to 7 samples). One embodiment nevertheless uses a 15-sample correlation, because the acoustic waveform varies significantly depending on the sound produced and a larger correlation width is needed to ensure detection.

The SD and XCORR signals are related, but they differ enough that voiced speech detection is more reliable when both are used. For simplicity, though, either parameter may be used alone. The values for the SD and XCORR are compared with empirical thresholds, and if both are above their thresholds, voiced speech is declared. Example data is presented and described below.
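The two-parameter test can be illustrated with a minimal sketch. Several things here are assumptions rather than details from the text: the function names, the threshold values, the use of a normalized (Pearson) correlation, and the choice to take the peak over the 15 lags (FIG. 5A plots a mean correlation, so an average would be an equally plausible reading).

```python
import math
import statistics

def correlation(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is flat)."""
    mean_a, mean_b = statistics.fmean(a), statistics.fmean(b)
    num = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    den = math.sqrt(sum((x - mean_a) ** 2 for x in a)
                    * sum((y - mean_b) ** 2 for y in b))
    return num / den if den > 0 else 0.0

def peak_xcorr(sensor, mic, max_lag=15):
    """Best correlation over lags 0..max_lag-1, with the mic lagging the sensor."""
    n = len(sensor)
    scores = [correlation(sensor[:n - lag], mic[lag:]) for lag in range(max_lag)]
    best_lag = max(range(max_lag), key=lambda lag: scores[lag])
    return scores[best_lag], best_lag

def is_voiced(sensor, mic, sd_threshold, xcorr_threshold):
    """Declare voiced speech only when both SD and XCORR exceed their thresholds."""
    sd = statistics.pstdev(sensor)  # SD as a stand-in for sensor energy
    xcorr, _ = peak_xcorr(sensor, mic)
    return sd > sd_threshold and xcorr > xcorr_threshold

def tone(i):
    """Synthetic 100 Hz tone sampled at 8 kHz (illustrative data only)."""
    return math.sin(2 * math.pi * 100 * i / 8000)

sensor_frame = [tone(i) for i in range(160)]   # one 20 ms window
mic_frame = [tone(i - 4) for i in range(160)]  # same tone, delayed by 4 samples
score, lag = peak_xcorr(sensor_frame, mic_frame)
voiced = is_voiced(sensor_frame, mic_frame, sd_threshold=0.1, xcorr_threshold=0.5)
silent = is_voiced([0.0] * 160, mic_frame, sd_threshold=0.1, xcorr_threshold=0.5)
```

In this synthetic case the correlation peaks at the 4-sample delay that was built into the microphone frame, and a flat sensor frame fails the SD test regardless of what the microphone picks up.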

FIGS. 5A, 5B, and 6 show data plots for an example in which a subject twice speaks the phrase "pop pan", according to one embodiment. FIG. 5A is a plot of the received GEMS signal 502 for this utterance, along with the mean correlation 504 between the GEMS signal and the Mic1 signal and the threshold T1 used for voiced speech detection. FIG. 5B is a plot of the received GEMS signal 502 for this utterance, along with the standard deviation 506 of the GEMS signal and the threshold T2 used for voiced speech detection. FIG. 6 is a plot of voiced speech 602 detected from the acoustic or audio signal 608, along with the GEMS signal 604 and the acoustic noise 606. No unvoiced speech is detected in this example because of the heavy background babble noise 606. The thresholds are set so that there are virtually no false negatives and only occasional false positives. A voiced speech activity detection accuracy of greater than 99% has been attained under any acoustic background noise conditions.

The NAVSAD can determine when voiced speech is occurring with a high degree of accuracy because of the non-acoustic sensor data. However, the sensor offers little help in separating unvoiced speech from noise, as unvoiced speech normally causes no detectable signal in most non-acoustic sensors. If there is a detectable signal, the NAVSAD can be used, although the SD method is then preferred because unvoiced speech is normally poorly correlated. In the absence of a detectable signal, the systems and methods of the Pathfinder noise removal algorithm are used to determine when unvoiced speech is occurring. A brief review of the Pathfinder algorithm is given below, and a detailed description is provided in the Related Applications section.

With reference to FIG. 3, the acoustic information coming into microphone 1 is denoted by m1(n), the information coming into microphone 2 is denoted by m2(n), and the GEMS sensor is assumed to be available to determine areas of voiced speech. In the z (digital frequency) domain, these signals are represented as M1(z) and M2(z). Then

M1(z) = S(z) + N2(z)

M2(z) = N(z) + S2(z)

with

N2(z) = N(z)H1(z)

S2(z) = S(z)H2(z)

so that

M1(z) = S(z) + N(z)H1(z)

M2(z) = N(z) + S(z)H2(z)    (1)

This is the general case for all two-microphone systems. There is always some leakage of noise into Mic1 and some leakage of signal into Mic2. Equation 1 has four unknowns and only two relationships, and therefore cannot be solved explicitly.

However, there is another way to solve for some of the unknowns in Equation 1. Examine the case where no signal is being generated, that is, where the GEMS signal indicates that voicing is not occurring. In this case, s(n) = S(z) = 0, and Equation 1 reduces to

M1n(z) = N(z)H1(z)

M2n(z) = N(z)

where the n subscript on the M variables indicates that only noise is being received. This leads to

M1n(z) = M2n(z)H1(z)

H1(z) = M1n(z) / M2n(z)    (2)

H1(z) can be calculated using any available system identification algorithm and the microphone outputs when only noise is being received. The calculation can be done adaptively, so that if the noise changes significantly, H1(z) can be recalculated quickly.
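One possible way to carry out such an adaptive calculation is a normalized LMS (NLMS) filter, sketched below. The text only calls for "any available system identification algorithm", so NLMS is a choice made here for illustration, and the function name, tap count, step size, and synthetic signals are all assumptions.

```python
import random

def nlms_identify(reference, target, num_taps=4, mu=0.5, eps=1e-8):
    """Estimate FIR taps h such that target[n] ~= sum_k h[k] * reference[n - k]."""
    h = [0.0] * num_taps
    for n in range(len(target)):
        # Most recent num_taps reference samples, zero-padded at the start.
        x = [reference[n - k] if n - k >= 0 else 0.0 for k in range(num_taps)]
        predicted = sum(hk * xk for hk, xk in zip(h, x))
        error = target[n] - predicted
        power = sum(xk * xk for xk in x) + eps
        # Normalized update: the step scales inversely with input power.
        h = [hk + mu * error * xk / power for hk, xk in zip(h, x)]
    return h

# Noise-only stretch: Mic2 sees the noise directly, Mic1 sees it through H1.
rng = random.Random(0)
m2_noise = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
true_h1 = [0.9, 0.2, -0.1]  # hypothetical transfer function taps
m1_noise = [sum(true_h1[k] * m2_noise[n - k]
                for k in range(len(true_h1)) if n - k >= 0)
            for n in range(len(m2_noise))]

h1_estimate = nlms_identify(m2_noise, m1_noise)
```

Because the update runs sample by sample, the same loop can keep tracking H1(z) if the noise environment changes, provided it is only run while the detector reports that no speech is present.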

With a solution for one of the unknowns in Equation 1, a solution can be found for another, H2(z), by using the amplitude of the GEMS or similar device along with the amplitudes of the two microphones. When the GEMS indicates voicing but the recent (less than 1 second) history of the microphones indicates low levels of noise, assume that n(n) = N(z) ≈ 0. Equation 1 then reduces to

M1s(z) = S(z)

M2s(z) = S(z)H2(z)

which in turn leads to

M2s(z) = M1s(z)H2(z)

H2(z) = M2s(z) / M1s(z)

This is the inverse of the H1(z) calculation, but note that different inputs are being used.

After H1(z) and H2(z) have been calculated as above, they are used to remove the noise from the signal. Equation 1 can be rewritten as

S(z) = M1(z) - N(z)H1(z)

N(z) = M2(z) - S(z)H2(z)

S(z) = M1(z) - [M2(z) - S(z)H2(z)]H1(z)

S(z)[1 - H2(z)H1(z)] = M1(z) - M2(z)H1(z)

Solving for S(z) yields

S(z) = [M1(z) - M2(z)H1(z)] / [1 - H2(z)H1(z)]

In practice H2(z) is usually quite small, so that H2(z)H1(z) << 1, and therefore

S(z) ≈ M1(z) - M2(z)H1(z)
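In the time domain, the simplified result amounts to filtering the Mic2 signal with an FIR estimate of H1 and subtracting the result from the Mic1 signal. The following is a minimal sketch under that reading; the function names and the sample values are illustrative, and H2(z) is taken to be exactly zero.

```python
def fir_filter(x, taps):
    """Causal FIR filter: y[n] = sum_k taps[k] * x[n - k]."""
    return [sum(taps[k] * x[n - k] for k in range(len(taps)) if n - k >= 0)
            for n in range(len(x))]

def remove_noise(m1, m2, h1_taps):
    """Time-domain form of S(z) ~= M1(z) - M2(z)H1(z)."""
    estimated_noise_in_m1 = fir_filter(m2, h1_taps)
    return [a - b for a, b in zip(m1, estimated_noise_in_m1)]

speech = [0.0, 0.5, 1.0, 0.5, 0.0, -0.5, -1.0, -0.5]
noise = [0.3, -0.2, 0.1, 0.4, -0.3, 0.2, -0.1, 0.0]
h1 = [0.8, 0.1]  # hypothetical H1 taps

m1 = [s + v for s, v in zip(speech, fir_filter(noise, h1))]  # speech plus leaked noise
m2 = noise                                                   # no speech leakage (H2 = 0)
cleaned = remove_noise(m1, m2, h1)
```

When H2(z) is not negligible, some speech leaks into Mic2 and this subtraction removes a filtered copy of the speech as well, which is what the 1 - H2(z)H1(z) denominator in the full expression accounts for.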

The PSAD system is described with reference to FIGS. 2 and 3. As sound waves propagate, they normally lose energy through diffraction and dispersion. Assuming the sound waves originate from a point source and radiate isotropically, their amplitude decreases as a function of 1/r, where r is the distance from the point source. This 1/r dependence is the worst case, since the reduction is smaller when the sound is confined to a smaller area. It is nevertheless an adequate model for this configuration, namely the propagation of noise and speech to microphones located somewhere on the user's head.

도 7은 PSAD 시스템의 한 실시예에 따르는 마이크로폰 어레이의 도면이다. 입과 선형 어레이로 마이크로폰 Mic1, Mic2를 어레이 중간선 상에 위치시킴으로서, Mic1과 Mic2의 신호 강도차는 d1과 Δd에 비례할 것이다. 1/r 관계(또는 이 경우에 1/d)를 가정할 때, 다음의 결과를 얻을 수 있다.7 is a diagram of a microphone array according to one embodiment of a PSAD system. By placing the microphones Mic1, Mic2 on the array midline with the mouth and linear array, the signal strength difference between Mic1 and Mic2 will be proportional to d1 and Δd. Assuming a 1 / r relationship (or 1 / d in this case), the following results can be obtained.

ΔM = |Mic1| / |Mic2| = ΔH1(z) ∝ (d1 + Δd) / d1

Here ΔM is the difference in gain between Mic1 and Mic2 and therefore corresponds to H1(z) in Equation 2. The variable d1 is the distance from Mic1 to the speech or noise source.
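The relation ΔM ∝ (d1 + Δd)/d1 can be checked numerically. The sketch below assumes a proportionality constant of one and an illustrative spacing of Δd = 2 cm; both are assumptions made for this example, and the function name `gain_ratio` is hypothetical.

```python
def gain_ratio(d1_cm, delta_d_cm):
    """Gain ratio under the 1/r model: (d1 + delta_d) / d1.

    Assumes a proportionality constant of one and identical microphone
    responses; d1_cm is the distance from Mic1 to the source.
    """
    if d1_cm <= 0:
        raise ValueError("d1 must be positive")
    return (d1_cm + delta_d_cm) / d1_cm


if __name__ == "__main__":
    delta_d = 2.0  # illustrative spacing in cm, not a value from the text
    for d1 in (2.0, 5.0, 30.0, 100.0):
        print(f"d1 = {d1:6.1f} cm  ->  ratio = {gain_ratio(d1, delta_d):.3f}")
```

With these assumed values the ratio is 2.0 for a source 2 cm away and falls to about 1.07 at 30 cm, which is the sense in which distant sources give a gain close to unity.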

FIG. 8 is a plot 800 of ΔM versus d1 for several values of Δd, according to one embodiment of the invention. ΔM becomes larger as Δd becomes larger and as the noise source comes closer. The variable Δd changes with the orientation of the array relative to the speech/noise source, from a maximum on the array midline to zero perpendicular to the array midline. The plot 800 shows that for small Δd and for distances of approximately 30 cm or more, ΔM is close to unity. Because most noise sources are farther away than 30 cm and are unlikely to lie on the midline of the array, ΔM (or equivalently the gain of H1(z)) will probably be close to unity when H1(z) is calculated as in Equation 2. Conversely, for noise sources that are close (within a few centimeters), there can be a substantial difference in gain depending on which microphone is closer to the noise.

If the "noise" is the user's own speech and Mic1 is closer to the mouth than Mic2, the gain increases. Because environmental noise normally originates much farther from the user's head than speech does, noise is found while the gain of H1(z) is near unity or some fixed value, and speech can be found after a sharp rise in gain. The speech can be unvoiced or voiced, as long as it is loud enough relative to the surrounding noise. The gain stays somewhat high during the speech portions and then drops quickly after the speech ceases. The rapid increase and decrease in the gain of H1(z) should be sufficient to detect speech under almost any circumstances. The gain in this example is calculated as the sum of the absolute values of the filter coefficients. This sum is not equivalent to the gain, but the two are related in that a rise in the sum of the absolute values reflects a rise in the gain.

As an example of this behavior, FIG. 9 shows a plot 900 of the gain parameter 902, computed as the sum of the absolute values of H1(z), together with the acoustic data 904 (audio) from microphone 1. The speech signal is an utterance of the phrase "pop pan", repeated twice. The evaluated bandwidth covered the frequency range from 2500 Hz to 3500 Hz, and the band from 1500 Hz to 2500 Hz was additionally used. The gain rises rapidly when the unvoiced speech first appears and returns quickly to its normal value when the speech ends. The large changes in gain that result from transitions between noise and speech can be detected with standard signal processing techniques. Here the standard deviation of the last few gain calculations is used, with the limits defined by a running average of the standard deviations and by the standard deviation noise floor. The later changes in gain for the voiced speech are suppressed in this plot 900 for clarity.
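The detection step described here (a gain proxy taken as the sum of absolute filter coefficients, followed by a limit on the standard deviation of the last few gain values) can be sketched as follows. This is only an illustration under stated assumptions: the window length, the multiplying factor, the smoothing constant `alpha` and the noise floor value are hypothetical, since the text does not give them.

```python
from statistics import pstdev


def gain_proxy(h1_coeffs):
    """Sum of absolute filter coefficients, used as a stand-in for gain."""
    return sum(abs(c) for c in h1_coeffs)


def detect_gain_changes(gains, window=4, factor=3.0, noise_floor=1e-3, alpha=0.1):
    """Flag windows where the recent gain values vary unusually strongly.

    For each position, the standard deviation of the last `window` gains is
    compared against `factor` times a baseline. The baseline is a running
    average of standard deviations seen during non-speech windows, bounded
    below by `noise_floor`. All parameter values are assumptions.
    """
    flags = []
    running_avg = None
    for i in range(len(gains)):
        recent = gains[max(0, i - window + 1): i + 1]
        sd = pstdev(recent) if len(recent) > 1 else 0.0
        baseline = noise_floor if running_avg is None else max(running_avg, noise_floor)
        changed = sd > factor * baseline
        flags.append(changed)
        if not changed:
            # Update the baseline only from windows judged to be noise.
            running_avg = sd if running_avg is None else (1 - alpha) * running_avg + alpha * sd
    return flags


if __name__ == "__main__":
    gains = [1.0] * 10 + [3.0] * 3 + [1.0] * 10
    print(detect_gain_changes(gains))
```

Note that this sketch flags the transitions into and out of the raised gain, which is where the standard deviation is large; turning those transitions into speech segments would need additional logic that is not shown.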

FIG. 10 is an alternative plot 1000 of the acoustic data presented in FIG. 9. The data used to form plot 900 is presented again in plot 1000, together with audio data 1004 and GEMS data 1006 recorded without noise so that the unvoiced speech is apparent. The voicing signal 1002 (V) can take three values: 0 for noise, 1 for unvoiced speech, and 2 for voiced speech. Denoising is performed only when V = 0. The unvoiced speech is clearly captured well, apart from two single dropouts in the unvoiced detection near the end of each "pop". These single-window dropouts are uncommon and do not significantly affect the denoising algorithm. They can easily be removed with standard smoothing techniques.
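One simple way to remove single-window dropouts of this kind is sketched below. The text only says that standard smoothing techniques can be used, so this particular rule and the name `fill_single_dropouts` are assumptions chosen for illustration. The rule fills a lone noise label (0) whose two neighbors carry the same speech label, and it never removes a speech label, so smoothing cannot cause speech to be denoised.

```python
def fill_single_dropouts(labels):
    """Fill isolated 0 labels that sit between two equal non-zero labels.

    Labels follow the convention of the text: 0 noise, 1 unvoiced, 2 voiced.
    Decisions are made on the original labels, so fills do not cascade.
    """
    smoothed = list(labels)
    for i in range(1, len(labels) - 1):
        left, mid, right = labels[i - 1], labels[i], labels[i + 1]
        if mid == 0 and left == right and left != 0:
            smoothed[i] = left
    return smoothed


if __name__ == "__main__":
    v = [0, 1, 1, 0, 1, 2, 2, 0, 0]
    print(fill_single_dropouts(v))
```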

What is not apparent from plot 1000 is that the PSAD system functions as an automatic backup for the NAVSAD system. If the sensor or the NAVSAD system fails for any reason, voiced speech will be detected as unvoiced, because it has the same spatial relationship to the microphones as unvoiced speech. The voiced speech is then misclassified as unvoiced, but denoising still does not take place, which preserves the quality of the speech signal.
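The backup behavior can be made concrete with a small decision function. This is a sketch under assumptions: the per-window inputs `navsad_voiced` and `psad_speech` are hypothetical names for the outputs of the two detectors, and a failed sensor is modeled as `navsad_voiced` being `None` or `False`.

```python
NOISE, UNVOICED, VOICED = 0, 1, 2


def classify_window(navsad_voiced, psad_speech):
    """Combine the two detectors into the three-valued voicing signal V.

    navsad_voiced: True when the non-acoustic sensor path reports voicing;
                   None or False when it reports nothing or has failed.
    psad_speech:   True when the microphone gain difference indicates speech.
    """
    if navsad_voiced:
        return VOICED
    if psad_speech:
        # Covers unvoiced speech, and voiced speech when NAVSAD has failed.
        return UNVOICED
    return NOISE


def should_denoise(v):
    """Denoising is performed only for windows classified as noise (V = 0)."""
    return v == NOISE


if __name__ == "__main__":
    v = classify_window(navsad_voiced=None, psad_speech=True)
    print(v, should_denoise(v))
```

With a failed sensor and speech present at the microphones, the window is labeled unvoiced rather than voiced, yet it is still excluded from denoising, which is the property the paragraph above relies on.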

However, this automatic backup for the NAVSAD system functions best in low-noise environments (approximately 10 dB SNR or better), because large amounts of acoustic noise (10 dB SNR or less) quickly overwhelm any acoustic-only unvoiced detector, including the PSAD. This is evident in the difference between the voicing signal data 602 and 1002 shown in plots 600 and 1000 of FIGS. 6 and 10, respectively: the same utterance is spoken, but the data of plot 600 shows no unvoiced speech because the unvoiced speech is undetectable. This is the desired behavior when performing denoising, since unvoiced speech that cannot be detected will not significantly affect the denoising process. Using the Pathfinder system to detect unvoiced speech ensures detection of any unvoiced speech loud enough to distort the denoising.

Turning to hardware considerations, and with reference to FIG. 7, the configuration of the microphones can affect both the change in gain associated with speech and the limits needed to detect speech. In general, each configuration requires testing to determine the proper limits, but tests with two very different microphone configurations showed that the same limits and other parameters work well. The first microphone set had the signal microphone near the mouth and the noise microphone several centimeters away at the ear. The second configuration placed the noise and signal microphones back-to-back within a few centimeters of the mouth. The results presented here were obtained with the first microphone configuration, but the results with the other set are virtually identical, so the detection algorithm is relatively robust with respect to microphone placement.

A number of configurations are possible that use the NAVSAD and PSAD systems to detect voiced and unvoiced speech. One configuration uses the NAVSAD system to detect voiced speech together with the PSAD system to detect unvoiced speech; the PSAD also functions as a backup to the NAVSAD system for detecting voiced speech. An alternative configuration uses the NAVSAD system to detect voiced speech together with the PSAD system to detect unvoiced speech; the PSAD also functions as a backup to the NAVSAD system for detecting voiced speech. Another alternative configuration uses the PSAD system to detect both voiced and unvoiced speech.

Although the systems above have been described with reference to separating voiced and unvoiced speech from background acoustic noise, there is no reason why more complex classifications cannot be made. For a more in-depth characterization of speech, the system can bandpass the information from Mic1 and Mic2, making it possible to see which bands of the Mic1 data are composed more heavily of noise and which are weighted more toward speech. With this knowledge, utterances can be grouped by their spectral characteristics in a manner similar to conventional acoustic methods, and this method works better in noisy environments.

As an example, the "k" in "kick" has significant frequency content from 500 Hz to 4000 Hz, whereas the "sh" in "she" contains significant energy only from 1700 Hz to 4000 Hz. Voiced speech can be classified in a similar manner. For instance, /i/ ("ee") has significant energy around 300 Hz and 2500 Hz, and /a/ ("ah") has energy around 900 Hz and 1200 Hz. This ability to discriminate unvoiced and voiced speech in the presence of noise is therefore very useful.
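The band-based comparison described above can be illustrated by measuring how much of a frame's energy falls inside a given frequency band. The sketch below uses a direct discrete Fourier transform for clarity rather than speed; the function name `band_energy`, the 8000 Hz sampling rate and the test tone are assumptions made for the example, while the band edges are the ones quoted in the text for "sh".

```python
import cmath
import math


def band_energy(x, fs, f_lo, f_hi):
    """Sum of squared DFT magnitudes for bins between f_lo and f_hi (Hz).

    Uses a direct O(N^2) DFT over the non-negative frequency bins, which is
    adequate for short frames in an illustration.
    """
    n = len(x)
    total = 0.0
    for k in range(n // 2 + 1):
        freq = k * fs / n
        if f_lo <= freq <= f_hi:
            coeff = sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            total += abs(coeff) ** 2
    return total


if __name__ == "__main__":
    fs = 8000                      # assumed sampling rate in Hz
    n = 80                         # frame length giving 100 Hz bin spacing
    tone = [math.sin(2 * math.pi * 3000 * t / fs) for t in range(n)]
    high = band_energy(tone, fs, 1700, 4000)   # band quoted for "sh"
    low = band_energy(tone, fs, 500, 1600)     # lower part of the "k" band
    print(high > low)
```

Comparing such band energies between Mic1 and Mic2 frames is one possible way to see which bands are dominated by speech and which by noise.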

Each of the steps depicted in the flowcharts presented herein can itself include a sequence of operations that need not be described herein. Those skilled in the relevant art can create routines, algorithms, source code, microcode, and program logic arrays, or otherwise implement the invention, based on the flowcharts and the detailed description provided herein. The routines described herein can be provided in one or more of the following forms, or in a combination of them: stored in non-volatile memory that forms part of an associated processor; implemented using conventional programmed logic arrays or circuit elements; stored on removable media such as disks; downloaded from a server and stored locally at a client; or hardwired or preprogrammed in chips such as EEPROM semiconductor chips, application-specific integrated circuits (ASICs), or digital signal processing (DSP) integrated circuits.

The information presented herein is well known or is described in detail in the related applications. Indeed, much of the detailed description provided herein is explicitly disclosed in the related applications. Most or all of the additional material concerning aspects of the invention will be understood by those skilled in the relevant art as being inherent in the detailed description provided in the related applications, or as being well known in the art. Those skilled in the relevant art can implement aspects of the invention based on the material presented herein and the detailed description provided in the related applications.

The teachings of the invention provided herein are not limited to the speech signal processing described above and can be applied to other signal processing systems. Moreover, the elements and steps of the various embodiments described above can be combined to provide further embodiments.

Claims

A system for detecting voiced and unvoiced speech in acoustic signals having varying levels of background noise, the system comprising:

at least two microphones that receive the acoustic signals; and

one or more processors coupled to the microphones,

wherein the one or more processors:

generate a difference parameter between the acoustic signals received at each of the two microphones, wherein the difference parameter represents the relative difference in signal gain between portions of the received acoustic signals;

classify information of the acoustic signals as unvoiced speech when the difference parameter exceeds a first limit; and

classify information of the acoustic signals as voiced speech when the difference parameter exceeds a second limit.

A method for detecting voiced and unvoiced speech in acoustic signals having varying levels of background noise, the method comprising:

receiving the acoustic signals at two receivers;

generating a difference parameter between the acoustic signals received at each of the two receivers, wherein the difference parameter represents the relative difference in signal gain between portions of the received acoustic signals;

classifying information of the acoustic signals as unvoiced speech when the difference parameter exceeds a first limit; and

classifying information of the acoustic signals as voiced speech when the difference parameter exceeds a second limit.

The method of claim 2, further comprising:

generating the first and second limits using standard deviations corresponding to the generation of the difference parameter.

The method of claim 2, further comprising:

classifying information of the acoustic signals as noise when the difference parameter is less than the first limit; and

performing denoising on the information classified as noise.

The method of claim 2, further comprising:

receiving physiological information related to human speech activity,

wherein the physiological information is received using one or more detectors selected from radio frequency (RF) devices, electroglottographs, ultrasound devices, acoustic throat microphones, and airflow detectors.

A system for detecting voiced and unvoiced speech in acoustic signals having varying levels of background noise, the system comprising:

at least two microphones that receive the acoustic signals;

one or more speech sensors that receive physiological information related to human speech activity; and

at least one processor coupled to the microphones and the speech sensors,

wherein the at least one processor:

generates cross-correlation data between the physiological information and the acoustic signal received at one of the two microphones;

classifies information of the acoustic signals as voiced speech when the cross-correlation data corresponding to a portion of the acoustic signal received at the one microphone exceeds a correlation limit;

generates a difference parameter between the acoustic signals received at each of the two microphones, wherein the difference parameter represents the relative difference in signal gain between portions of the received acoustic signals;

classifies information of the acoustic signals as unvoiced speech when the difference parameter exceeds a gain limit; and

classifies information of the acoustic signals as noise when the difference parameter is less than the gain limit.

A method for removing noise from acoustic signals, the method comprising the steps of:

receiving the acoustic signals at two receivers and receiving physiological information related to human speech activity at a speech sensor; and

generating cross-correlation data between the physiological information and the acoustic signal received at one of the two receivers.