KR101339578B1

KR101339578B1 - Method and apparatus for recognizing speech

Info

Publication number: KR101339578B1
Application number: KR1020120050706A
Authority: KR
Inventors: 고한석; 박진수; 문성규
Original assignee: 고려대학교 산학협력단
Priority date: 2012-05-14
Filing date: 2012-05-14
Publication date: 2013-12-10
Also published as: KR20130127077A

Abstract

The present invention relates to a speech recognition method and apparatus that detect a speech section using a frequency transform and an origin-symmetric filter in an environment where noise and speech are mixed.
A speech recognition method according to an embodiment of the present invention may include: receiving a time-domain audio input signal in which noise and a speech signal are mixed; converting the time-domain audio input signal into a frequency-domain signal; extracting a feature value of the audio input signal using the ratio of a speech presence probability to a speech absence probability, wherein the speech presence probability and the speech absence probability are obtained using the converted frequency-domain signal; and detecting a speech section by applying an origin-symmetric filter to the extracted feature value.

Description

Method and apparatus for recognizing speech

The present invention relates to a speech recognition method and apparatus, and more particularly to a speech recognition method and apparatus that detect a speech section using a frequency transform and an origin-symmetric filter in an environment where noise and speech are mixed.

In general, speech recognition refers to technology in which a computer interprets and processes language spoken by a person. Such technology appears to perform well in a laboratory, home, or office with little ambient noise, but its recognition rate drops markedly on roads or in corridors, exhibition halls, conference halls, and other places with heavy ambient noise. This is because ambient noise and the human voice are not separated effectively.

Meanwhile, speech section detection has a large influence on speech recognition performance. Passing only the signal of the speech section to the recognizer shortens the time required for recognition and reduces the possibility that noise present in non-speech sections degrades recognition performance. Various speech section detectors have been developed, but most of them do not perform well when the signal-to-noise ratio (SNR) is -5 dB or lower. In such a noise environment, the user's speech and the noise enter the system's microphone together, which makes it difficult to detect the speech section accurately.

Therefore, research is needed on technology that can detect speech sections precisely even in a noisy environment.

An object of the present invention is to provide a speech recognition method and apparatus that can accurately detect a speech section even in a noisy environment.

Another object of the present invention is to provide a speech recognition method and apparatus that can detect a speech section using the ratio of a speech presence probability to a speech absence probability, both of which can be obtained from a frequency-domain signal.

To achieve the above objects, an embodiment of the present invention provides a speech recognition method including: receiving a time-domain audio input signal in which noise and a speech signal are mixed; converting the time-domain audio input signal into a frequency-domain signal; extracting a feature value of the audio input signal using the ratio of a speech presence probability to a speech absence probability, wherein the speech presence probability and the speech absence probability are obtained using the converted frequency-domain signal; and detecting a speech section by applying an origin-symmetric filter to the feature value.

To achieve the above objects, an embodiment of the present invention also provides a speech recognition apparatus including: a receiver that receives a time-domain audio input signal in which noise and a speech signal are mixed; a converter that converts the time-domain audio input signal into a frequency-domain signal; a feature value extractor that extracts a feature value of the audio input signal using the ratio of a speech presence probability to a speech absence probability, wherein the speech presence probability and the speech absence probability are obtained using the converted frequency-domain signal; a speech section detector that detects a speech section by applying an origin-symmetric filter to the extracted feature value; and a controller that controls the receiver, the converter, the feature value extractor, and the speech section detector.

The speech recognition method and apparatus according to an embodiment of the present invention extract the feature value of the audio input signal using the ratio of the speech presence probability to the speech absence probability, both of which can be obtained from a frequency-domain signal, so that the feature values of speech sections can be detected distinctly from those of noise sections.

In addition, according to an embodiment of the present invention, using the ratio of the speech presence probability to the speech absence probability together with the origin-symmetric filter makes it possible to detect speech sections accurately even in a noisy environment.

FIG. 1 is a block diagram of a speech recognition apparatus according to an embodiment of the present invention.
FIG. 2 is a flowchart of a speech recognition method according to an embodiment of the present invention.
FIG. 3 is a graph showing patterns of an audio input signal obtained through frequency conversion according to an embodiment of the present invention.
FIG. 4 is a graph showing an example of an origin-symmetric filter according to an embodiment of the present invention.
FIG. 5 is a diagram for describing a model for speech section detection according to an embodiment of the present invention.
FIG. 6 is a graph showing speech section detection results obtained through a speech recognition method according to an embodiment of the present invention.

Hereinafter, a speech recognition method and apparatus according to an embodiment of the present invention are described with reference to the drawings.

Throughout the specification, when a part is said to "include" a component, this means that, unless specifically stated otherwise, it may further include other components rather than excluding them. In addition, terms such as "... unit" and "module" used in the specification refer to a unit that handles at least one function or operation, which may be implemented in hardware, in software, or in a combination of hardware and software.

FIG. 1 is a block diagram of a speech recognition apparatus according to an embodiment of the present invention.

As illustrated, the speech recognition apparatus 100 may include a receiver 110, a converter 120, a feature value extractor 130, a speech section detector 140, and a controller 150. However, not all of the illustrated components are essential. The speech recognition apparatus 100 may be configured with more components than those illustrated, or with fewer. The speech recognition apparatus 100 may be applied to various electronic products such as personal mobile terminals, speech recognition TVs, interactive robots, voice-based car interfaces (for example, car navigation), and computers.

The receiver 110 may receive an audio input signal. The audio input signal is the signal input to the speech recognition apparatus for speech detection, and may be a signal in which noise and speech are mixed. For example, a human voice uttered in a noisy environment may be input to the receiver 110. The audio input signal may be a time-domain signal, that is, a signal expressed with time as its variable. For example, the audio input signal can be expressed as a function of time.

The converter 120 may convert the time-domain audio input signal into a frequency-domain signal. Transform techniques such as the fast Fourier transform (FFT) and the discrete cosine transform (DCT) may be used for the conversion.

The feature value extractor 130 may extract a feature value of the audio input signal using the audio input signal converted into the frequency domain. The feature value is a numerical value that represents the information in the audio input signal well.

The speech section detector 140 may detect a speech section by applying a predetermined filter to the extracted feature values of the audio input signal. The predetermined filter may include an origin-symmetric filter, which cancels out nearly constant values so that the result is close to 0. The origin-symmetric filter may act as an edge filter. An edge filter detects edges in the audio signal, where an edge corresponds to the contour of the feature values, that is, a boundary at which the value changes abruptly.

In addition, after applying the origin-symmetric filter to the feature values, the speech section detector 140 may detect the start point and the end point of the speech section using a preset upper threshold and lower threshold.

The controller 150 may control the overall operation of the receiver 110, the converter 120, the feature value extractor 130, and the speech section detector 140.

FIG. 2 is a flowchart of a speech recognition method according to an embodiment of the present invention.

The receiver 110 may receive an audio input signal (S210). The audio input signal may be a signal in which noise and speech are mixed, and it is a time-domain signal. The audio input signal can therefore be expressed as a function of time.

The converter 120 may convert the audio input signal, expressed as a function of time, into a frequency-domain signal (S220). The conversion is performed because a time-axis signal has, in principle, an unbounded amount of data extending over both past and future time, which makes the signal difficult to analyze directly.

Transform techniques such as the fast Fourier transform (FFT) and the discrete cosine transform (DCT) may be used for the conversion into a frequency-domain signal.

The embodiment below describes conversion into the frequency domain using the fast Fourier transform (FFT).

The time-axis audio input signal may be converted into a frequency-axis signal using the FFT, as shown in Equation 1.

Here, the frequency index is the FFT bin index, and M denotes the FFT size. For a given bin index, the value given by Equation 1 is taken, and its absolute value represents the magnitude at that frequency index. Next, treating the magnitudes obtained from the first FFT as if they lay on a time axis, a second FFT can be computed as in Equation 2. In addition, j denotes the imaginary unit and N denotes the period of the signal.

Here, the index denotes the FFT bin index. As an example, the FFT size is taken to be 256 samples. This FFT process is a suitable way to compute the fundamental frequency of the signal and to analyze harmonic components, which appear as successive peaks in the magnitude spectrum.
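The two-stage transform described above can be sketched as follows. This is a minimal illustration, not the patent's Equations 1 and 2, which are not reproduced in this text. The function names `dft_magnitude` and `double_fft_magnitude` and the synthetic test frame are assumptions, and a naive DFT from the Python standard library stands in for an FFT routine (it computes the same values, only more slowly).

```python
import cmath
import math


def dft_magnitude(x):
    """Return |DFT(x)| for every bin, using a naive O(M^2) transform."""
    m = len(x)
    return [
        abs(sum(x[n] * cmath.exp(-2j * math.pi * k * n / m) for n in range(m)))
        for k in range(m)
    ]


def double_fft_magnitude(frame):
    """First transform: magnitude spectrum of the frame.
    Second transform: treat that magnitude spectrum as if it were a
    time-axis signal and transform it again."""
    first = dft_magnitude(frame)
    return dft_magnitude(first)


# Hypothetical 256-sample frame with three harmonics of one fundamental.
M = 256
frame = [
    sum(math.sin(2 * math.pi * 8 * h * n / M) for h in (1, 2, 3))
    for n in range(M)
]
second = double_fft_magnitude(frame)
```

Because the harmonics are evenly spaced in the first magnitude spectrum, the second transform concentrates that regular spacing into a few strong bins, which is the kind of periodic structure the text attributes to speech.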

FIG. 3 is a graph showing patterns of an audio input signal obtained through frequency conversion according to an embodiment of the present invention.

FIG. 3(a) shows the magnitudes obtained by applying the first FFT to a signal in which noise is added to speech and to a noise signal. As FIG. 3(a) shows, the noisy speech signal and the noise signal may not be easy to tell apart.

However, as FIG. 3(b) shows, when one more FFT is applied to the result of FIG. 3(a), the noisy speech signal and the noise signal become clearly distinguishable, owing to the harmonic components that fluctuate periodically in sections where speech is present.

That is, according to an embodiment of the present invention, when converting the audio input signal into a frequency-domain signal, the converter 120 may perform the FFT twice to convert the time-domain signal into the frequency-domain signal.

In the graphs, "Noise signal" denotes the noise signal and "Noisy signal" denotes the signal in which noise and speech are mixed. The noise signal may be estimated as the average over preset frames of the audio input signal (for example, the first 10 frames), because speech is unlikely to be present at the very beginning of the audio input signal.
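The noise estimate described above can be sketched as a per-bin average over the leading frames. The function name `estimate_noise_spectrum`, its parameters, and the toy data are assumptions made for illustration; the text only specifies that preset frames, such as the first 10, are averaged.

```python
def estimate_noise_spectrum(frames, num_noise_frames=10):
    """Average the leading frames bin by bin to estimate the noise spectrum.

    frames: list of per-frame magnitude vectors, all of the same length.
    """
    head = frames[:num_noise_frames]
    if not head:
        raise ValueError("at least one frame is required")
    num_bins = len(head[0])
    return [sum(f[k] for f in head) / len(head) for k in range(num_bins)]


# Toy data: two quiet leading frames followed by one loud frame.
frames = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
noise = estimate_noise_spectrum(frames, num_noise_frames=2)
```

Only the leading frames contribute, so the loud third frame in the toy data does not affect the estimate.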

The feature value extractor 130 may extract a feature value using the frequency-domain signal obtained by the converter 120 (S230). The feature value extractor 130 may extract the feature value of the audio input signal using the ratio of the speech presence probability to the speech absence probability. In this embodiment, the speech signal and the noise signal are assumed to follow independent Gaussian random processes.

The method of extracting the feature value of the audio input signal is described in detail below.

The L-dimensional vector obtained through the two FFTs can be expressed as in Equation 3.

Here, Y denotes the magnitude at each frequency index of the signal in which noise is added to speech, S denotes the magnitude of the clean speech signal, and N denotes the magnitude of the noise signal.

If one hypothesis states that the current input frame is a speech section and the other states that it is a non-speech section, the two hypotheses can be expressed as in Equation 4.

The conditional probability density functions of generating the observation Y under each of the two hypotheses can be expressed by Equations 5 and 6.

That is, under the hypothesis that speech is absent, the observation Y consists of N (noise) alone. Equation 5 therefore gives the probability that speech is absent under the hypothesis that speech is absent (for convenience, the "speech absence probability"). The larger its value, the higher the probability that no speech is present.

Likewise, under the hypothesis that speech is present, the observation Y is S (speech) + N (noise). Equation 6 gives the probability that speech is present under the hypothesis that speech is present (for convenience, the "speech presence probability"). A large value can be interpreted as a high probability that speech is present.

According to an embodiment of the present invention, a feature value of the audio input signal is extracted using the ratio of the speech presence probability to the speech absence probability, so that the presence or absence of speech can be determined.

For example, the feature value extractor 130 may extract the feature value of the audio input signal using the log-likelihood ratio of the speech presence probability to the speech absence probability, expressed by Equation 7.

In a noise section (speech absence section), the features of the audio input signal and those of the estimated noise (for example, the average of the first 10 frames of the audio input signal) have similar patterns, so the log-likelihood ratio is small. In a speech section, the pattern of the features changes because of the speech, so the log-likelihood ratio can be large. This property makes it possible to distinguish speech absence sections from speech sections.
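The behaviour of the log-likelihood ratio can be illustrated with a short sketch. Equations 5 to 7 are not reproduced in this text, so the formula below is the textbook likelihood ratio for a complex Gaussian model, which may differ from the patent's exact expression. The function name `log_likelihood_ratio`, the maximum-likelihood estimate of the a priori SNR, and the toy magnitudes are all assumptions.

```python
import math


def log_likelihood_ratio(frame_mag, noise_mag, eps=1e-12):
    """Mean per-bin log-likelihood ratio of speech presence to speech absence.

    Assumed model (not taken from the patent): per bin, with a posteriori
    SNR gamma = |Y|^2 / noise_power and a priori SNR xi,
        log LR = gamma * xi / (1 + xi) - log(1 + xi).
    xi is replaced by its maximum-likelihood estimate max(gamma - 1, 0).
    """
    total = 0.0
    for y, n in zip(frame_mag, noise_mag):
        gamma = (y * y) / (n * n + eps)
        xi = max(gamma - 1.0, 0.0)
        total += gamma * xi / (1.0 + xi) - math.log(1.0 + xi)
    return total / len(frame_mag)


noise_mag = [1.0, 1.0, 1.0, 1.0]                                  # estimated noise
llr_noise = log_likelihood_ratio([1.0, 1.0, 1.0, 1.0], noise_mag)  # frame looks like noise
llr_speech = log_likelihood_ratio([2.0, 3.0, 2.0, 1.0], noise_mag)  # frame with extra energy
```

A frame that matches the noise estimate yields a ratio of zero, while a frame with additional energy yields a clearly positive value, which matches the qualitative behaviour described in the text.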

The speech section detector 140 may detect a speech section by applying a predetermined filter to the feature values extracted by the feature value extractor 130 (S240). An example of the predetermined filter is an origin-symmetric filter, which may act as an edge filter. For example, the origin-symmetric filter may be expressed as in Equation 8.

FIG. 4 shows the origin-symmetric filter expressed by Equation 8. Here, W is a variable related to the filter length, i is an integer from -W to W, and A and K₁ to K₆ are filter parameters. The graph shows the filter response for W = 7, with A = 0.41, K₁ = 1.538, K₂ = 1.468, K₃ = -0.078, K₄ = -0.036, K₅ = -0.872, and K₆ = -0.56.

The speech section detector 140 may apply the origin-symmetric filter to the extracted feature values to obtain a filter output, which can be expressed as in Equation 9.

Here, n denotes the frame index. When the filter output is computed according to Equation 9, the output stays close to 0 as long as the feature value in a non-speech section (speech absence section) is constant, regardless of its magnitude. When speech is present and the feature value rises, the output also rises; when the feature value falls, the filter output falls as well.
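The cancelling property of an origin-symmetric filter can be sketched with a simple odd-symmetric kernel. The taps below, h(i) = i / Σi², are hypothetical and are not the filter of Equation 8, whose form is not reproduced in this text. The sum over h(i)·x(n+i) and the edge clamping are likewise one plausible reading of Equation 9, not a confirmed one.

```python
def origin_symmetric_filter(w):
    """Hypothetical odd-symmetric taps for i = -w..w, so that h(-i) == -h(i)."""
    norm = sum(i * i for i in range(-w, w + 1))
    return {i: i / norm for i in range(-w, w + 1)}


def apply_filter(features, taps):
    """Compute out[n] = sum_i h(i) * features[n + i], clamping indices at the edges."""
    last = len(features) - 1
    out = []
    for n in range(len(features)):
        acc = 0.0
        for i, h in taps.items():
            idx = min(max(n + i, 0), last)  # repeat the edge value outside the signal
            acc += h * features[idx]
        out.append(acc)
    return out


taps = origin_symmetric_filter(7)
flat = apply_filter([3.0] * 30, taps)                 # constant feature value
step = apply_filter([0.0] * 15 + [5.0] * 15, taps)    # feature value jumps at frame 15
```

With these taps a constant input cancels to approximately zero regardless of its level, and a rising step produces a positive peak around the frame where the feature value jumps. A falling step would produce a negative peak in the same way.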

According to an embodiment of the present invention, the speech section detector 140 may detect the start point and the end point of a speech section through state transitions driven by the filter output. That is, the speech section detector 140 may detect the start point and the end point of the speech section using a state transition model.

FIG. 5 shows an example of a model for detecting the start point and the end point of a speech section.

In FIG. 5, Silence denotes a non-speech section (speech absence section), and In speech denotes a speech section. Leaving speech denotes a section that is still a speech section but may turn into a non-speech section. The lower threshold T_L, the upper threshold T_U, and Gap, a tolerance used to decide the end point, are constants determined experimentally. The upper threshold is assumed to be greater than the lower threshold.

Using the state transition model shown in FIG. 5, the speech section detector 140 determines that the signal is in a non-speech section (Silence) while the filter output is below T_U, and determines that a speech section has started when the filter output rises above T_U. The speech section detector 140 may then detect that point as the start point of the speech section (In speech).

Meanwhile, when the filter output falls below T_L, the speech section detector 140 determines that the signal is still in a speech section but may turn into a non-speech section (Leaving speech), and sets Count to 0. Count is the number of consecutive frames in which the filter output stays between T_L and T_U. The speech section detector 140 keeps the Leaving speech state while Count is smaller than Gap, and determines Silence when Count is greater than Gap. The frame at which Silence is determined becomes the end point of the speech section. If the filter output falls below T_L again, the speech section detector 140 resets Count to 0 and keeps the frame in the Leaving speech state. If the filter output rises above T_U, the frame is again regarded as In speech.
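The state transitions described above can be sketched as a small state machine. This is one reading of the description, not a reproduction of FIG. 5. The function name `detect_segments`, the state labels, the threshold values in the example, and the choice to close an unfinished section at the last frame are assumptions.

```python
def detect_segments(outputs, t_upper, t_lower, gap):
    """Return (start_frame, end_frame) pairs found by the three-state model.

    outputs: filter output per frame; t_upper > t_lower; gap: end-point tolerance.
    """
    state = "silence"
    count = 0
    start = None
    segments = []
    for n, f in enumerate(outputs):
        if state == "silence":
            if f > t_upper:            # rising edge: a speech section starts
                state, start = "in_speech", n
        elif state == "in_speech":
            if f < t_lower:            # falling edge: speech may be ending
                state, count = "leaving_speech", 0
        else:  # leaving_speech
            if f > t_upper:            # speech resumes
                state = "in_speech"
            elif f < t_lower:          # another falling edge: restart the count
                count = 0
            else:
                count += 1             # output stays between the thresholds
                if count > gap:        # quiet long enough: the section ends here
                    segments.append((start, n))
                    state = "silence"
    if state != "silence":             # assumption: close an open section at the last frame
        segments.append((start, len(outputs) - 1))
    return segments


outputs = [0, 0, 5, 0, 0, -5, 0, 0, 0, 0, 0]
segments = detect_segments(outputs, t_upper=2, t_lower=-2, gap=2)
```

In the example, the positive peak at frame 2 opens a section, the negative peak at frame 5 moves the model to Leaving speech, and the section closes at frame 8 once the output has stayed between the thresholds for more than Gap frames.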

With the state transition model shown in FIG. 5, precise speech section detection is possible by adjusting the upper and lower thresholds according to the noise level.

FIG. 6 is a graph showing speech section detection results obtained through a speech recognition method according to an embodiment of the present invention.

FIG. 6 shows actual results of detecting a speech section, using the log-likelihood ratio and the origin-symmetric filter, for an audio input signal in a car noise environment (an environment in which noise is present inside a car).

FIG. 6(a) shows the sample-level feature values in the car noise environment (an environment in which noise and speech are mixed), and FIG. 6(b) shows the feature values of the speech signal in a clean, noise-free environment. More specifically, FIG. 6(a) shows the feature values of an audio input signal obtained by mixing noise into the clean speech signal of FIG. 6(b).

FIG. 6(c) is a graph of the feature values extracted from the audio input signal using Equation 7. Because the feature values in FIG. 6(c) are extracted using the log-likelihood ratio, they stay close to 0 in non-speech sections and become large in speech sections.

Because the feature values of the speech section in FIG. 6(c) appear similar to FIG. 6(b), which was obtained manually, the speech section can be predicted even without applying the edge filter.

FIG. 6(d) is a graph showing the result of applying the origin-symmetric filter of Equation 8 to the feature values of the audio input signal shown in FIG. 6(c). As FIG. 6(d) confirms (the speech section indicated in FIG. 6(d) is similar to that indicated in FIG. 6(b)), speech section detection according to an embodiment of the present invention is accurate even in a noisy environment.

상술한 음성 인식 방법은 다양한 컴퓨터 수단을 통하여 수행될 수 있는 프로그램 명령 형태로 구현되어 컴퓨터로 판독 가능한 기록 매체에 기록될 수 있다. 이때, 컴퓨터로 판독 가능한 기록매체는 프로그램 명령, 데이터 파일, 데이터 구조 등을 단독으로 또는 조합하여 포함할 수 있다. 한편, 기록매체에 기록되는 프로그램 명령은 본 발명을 위하여 특별히 설계되고 구성된 것들이거나 컴퓨터 소프트웨어 당업자에게 공지되어 사용 가능한 것일 수도 있다.The above-described speech recognition method may be implemented in the form of program instructions that can be executed by various computer means and recorded in a computer-readable recording medium. At this time, the computer-readable recording medium may include program commands, data files, data structures, and the like, alone or in combination. On the other hand, the program instructions recorded on the recording medium may be those specially designed and configured for the present invention or may be available to those skilled in the art of computer software.

컴퓨터로 판독 가능한 기록매체에는 하드 디스크, 플로피 디스크 및 자기 테이프와 같은 자기 매체(Magnetic Media), CD-ROM, DVD와 같은 광기록 매체(Optical Media), 플롭티컬 디스크(Floptical Disk)와 같은 자기-광 매체(Magneto-Optical Media), 및 롬(ROM), 램(RAM), 플래시 메모리 등과 같은 프로그램 명령을 저장하고 수행하도록 특별히 구성된 하드웨어 장치가 포함된다. The computer-readable recording medium includes a magnetic recording medium such as a magnetic medium such as a hard disk, a floppy disk and a magnetic tape, an optical medium such as a CD-ROM and a DVD, a magnetic disk such as a floppy disk, A magneto-optical media, and a hardware device specifically configured to store and execute program instructions such as ROM, RAM, flash memory, and the like.

The recording medium may also be a transmission medium, such as an optical or metal line or a waveguide, that carries a carrier wave transmitting signals specifying program instructions, data structures, and the like.

The program instructions include not only machine code, such as that produced by a compiler, but also high-level language code that a computer can execute using an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules to perform the operations of the present invention, and vice versa.

The speech recognition method and apparatus described above are not limited to the configurations and methods of the embodiments described; all or part of the embodiments may be selectively combined so that various modifications can be made.

100: speech recognition device
110: receiver
120: converter
130: feature value extractor
140: speech section detector
150: controller

Claims

A speech recognition method comprising:
receiving a time-domain audio input signal in which noise and a speech signal are mixed;
converting the time-domain audio input signal into a frequency-domain signal;
extracting a feature value of the audio input signal using a ratio of a speech presence probability to a speech absence probability, wherein the speech presence probability and the speech absence probability are obtained using the converted frequency-domain signal; and
detecting a speech section by applying an origin-symmetric filter to the extracted feature value,
wherein the converting into the frequency-domain signal performs two fast Fourier transforms on the time-domain audio input signal.

The method of claim 1, wherein detecting the speech section comprises:
obtaining an output signal by applying the origin-symmetric filter to the extracted feature value of the audio input signal; and
detecting a start point and an end point of the speech section using an upper limit value and a lower limit value set for the obtained output signal.

The method of claim 2, wherein the end point of the speech section is detected based on the number of consecutive times the output signal lies between the upper limit value and the lower limit value.

(Deleted)

The method of claim 1, wherein the speech absence probability is obtained using a feature value of a predetermined frame of the converted frequency-domain signal.

A speech recognition apparatus comprising:
a receiver configured to receive a time-domain audio input signal in which noise and a speech signal are mixed;
a converter configured to convert the time-domain audio input signal into a frequency-domain signal;
a feature value extractor configured to extract a feature value of the audio input signal using a ratio of a speech presence probability to a speech absence probability, wherein the speech presence probability and the speech absence probability are obtained using the converted frequency-domain signal;
a speech section detector configured to detect a speech section by applying an origin-symmetric filter to the extracted feature value; and
a controller configured to control the receiver, the converter, the feature value extractor, and the speech section detector,
wherein the converter performs two fast Fourier transforms on the time-domain audio input signal.

The apparatus of claim 6, wherein the speech section detector obtains an output signal by applying the origin-symmetric filter to the extracted feature value of the audio input signal, and detects a start point and an end point of the speech section using an upper limit value and a lower limit value set for the obtained output signal.

The apparatus of claim 7, wherein the end point of the speech section is detected based on the number of consecutive times the output signal lies between the upper limit value and the lower limit value.

(Deleted)

The apparatus of claim 6, wherein the speech absence probability is obtained using a feature value of a predetermined frame of the converted frequency-domain signal.