KR100345402B1

KR100345402B1 - An apparatus and method for real-time speech detection using pitch information

Info

Publication number: KR100345402B1
Application number: KR1019990050318A
Authority: KR
Inventors: 이항섭; 이영직
Original assignee: 한국전자통신연구원
Priority date: 1999-11-12
Filing date: 1999-11-12
Publication date: 2002-07-26
Also published as: KR20010046522A

Abstract

An object of the present invention is to provide an apparatus and a method that use pitch information, one of the characteristic features of speech, to detect in real time only the speech in an input signal, for systems that take speech as input such as speech recognition and speech coding systems.

According to the present invention, there is provided a speech detection apparatus comprising a speech input unit to which a speech signal is input, an A/D converter that converts the analog speech signal received by the speech input unit into a digital speech signal, a start point detector that detects the start point of speech in the digital speech signal produced by the A/D converter, and an end point detector that detects the end point of speech in that digital speech signal, wherein the start point detector comprises: frame generation means for generating frames of a predetermined length from the digital speech signal produced by the A/D converter; threshold setting means for extracting samples from the leading frames among the frames generated by the frame generation means and determining a pitch threshold using only the energy values of the extracted samples; and detection means for detecting the start point of speech from subsequently input frames of the speech signal using the threshold values determined by the threshold setting means.

Description

An apparatus and method for real-time speech detection using pitch information

The present invention relates to an apparatus and a method for detecting, in real time, only the speech in the input signal of a system that takes speech as input, such as a speech recognition or speech coding system. In particular, it relates to an apparatus and a method that use pitch information, one of the characteristic features of speech, to determine accurately whether speech is present and to detect the speech portion even in a signal in which noise and speech coexist.

In general, accurately separating the pure speech portion of a speech signal from the surrounding noise is very important in almost every area of speech engineering, including speech recognition, speech coding, and speech analysis. As speech recognition and coding technologies have recently been applied in everyday life, speech detection, which plays an important role in determining the performance of these functions, has become increasingly important.

Traditionally, speech portions have been detected by methods that use time-domain parameters, methods that use frequency-domain parameters, and methods that use both domains together.

Methods based on time-domain parameters mainly use short-term energy and the zero-crossing rate. Methods based on frequency-domain parameters use a filter bank or similar means to isolate only the specific frequency band in which human speech is present. Methods that combine time-domain and frequency-domain parameters first isolate a specific frequency band and then extract time-domain parameters from it.
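The two time-domain parameters named above can be sketched as follows. This is a minimal illustration, not taken from the patent: the function names are hypothetical, energy is assumed to be the sum of squared sample values, and a zero crossing is assumed to be a sign change between adjacent samples.

```python
def short_term_energy(frame):
    """Short-term energy of one frame, assumed here to be the sum of squares."""
    return sum(x * x for x in frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ (zero counts as positive)."""
    if len(frame) < 2:
        return 0.0
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)
```

Both values need only one pass over the frame, which is why time-domain methods are attractive for real-time use.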

Each of these methods has its own advantages and disadvantages. Methods based on time-domain parameters are computationally simple and easy to implement, but their performance varies greatly with the level of the input signal and the signal-to-noise ratio (SNR).

In contrast, methods that use frequency-domain parameters, or both time-domain and frequency-domain parameters, detect speech better at low SNR than methods that use time-domain parameters alone, but their computations are complex and time-consuming, and the nature of the algorithms makes real-time implementation difficult.

Accordingly, the speech detection algorithms used in current speech recognition systems are chosen according to the characteristics of the system: methods using frequency-domain parameters are used where noise is present, real-time speech detection is not required, and accurate end point detection is needed, whereas methods using time-domain parameters are used where real-time speech detection is required in a relatively quiet environment.

As described above, a speech detector that operates in real time mainly uses time-domain parameters, the most typical of which are short-term energy and the zero-crossing rate. However, the zero-crossing rate cannot properly reflect the characteristics of speech when the high-frequency band is removed and channel noise is present, as on a telephone line. Energy, depending on the magnitude of the threshold, may fail to detect sounds with relatively low energy, such as the plosives 'ㄱ', 'ㅂ', and 'ㅌ' and the sonorants 'ㄴ', 'ㄹ', and 'ㅁ', and may misclassify as speech irregularly occurring non-speech sounds with somewhat higher energy, such as lip noises or tapping on a desk.

To address these problems, a method is sometimes used that obtains a pitch threshold from a stable voiced segment and then detects the start and end points of speech using the pitch change rate and the normalized pitch change rate after that segment. However, because this method obtains the pitch threshold from a stable voiced segment, it cannot be implemented in real time.

A commonly used real-time speech detection method based on frame-by-frame speech/non-speech classification works as follows.

First, an appropriate number of initial input frames are used to determine the threshold values for speech detection. Each subsequently input frame is then classified as a speech frame or a non-speech frame using these thresholds. The start and end points of speech are detected from the classified frames by counting the number of speech or non-speech frames (N1, N2) within a fixed number of consecutive frames (P1, P2). That is, the start point of speech is the point at which N1 frames classified as speech exist within P1 frames, and the end point of speech is the point, after the start point has been detected, at which N2 frames classified as non-speech exist within P2 frames. The values of P1, P2, N1, and N2 may vary with the algorithm and parameters used, as may the method of deciding that a speech portion has been detected.
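The P1/N1 and P2/N2 rule described above can be sketched as a sliding-window count over the per-frame decisions. This is only an illustration under stated assumptions: the function name `find_endpoints` is hypothetical, the reported start or end point is assumed to be the index of the frame at which the count first reaches N1 or N2, and the non-speech window is assumed to start empty once the start point has been found.

```python
from collections import deque


def find_endpoints(speech_flags, p1, n1, p2, n2):
    """Return (start, end) frame indices; either may be None if not found.

    speech_flags: per-frame decisions, truthy for speech frames.
    Start: first frame at which the last p1 frames hold at least n1 speech frames.
    End:   first later frame at which the last p2 frames hold at least n2
           non-speech frames (assumed interpretation of the rule above).
    """
    start = end = None
    window = deque(maxlen=p1)
    for i, flag in enumerate(speech_flags):
        if start is None:
            window.append(bool(flag))
            if sum(window) >= n1:
                start = i
                window = deque(maxlen=p2)  # begin counting non-speech frames
        else:
            window.append(not flag)
            if sum(window) >= n2:
                end = i
                break
    return start, end
```

For example, with `p1=3, n1=2, p2=3, n2=3`, a short burst of one isolated speech frame is not enough to trigger the start point, which is the purpose of counting over a window rather than reacting to single frames.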

Because of the nature of real-time processing, a real-time speech detection method can greatly reduce speech processing time. For example, suppose a system receives and processes a 5-second utterance and its processing time equals the length of the utterance. Without real-time speech detection, the system must receive the entire 5 seconds of speech before processing it, so the result is available only 5 seconds after the speech input ends. With real-time speech detection, processing starts as soon as the start point of speech is detected, so the result can be available immediately after the speech input ends.

The present invention has been devised to solve the problems of the prior art described above. Its object is to provide a speech detection apparatus and method that obtain pitch information, one of the characteristic features of speech, simply in the course of computing the time-domain energy threshold and additionally use it for speech segment detection, thereby accurately detecting speech segments in real time in an input signal containing irregular noise.

FIG. 1 is a flowchart illustrating a real-time speech detection process according to an embodiment of the present invention.

FIG. 2 is a flowchart illustrating the process of classifying a frame as speech or non-speech using the pitch threshold alpha according to an embodiment of the present invention.

To achieve the object described above, the present invention provides a speech detection apparatus comprising a speech input unit to which a speech signal is input, an A/D converter that converts the analog speech signal received by the speech input unit into a digital speech signal, a start point detector that detects the start point of speech in the digital speech signal produced by the A/D converter, and an end point detector that detects the end point of speech in that digital speech signal, wherein the start point detector comprises: frame generation means for generating frames of a predetermined length from the digital speech signal produced by the A/D converter; threshold setting means for extracting samples from the leading frames among the frames generated by the frame generation means and determining a pitch threshold using only the energy values of the extracted samples; and detection means for detecting the start point of speech from subsequently input frames of the speech signal using the threshold values determined by the threshold setting means.

The present invention also provides a speech detection method comprising a speech input step in which a speech signal is input, an A/D conversion step of converting the analog speech signal input in the speech input step into a digital speech signal, a start point detection step of detecting the start point of speech from the digital speech signal converted in the A/D conversion step, and an end point detection step of detecting the end point of speech from that digital speech signal, wherein the start point detection step comprises: a first step of generating frames of a predetermined length from the digital speech signal converted in the A/D conversion step; a second step of extracting samples from the leading frames among the frames generated in the first step and determining a pitch threshold using only the energy values of the extracted samples; and a third step of detecting the start point of speech from subsequently input frames of the speech signal using the threshold values determined in the second step.

The present invention further provides a computer-readable recording medium on which is recorded a program for causing a computer to execute: a first step in which a speech signal is input; a second step of converting the analog speech signal input in the first step into a digital speech signal; a third step of generating frames of a predetermined length from the digital speech signal converted in the second step; a fourth step of extracting samples from the leading frames among the frames generated in the third step, dividing the average frame energy of the extracted samples by the number of extracted samples, multiplying the result by a predetermined weight, and taking the product as the pitch threshold; and a fifth step of classifying a frame as a speech frame if the number of consecutive samples within that frame that exceed the pitch threshold determined in the fourth step is equal to or greater than a predetermined number n.

The present invention aims to improve the accuracy of real-time speech detection by additionally using pitch information, one of the characteristic features of speech, to detect speech portions that cannot be detected accurately with the zero-crossing rate and energy alone in an input signal containing irregularly occurring non-speech sounds. The method is based on the following knowledge from experimental phonetics.

First, most of the power of a human speech signal is concentrated in the low-frequency region below 1 kHz.

Second, pitch, a characteristic of speech and of voiced sounds in particular, is produced by vibration of the vocal cords.

Third, the most frequent errors in speech detection methods that use short-term energy as a threshold occur because the voiced segment whose energy exceeds the threshold is short.

Fourth, pitch is always present in a voiced segment, even when its energy is low.

These points are explained in detail below.

Most of the power of a human speech signal is concentrated in the low-frequency region below 1 kHz, and the pitch that characterizes voiced sounds lies on average between 100 Hz and 250 Hz for male and female speakers combined. This pitch information is an important feature that distinguishes voiced from unvoiced sounds and speech from non-speech.

A pitch of 100 Hz corresponds to a period of 1/100 second and 250 Hz to a period of 1/250 second, that is, 10 ms and 4 ms respectively. Since the frames used in speech recognition are usually about 10 ms to 20 ms long, one frame contains at least one, and up to about five, pitch periods.

Expressed in samples, a sampling frequency of about 8 kHz to 16 kHz is generally used for speech recognition, so one frame consists of 80 to 320 samples. This means that one pitch period is formed by 16 to 320 samples. The values of adjacent samples within a pitch period characteristically do not change abruptly.
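The unit conversions in the two paragraphs above can be checked with a few lines of arithmetic. The helper names below are hypothetical and serve only as a worked example. Dividing the sampling frequency directly by the pitch frequency gives 32 samples (250 Hz at 8 kHz) to 160 samples (100 Hz at 16 kHz) per pitch period, which lies inside the 16 to 320 range stated in the text.

```python
def pitch_period_ms(f0_hz):
    """Pitch period in milliseconds for a pitch frequency in Hz."""
    return 1000.0 / f0_hz


def samples_per_frame(frame_ms, fs_hz):
    """Number of samples in a frame of frame_ms milliseconds at fs_hz."""
    return frame_ms * fs_hz // 1000


def samples_per_pitch_period(f0_hz, fs_hz):
    """Number of samples spanned by one pitch period."""
    return fs_hz / f0_hz


# Pitch range 100 Hz to 250 Hz, frame length 10 ms to 20 ms, 8 kHz to 16 kHz.
periods_ms = (pitch_period_ms(100), pitch_period_ms(250))
frame_sizes = (samples_per_frame(10, 8000), samples_per_frame(20, 16000))
period_sizes = (
    samples_per_pitch_period(250, 8000),
    samples_per_pitch_period(100, 16000),
)
```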

Using this property, the presence or absence of pitch is determined from the energy values of a few adjacent samples during the calculation of the frame energy within one frame, with no additional computation, and the result is used together with other criteria such as energy and the zero-crossing rate to classify the current frame as speech or non-speech.

To achieve the object described above, the present invention adds a pitch-based speech/non-speech classification step for frames to the real-time speech segment detection method based on frame-by-frame speech/non-speech classification.

A preferred embodiment of the present invention is described in detail below with reference to the accompanying drawings.

FIG. 1 is a flowchart illustrating a real-time speech detection process according to an embodiment of the present invention, which is described in detail below.

First, when a signal is input to an input device such as a microphone in step S100, the signal is converted from analog to digital by an A/D converter in step S101, and frames of an appropriate length, the unit of speech processing, are formed in step S102. The subsequent steps are performed frame by frame.

Next, in step S103, it is determined whether the threshold values used for speech detection have been set from the leading frames of the input signal. If they have not been set, the threshold values are set in step S104 and the process returns to step S100. If they have been set, each frame is classified as speech or non-speech in step S105, and step S106 then determines whether the start point of speech has been detected from the subsequently input frames using the determined threshold values. In real-time speech detection methods, the thresholds used for speech detection are generally obtained from the initial frames of the input signal, on the assumption that the beginning of the input signal contains silence.

If step S106 finds that the start point has not been detected, start point detection is performed in step S107 and the process returns to step S102. If the start point has been detected, step S108 determines whether the end point of speech has been detected.

If step S108 finds that the end point has not been detected, the process returns to step S102. If it has been detected, the end point of speech is detected in step S109 and speech input is then stopped automatically.
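The control flow of steps S102 to S109 can be sketched as a single loop over frames. This is a skeleton under stated assumptions rather than the patented implementation: `run_detector` and its callable parameters (`set_thresholds`, `is_speech`, `start_found`, `end_found`) are hypothetical names, the number of initial frames `n_init` is assumed to be at least 1, and the history of frame decisions is assumed to be cleared once the start point is found.

```python
def run_detector(frames, n_init, set_thresholds, is_speech, start_found, end_found):
    """Return (start_frame, end_frame); end_frame is None if input runs out.

    frames:         iterable of frames already produced by framing (S102)
    set_thresholds: callable(list_of_initial_frames) -> thresholds   (S104)
    is_speech:      callable(frame, thresholds) -> bool              (S105)
    start_found:    callable(list_of_flags) -> bool                  (S106/S107)
    end_found:      callable(list_of_flags) -> bool                  (S108/S109)
    """
    thresholds = None
    init_frames = []
    flags = []
    start = None
    for i, frame in enumerate(frames):
        if thresholds is None:                        # S103: thresholds not set yet
            init_frames.append(frame)
            if len(init_frames) == n_init:
                thresholds = set_thresholds(init_frames)  # S104
            continue
        flags.append(is_speech(frame, thresholds))    # S105
        if start is None:                             # S106
            if start_found(flags):                    # S107
                start = i
                flags = []                            # restart history for end search
        elif end_found(flags):                        # S108
            return start, i                           # S109: stop taking input
    return start, None
```

Passing the decision rules in as callables keeps the loop independent of which thresholds (energy, zero-crossing rate, pitch) are combined in step S105.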

Within this process, the present invention applies to step S104, which determines the threshold values for speech detection, and step S105, which classifies an input frame as speech or non-speech using the determined threshold values.

The application of the present invention in these two steps is described in detail below.

First, the pitch-based threshold used in the present invention is called 'alpha'. It is computed by dividing the average energy of the frames used to determine the threshold by the number of samples in one frame, which gives the average energy per sample within a frame, and multiplying the result by a weight w, as expressed in [Equation 1]. The weight w is a value determined through repeated experiments.
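Read literally, the description above amounts to alpha = w × (average frame energy) / (samples per frame). The sketch below follows that reading under stated assumptions: frame energy is assumed to be the sum of squared sample values, all initial frames are assumed to have the same length, and the names `frame_energy` and `compute_alpha` are hypothetical. The patent does not give a numeric value for w.

```python
def frame_energy(frame):
    """Energy of one frame, assumed here to be the sum of squared samples."""
    return sum(x * x for x in frame)


def compute_alpha(init_frames, w):
    """Pitch threshold alpha from the initial (assumed silent) frames.

    alpha = w * (mean frame energy over init_frames) / (samples per frame)
    """
    samples_per_frame = len(init_frames[0])
    mean_energy = sum(frame_energy(f) for f in init_frames) / len(init_frames)
    return w * mean_energy / samples_per_frame
```

Because the frame energies are already needed for the energy threshold, the only extra work for alpha is one division and one multiplication.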

FIG. 2 is a flowchart illustrating the process of classifying a frame as speech or non-speech using the pitch threshold alpha according to an embodiment of the present invention, which is described in detail below.

Using the alpha computed above, a subsequently input frame is classified as a speech frame or a non-speech frame by sequentially comparing every sample in the frame with alpha.

First, in step S201, the counter variable Counter is set to 0, and the variable Cur_S, which indicates the index of the sample currently being processed, is set to 0. When a sample value is input in step S202, Cur_S = Cur_S + 1 is computed in step S203, and step S204 determines whether Cur_S is greater than Total_S, the variable that holds the total number of samples in the current frame.

If step S204 finds that Cur_S is greater than Total_S, the current frame is classified as a non-speech frame in step S210 and the process ends. Otherwise, step S205 determines whether the sample value is greater than the previously computed alpha.

If step S205 finds that the sample value is less than or equal to alpha, Counter is reset to 0 in step S206 and the process returns to step S202. If the sample value is greater than alpha, Counter is incremented by one in step S207, and step S208 determines whether Counter is greater than the preset value n.

If step S208 finds that Counter is greater than n, the current frame is classified as a speech frame and the process ends.

The value of n depends on the sampling frequency used for A/D conversion and is usually set between 3 and 8. For example, if n is set to 5, a frame is classified as a speech frame when five or more consecutive sample values within the frame exceed alpha.
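The per-frame decision of FIG. 2 can be sketched as a run-length check. Two points are assumptions made for this illustration: the quantity compared with alpha is taken to be the squared sample value (the energy of the sample), and the decision uses "n or more consecutive samples", following the example given for n = 5, whereas step S208 is worded as "greater than n". The function name `is_speech_frame` is hypothetical.

```python
def is_speech_frame(frame, alpha, n):
    """Classify one frame using the pitch threshold alpha.

    Returns True as soon as n consecutive samples have energy above alpha,
    and False if the frame ends without such a run.
    """
    counter = 0                      # S201: length of the current run
    for x in frame:                  # S202 to S204: walk through the samples
        if x * x > alpha:            # S205: assumed comparison on sample energy
            counter += 1             # S207
            if counter >= n:         # S208, using the "n or more" wording
                return True          # speech frame
        else:
            counter = 0              # S206: run broken, start counting again
    return False                     # S210: non-speech frame
```

For instance, `is_speech_frame([0, 3, 3, 3, 0], alpha=4, n=3)` finds a run of three samples above the threshold, while `[3, 3, 0, 3, 3]` contains four such samples but no run of three, so it is treated as non-speech. This is how short, irregular bursts are rejected even when their energy is high.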

The alpha-based speech/non-speech classification of a frame is used, together with the classifications based on the other thresholds, as important input for determining the start point of speech.

The present invention as described above can be recorded on a computer-readable recording medium and processed by a computer.

As described in detail above, in a method that detects speech segments in real time from frame-by-frame speech/non-speech classification results, the present invention additionally uses the pitch information present in speech segments as a threshold for speech segment detection at very low computational cost. It can thereby accurately detect only the speech segments in a signal that is input together with irregularly occurring non-speech sounds, such as the hard disk drive noise of a notebook computer or the speaker's lip noises.

The technical idea of the present invention has been described above with reference to the accompanying drawings, but this describes the most preferred embodiment by way of example and does not limit the invention. It is also evident that anyone with ordinary skill in this technical field can make various modifications and imitations without departing from the scope of the technical idea of the present invention.

Claims

A speech detection apparatus comprising a speech input unit to which a speech signal is input, an A/D converter that converts the analog speech signal received by the speech input unit into a digital speech signal, a start point detector that detects the start point of speech in the digital speech signal produced by the A/D converter, and an end point detector that detects the end point of speech in the digital speech signal produced by the A/D converter,

wherein the start point detector comprises:

frame generation means for generating frames of a predetermined length from the digital speech signal produced by the A/D converter;

threshold setting means for extracting samples from the leading frames among the frames generated by the frame generation means and determining a pitch threshold using only the energy values of the extracted samples; and

detection means for detecting a frame as the speech start point if the number of samples in that frame exceeding the threshold determined by the threshold setting means is equal to or greater than a predetermined number n.

The speech detection apparatus of claim 1,

wherein the threshold setting means

determines the pitch threshold alpha using the average energy of the samples in a frame of a predetermined length, as shown in [Equation 1] below.

[Equation 1]

Here, w is a value determined in advance through repeated experiments.

(Deleted)

The speech detection apparatus of claim 1 or 2,

wherein the value of n is set to an integer from 3 to 8.

A speech detection method comprising a speech input step in which a speech signal is input, an A/D conversion step of converting the analog speech signal input in the speech input step into a digital speech signal, a start point detection step of detecting the start point of speech from the digital speech signal converted in the A/D conversion step, and an end point detection step of detecting the end point of speech from the digital speech signal converted in the A/D conversion step,

wherein the start point detection step comprises:

a first step of generating frames of a predetermined length from the digital speech signal converted in the A/D conversion step;

a second step of extracting samples from the leading frames among the frames generated in the first step and determining a pitch threshold using only the energy values of the extracted samples; and

a third step of detecting a frame as the speech start point if the number of samples in that frame exceeding the threshold determined in the second step is equal to or greater than a predetermined number n.

The speech detection method of claim 5,

wherein the second step

determines the pitch threshold alpha using the average energy of the samples in a frame of a predetermined length, as shown in [Equation 2] below.

[Equation 2]

Here, w is a value determined in advance through repeated experiments.

(Deleted)

The speech detection method of claim 5 or 6,

wherein the value of n is set to an integer from 3 to 8.

A computer-readable recording medium having recorded thereon a program for causing a computer to execute:

a first step in which a speech signal is input;

a second step of converting the analog speech signal input in the first step into a digital speech signal;

a third step of generating frames of a predetermined length from the digital speech signal converted in the second step;

a fourth step of extracting samples from the leading frames among the frames generated in the third step and then determining a pitch threshold according to [Equation 3] below; and

a fifth step of classifying a frame as a speech frame if the number of samples in that frame exceeding the pitch threshold determined in the fourth step is equal to or greater than a predetermined number n.

[Equation 3]

Here, w is a value determined in advance through repeated experiments.

The computer-readable recording medium of claim 9,

wherein, in the fifth step,

n is set to an integer from 3 to 8.