KR20040082756A

KR20040082756A - Method for Speech Detection Using Removing Noise

Info

Publication number: KR20040082756A
Application number: KR1020030017408A
Authority: KR
Inventors: 장경애; 전호현
Original assignee: 주식회사 케이티
Priority date: 2003-03-20
Filing date: 2003-03-20
Publication date: 2004-09-30
Also published as: KR100574883B1

Abstract

PURPOSE: A speech extraction method based on non-voice removal is provided that removes non-voice such as noise and echo so that speech is extracted efficiently. CONSTITUTION: An external speech signal is received and monitored, and an expected start point presumed to be voice is detected (201). It is determined whether the frame corresponding to the expected start point satisfies a frame size boundary condition and a cumulative energy boundary condition (202). If the frame does not satisfy the conditions, it is judged to be non-voice and removed (205). If the frame satisfies the conditions, the expected start point is confirmed as the start point, and an end point is detected to extract the speech signal (204).

Description

Speech extraction method using non-voice removal {Method for Speech Detection Using Removing Noise}

The present invention relates to a method for extracting speech by removing non-voice, and to a computer-readable recording medium on which a program for realizing the method is recorded.

Speech recognition is a method of extracting the linguistic information contained in human speech; it refers to technology that analyzes the features of speech input through a microphone, headset, telephone, or the like and performs a specific operation. Speech recognition technology is widely used in fields closely tied to everyday life, including home automation, speech-recognition toys, speech-recognition language-learning devices, speech-recognition browsers, speech-recognition games, speech-recognition mobile communication terminals, speech-recognition home appliances, stock trading systems, and automated guidance systems.

In general, speech recognition proceeds through four processes: a speech input process, in which speech is received through a microphone, headset, telephone, or the like; a speech extraction process, in which only the speech, excluding noise, is extracted from the input; a speech feature extraction process, a kind of speech compression in which the human vocal organs are modeled to find filter coefficients; and a speech recognition process, in which a recognition algorithm recognizes the speech and decides between recognition and misrecognition. The speech signal received in the input process contains the signal produced by the user's utterance, that is, voice, but it may also contain non-voice such as noise and echo. Noise means car noise, music, and other sounds that can arise in the speaker's surroundings, as well as noise signals that can arise in the telephone network. Echo means the signal produced when the voice of a scenario announcement from an application service that uses speech recognition is reflected back through the communication circuit.

Conventionally, because ordinary noise is small relative to the speaker's voice, the speaker's speech was extracted by examining energy magnitude alone. However, noise with momentarily large energy (sparking noise), such as a car horn, has a momentarily large unit-frame energy and is therefore extracted along with the speaker's speech. Sparking noise can be removed by the speech recognition algorithm, but doing so increases the load on the speech recognizer and lengthens processing time.

In addition, because echo has characteristics similar to the speaker's voice, a speech recognition algorithm that relies on a noise model has difficulty distinguishing it from speech, which degrades the recognizer's performance and lengthens processing time. As the speech recognizer platform shifts from dedicated hardware to personal computers fitted with a speech processing board, a method that solves these problems is increasingly needed.

The present invention has been proposed to solve the above problems. Its object is to provide a method for extracting speech by non-voice removal, which extracts speech efficiently by removing non-voice such as noise or echo in, for example, the speech extraction apparatus of a speech recognition system, and a computer-readable recording medium on which a program for realizing the method is recorded.

FIG. 1 is a block diagram of an embodiment of a speech recognition system to which the present invention is applied.

FIG. 2 is a flowchart of an embodiment of the method for extracting speech by non-voice removal according to the present invention.

FIGS. 3A and 3B show a detailed example of the method for extracting speech by non-voice removal according to the present invention.

* Explanation of reference numerals for the main parts of the drawings

10: speech recognition system

11: speech input unit

12: A/D converter

13: speech extraction unit

14: speech recognition unit

To achieve the above object, the present invention provides a speech extraction method applied to an apparatus that extracts speech by non-voice removal, the method comprising: a first step of receiving a speech signal from the outside and monitoring for and detecting an expected start point presumed to be voice; a second step of determining whether the frame corresponding to the expected start point satisfies a frame size boundary condition and a cumulative energy boundary condition based on voice features; a third step of, if the determination in the second step is that the conditions are not satisfied, judging the frame to be non-voice and removing it; and a fourth step of, if the determination in the second step is that the conditions are satisfied, confirming the expected start point as the start point and detecting an end point to extract the speech signal.

The present invention also provides a computer-readable recording medium on which is recorded a program for causing a speech extraction apparatus having a processor to realize: a first function of receiving a speech signal from the outside and monitoring for and detecting an expected start point presumed to be voice; a second function of determining whether the frame corresponding to the expected start point satisfies a frame size boundary condition and a cumulative energy boundary condition based on voice features; a third function of, if the determination in the second function is that the conditions are not satisfied, judging the frame to be non-voice and removing it; and a fourth function of, if the determination in the second function is that the conditions are satisfied, confirming the expected start point as the start point and detecting an end point to extract the speech signal.

The above objects, features, and advantages will become clearer from the following detailed description taken together with the accompanying drawings. A preferred embodiment of the present invention is described in detail below with reference to the accompanying drawings.

FIG. 1 is a block diagram of an embodiment of a speech recognition system to which the present invention is applied.

As shown in FIG. 1, the speech recognition system (10) to which the present invention is applied includes: a speech input unit (11) for receiving speech from the outside; an A/D converter (12) for converting the analog speech signal received from the speech input unit (11) into a digital speech signal; a speech extraction unit (13) for receiving the digitized speech signal from the A/D converter (12) and extracting speech by removing non-voice such as noise and echo; and a speech recognition unit (14) for receiving the extracted speech from the speech extraction unit (13), extracting speech features, and recognizing the speech with a speech recognition algorithm. The speech extraction method according to the present invention is performed in the speech extraction unit (13).

The speech extraction unit (13) receives the digitized speech signal from the A/D converter (12) and removes non-voice such as noise and echo. Sparking noise, which has momentarily large energy, satisfies the cumulative energy boundary condition because its unit-frame energy is large, but it has a limited frame size, and this property is used to distinguish it from a speech signal. Echo resembles speech and therefore satisfies the frame size boundary condition, but its accumulated energy differs from that of a speech signal, and this property is used to distinguish it from a speech signal. Because the speech extraction unit (13) removes as much non-voice as possible, the load on the speech recognition unit (14) is reduced, which in turn improves speech processing performance.
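The two boundary conditions described above can be illustrated with a minimal sketch. It assumes that the frame size boundary condition is a minimum number of frames and that the cumulative energy boundary condition is a minimum summed energy; the patent does not state the form or the values of either bound, so `Bounds`, `min_frames`, `min_cum_energy`, and `classify_segment` are hypothetical names chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Bounds:
    # Both fields are assumptions: the patent names the two boundary
    # conditions but gives neither their form nor their values.
    min_frames: int        # hypothetical frame size boundary
    min_cum_energy: float  # hypothetical cumulative energy boundary


def classify_segment(frame_energies, bounds):
    """Label a candidate segment from its per-frame energies."""
    size_ok = len(frame_energies) >= bounds.min_frames
    energy_ok = sum(frame_energies) >= bounds.min_cum_energy
    if size_ok and energy_ok:
        return "voice"
    if energy_ok:
        # Enough energy but too few frames: the sparking-noise pattern.
        return "sparking_noise"
    if size_ok:
        # Enough frames but too little accumulated energy: the echo pattern.
        return "echo"
    return "non_voice"
```

Under these assumptions a short, loud burst such as `[9.0, 8.0]` fails only the frame size test, while a long, weak segment fails only the cumulative energy test, which mirrors the two rejection cases in the description.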

FIG. 2 is a flowchart of an embodiment of the method for extracting speech by non-voice removal according to the present invention.

The method for extracting speech by non-voice removal according to the present invention consists broadly of three processes: monitoring for the first frame that can be expected to be voice; judging voice versus non-voice according to the frame size boundary condition and the cumulative energy boundary condition; and, when the result is non-voice, repeating the procedure from the frame-monitoring process.

First, speech is received from the outside and monitored for the first frame that can be expected to be voice, that is, a frame satisfying the expected start point determination condition, and the expected start point is detected (201). Once an expected start point is detected, it is determined whether the frame corresponding to it satisfies the frame size boundary condition and the cumulative energy boundary condition based on voice features (202). If the determination (202) is that the voice features are satisfied, the expected start point is confirmed as the start point, and an end point is detected to extract the speech signal (204). If the determination (202) is that the voice features are not satisfied, the frame is judged to be non-voice and removed, and initialization is performed (205). That is, sparking noise satisfies the cumulative energy boundary condition but not the frame size boundary condition, and echo satisfies the frame size boundary condition but not the cumulative energy boundary condition, so both are judged to be non-voice that does not satisfy the voice features and are removed. The procedure then returns to step 201 and resumes the process of extracting speech.
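The flow of steps 201 to 205 can be sketched as a loop over per-frame energies. This is an illustrative reading, not the patented implementation: it assumes that an expected start point is a frame whose energy reaches an upper threshold, that a candidate ends when energy falls below a lower threshold, and that both boundary conditions are minimums. The function name and all parameters are hypothetical.

```python
def extract_speech(energies, thresh_high, thresh_low, min_frames, min_cum_energy):
    """Return (start, end) frame indices of the first accepted segment, or None.

    `end` is exclusive. All thresholds are illustrative assumptions.
    """
    i = 0
    n = len(energies)
    while i < n:
        # Step 201: monitor for an expected start point.
        if energies[i] < thresh_high:
            i += 1
            continue
        start = i
        # Extend the candidate while energy stays at or above the lower threshold.
        end = start
        while end < n and energies[end] >= thresh_low:
            end += 1
        segment = energies[start:end]
        # Step 202: check both boundary conditions.
        if len(segment) >= min_frames and sum(segment) >= min_cum_energy:
            # Step 204: the expected start point becomes the start point.
            return start, end
        # Step 205: discard the candidate as non-voice and resume at step 201.
        i = max(end, start + 1)
    return None
```

For example, with `energies = [0.1, 9.0, 0.1, 0.1, 3.0, 3.0, 3.0, 3.0, 3.0, 0.1]`, an upper threshold of 2.0, a lower threshold of 1.0, and bounds of 3 frames and 10.0 cumulative energy, the single loud frame at index 1 is rejected as too short and the run starting at index 4 is accepted.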

FIGS. 3A and 3B show a detailed example of the method for extracting speech by non-voice removal according to the present invention.

First, consider the comparison performed in each state. State 1 (308) compares whether the frame counter (Env_Count) of the speech signal data is less than the preset frame length (ENV_LEN). State 2 (310) compares whether the energy of the input unit frame is less than the lower energy threshold (ThreshEL) set as a reference value. State 3 (314) and State 4 (316) compare whether the energy of the input unit frame is greater than the upper energy threshold (ThreshEH) set as a reference value. Finally, State 5 (321) compares whether the zero crossing rate (ZCR) of the unit frame is less than a preset ZCR value. Here, the zero crossing rate is the number of times adjacent samples of the signal differ in sign; it carries frequency information about the signal and is generally low for voiced sounds and high for unvoiced sounds. A concrete example of the method for extracting speech by non-voice removal according to an embodiment of the present invention is described below.
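The two per-frame quantities compared in these states can be computed as in the following sketch. It assumes a frame is a sequence of signed samples, takes frame energy as the sum of squared samples (the patent does not define its energy measure), and treats a zero sample as non-negative when counting sign changes. The function names are hypothetical.

```python
def frame_energy(frame):
    """Sum of squared samples; one common energy measure (an assumption here)."""
    return sum(s * s for s in frame)


def zero_crossing_count(frame):
    """Number of adjacent sample pairs whose signs differ.

    A sample equal to zero is treated as non-negative, which is a
    convention chosen for this sketch.
    """
    return sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
```

A rapidly alternating frame such as `[1, -1, 1, -1]` yields a high count, consistent with the description that unvoiced sounds show high zero crossing values, while a frame that stays on one side of zero yields a count of 0.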

First, the speech extraction unit (13) receives speech data (301) and checks whether it is unprocessed data (302); if it is, the unit checks whether an expected start point exists (303). If the check (303) finds an expected start point, the unit determines whether the voice feature or the non-voice feature is satisfied: if the voice feature is satisfied, the start point is determined, and if the non-voice feature is satisfied, the search for a start point resumes (304 to 307). When neither the voice feature nor the non-voice feature is satisfied, a fixed interval beginning at the first frame of the speech data is set as the noise characteristic section and the reference values are initialized (308, 309).
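One way to picture the initialization of reference values from the noise characteristic section is sketched below. The patent does not say how the reference values are derived, so this is purely an assumption: the mean energy of the first `env_len` frames is taken as a noise floor, and the lower and upper thresholds are set as fixed multiples of it. `init_thresholds`, `low_factor`, and `high_factor` are hypothetical.

```python
def init_thresholds(frame_energies, env_len, low_factor=2.0, high_factor=4.0):
    """Derive (lower, upper) energy thresholds from the leading frames.

    The first `env_len` frames are treated as the noise characteristic
    section. The scaling factors are illustrative assumptions, not values
    taken from the patent.
    """
    noise = frame_energies[:env_len]
    if not noise:
        raise ValueError("at least one frame is needed to estimate noise")
    noise_floor = sum(noise) / len(noise)
    return noise_floor * low_factor, noise_floor * high_factor
```

With `init_thresholds([1.0, 1.0, 1.0, 1.0, 50.0], env_len=4)`, only the first four frames contribute, so the loud fifth frame does not inflate the thresholds.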

Then, State 2 (310) compares whether the unit-frame energy of the speech signal data is below the lower energy threshold (ThreshEL) set as a reference value. If the unit-frame energy is below ThreshEL and a start point exists, an end point satisfying the end point determination condition is detected and extraction of the speech signal is completed (310 to 313).

If, on the other hand, State 2 (310) finds that the unit-frame energy of the speech signal data is at or above the lower energy threshold (ThreshEL), State 3 (314) and State 4 (316) compare whether the unit-frame energy is at or above the upper energy threshold (ThreshEH) set as a reference value. If it is, an expected start point satisfying the expected start point determination condition is determined, and preparation is made for removing the non-voice section (316 to 320).

The method of the present invention described above may be implemented as a program and stored on a computer-readable recording medium (CD-ROM, RAM, ROM, floppy disk, hard disk, magneto-optical disk, and the like). Because a person of ordinary skill in the art to which the present invention pertains can readily carry out this process, it is not described in further detail.

A person of ordinary skill in the art to which the present invention pertains can make various substitutions, modifications, and changes to the invention described above without departing from its technical spirit, so the invention is not limited by the foregoing embodiment or the accompanying drawings.

As described above, by providing a method for removing sparking noise (noise with momentarily large energy) and echo, the present invention reduces the load that arises from processing non-voice signals and thereby improves speech recognition performance.

Claims

A speech extraction method applied to an apparatus that extracts speech by non-voice removal, the method comprising:

a first step of receiving a speech signal from the outside and monitoring for and detecting an expected start point presumed to be voice;

a second step of determining whether the frame corresponding to the expected start point satisfies a frame size boundary condition and a cumulative energy boundary condition based on voice features;

a third step of, if the determination in the second step is that the conditions are not satisfied, judging the frame to be non-voice and removing it; and

a fourth step of, if the determination in the second step is that the conditions are satisfied, confirming the expected start point as the start point and detecting an end point to extract the speech signal.

A computer-readable recording medium on which is recorded a program for causing a speech extraction apparatus having a processor to realize:

a first function of receiving a speech signal from the outside and monitoring for and detecting an expected start point presumed to be voice;

a second function of determining whether the frame corresponding to the expected start point satisfies a frame size boundary condition and a cumulative energy boundary condition based on voice features;

a third function of, if the determination in the second function is that the conditions are not satisfied, judging the frame to be non-voice and removing it; and

a fourth function of, if the determination in the second function is that the conditions are satisfied, confirming the expected start point as the start point and detecting an end point to extract the speech signal.