KR20120098211A

KR20120098211A - Method for voice recognition and apparatus for voice recognition thereof

Info

Publication number: KR20120098211A
Application number: KR1020110018012A
Authority: KR
Inventors: 조재연
Original assignee: 삼성전자주식회사
Priority date: 2011-02-28
Filing date: 2011-02-28
Publication date: 2012-09-05
Also published as: KR101811716B1

Abstract

PURPOSE: A voice recognition method and an apparatus therefor are provided to quickly and accurately extract a voice section. CONSTITUTION: A target signal extracting unit (135) outputs a target voice signal. A control unit (140) calculates a first power, which is the power of the target voice signal, and a second power, which is the power of the voice signal that has not passed through the target signal extracting unit. The control unit calculates the ratio of the first power to the second power and detects a voice section based on that ratio. A voice recognizing unit (150) recognizes the voice signal that exists in the voice section. [Reference numerals] (110) Voice input unit; (130) Voice recognizing preprocessing unit; (135) Target signal extracting unit; (140) Control unit; (150) Voice recognizing unit; (AA) Voice signal

Description

Voice recognition method and voice recognition apparatus therefor {METHOD FOR VOICE RECOGNITION AND APPARATUS FOR VOICE RECOGNITION THEREOF}

The present invention relates to a voice recognition method and a corresponding voice recognition apparatus, and more particularly to a voice recognition method and apparatus capable of improving the voice recognition rate.

Voice recognition technology recognizes a voice signal input by a user or the like as a signal corresponding to a predetermined language. It may be used, for example, to control the operation of an electronic device, as in a voice recognition remote controller.

To perform voice recognition, the section of the voice signal to be recognized must first be extracted. The step of extracting the signal section that contains the voice signal to be recognized is called the voice recognition preprocessing step.

In addition, to improve the recognition rate, speech enhancement, a technique that extracts a clean voice signal by removing the noise mixed into the input voice signal, may be used in the voice recognition preprocessing step. Examples of speech enhancement techniques include noise suppression, which removes stationary noise; source separation, which inverts the process by which the noise and the voice signal were mixed; and microphone array processing, which assumes that the noise arrives from a direction different from that of the desired voice signal and filters the signal by direction.

If the processing for voice recognition is performed knowing exactly in which section the voice signal exists, noise can be removed more effectively, and the accuracy of voice recognition improves accordingly.

Therefore, to improve the voice recognition rate, the voice section, that is, the section in which the voice signal exists, must be detected accurately in the voice recognition preprocessing step.

An object of an embodiment of the present invention is to provide a voice recognition method and a corresponding voice recognition apparatus that can quickly and accurately extract the voice section to be recognized even in an environment where noise is mixed in.

Another object of an embodiment of the present invention is to provide a voice recognition method and a corresponding voice recognition apparatus that improve the voice recognition rate by increasing the accuracy of voice recognition.

A voice recognition apparatus according to an embodiment of the present invention includes: a voice input unit that receives at least one voice signal; a target signal extractor that extracts a target voice component to be recognized from the voice signal and outputs a target voice signal; a controller that calculates a first power, which is the power of the target voice signal, and a second power, which is the power of the voice signal that has not passed through the target signal extractor, calculates a ratio between the second power and the first power, and detects, based on the ratio, a voice section, which is a section containing the target voice component; and a voice recognition unit that recognizes the voice signal present in the voice section.

Preferably, the target signal extractor may beamform the target voice component and output the beamformed target voice component as the target voice signal.

Preferably, the controller may calculate the ratio of the first power to the second power (first power / second power) and, when the ratio is greater than or equal to (or exceeds) a predetermined threshold, determine that the section is a voice section.

Preferably, the controller may calculate the first power and the second power in units of at least one frame and determine, in units of at least one frame, whether the section is a voice section.

Preferably, when the ratio of the first power to the second power (first power / second power) is less than (or less than or equal to) the predetermined threshold, the controller may determine that point to be the end point of the voice section.

Preferably, the ratio of the first power to the second power may be calculated on a logarithmic scale.

Preferably, the voice input unit may include a microphone array that includes at least one microphone and receives the at least one voice signal through the at least one microphone.

In addition, the voice recognition apparatus according to an embodiment of the present invention may further include: a time axis alignment unit that synchronizes and outputs the at least one voice signal; a target signal blocking unit that blocks the target voice component in the voice signal output from the time axis alignment unit; and an adaptive filter unit that updates the coefficients of an adaptive filter so that the power of the signal output from the target signal blocking unit is minimized, and performs adaptive filtering by applying those coefficients to the output signal of the target signal blocking unit.

Preferably, when the end point of the voice section is detected, the controller may control the adaptive filter unit so that the adaptive filter coefficients applied to sections other than the voice section are updated.

In addition, the voice recognition apparatus according to an embodiment of the present invention may further include a signal subtractor that subtracts the output signal of the adaptive filter unit from the target voice signal and outputs the result.

Preferably, the controller may calculate, as the second power, the power of any one of the output signal of the time axis alignment unit, the output signal of the target signal blocking unit, and the output signal of the adaptive filter unit.

Preferably, when detection of the voice section is complete, the controller may output to the voice recognition unit a control signal requesting that noise removal be performed differently in the voice section and in sections other than the voice section.

A voice recognition method according to an embodiment of the present invention includes: receiving at least one voice signal; extracting a target voice component to be recognized from the voice signal and outputting a target voice signal; calculating a first power, which is the power of the target voice signal, and a second power, which is the power of the voice signal; calculating a ratio between the second power and the first power and detecting, based on the ratio, a voice section, which is a section containing the target voice component; and recognizing the voice signal present in the voice section.

FIG. 1 is a block diagram of a voice recognition apparatus according to an embodiment of the present invention.
FIG. 2 is a block diagram showing the voice recognition apparatus of FIG. 1 in more detail.
FIG. 3 is a flowchart of a voice recognition method according to an embodiment of the present invention.
FIG. 4 is a flowchart showing the voice recognition method of FIG. 3 in more detail.

Hereinafter, a voice recognition method and a corresponding voice recognition apparatus according to the present invention are described in detail with reference to the accompanying drawings.

The voice recognition apparatus receives a voice signal input through a voice input device such as a microphone. It then extracts from the received voice signal the target voice section, that is, the section of the voice signal that the user intended to input, performs processing such as removing the noise present in the detected target voice section, and finally determines or recognizes the word or command corresponding to the voice signal. It may further perform a predetermined operation corresponding to the recognized word or command.

As described above, a voice recognition preprocessing operation that applies speech enhancement techniques may be performed before the operation of recognizing the word or command corresponding to the voice signal (hereinafter, the 'voice recognition operation').

FIG. 1 is a block diagram of a voice recognition apparatus according to an embodiment of the present invention.

Referring to FIG. 1, the voice recognition apparatus 100 according to an embodiment of the present invention includes a voice input unit 110, a voice recognition preprocessor 130, and a voice recognition unit 150.

The voice input unit 110 receives a voice signal generated by a user or the like.

The voice recognition preprocessor 130 detects, in the received voice signal, the voice section, that is, the section of the voice signal that the user intended to input.

The voice recognition preprocessor 130 may perform adaptive mode control using an adaptive filter for voice recognition preprocessing. FIG. 2 below is described using the example in which the voice recognition preprocessor 130 includes an adaptive filter unit 245 and performs adaptive mode control.

The voice recognition preprocessor 130 also includes a target signal extractor 135 and a controller 140. The target signal extractor 135 and the controller 140 correspond to the target signal extractor 235 and the controller 240 described later with reference to FIG. 2.

The target signal extractor 135 receives the voice signal input through the voice input unit 110, extracts the target voice component to be recognized from it, and outputs the result as the target voice signal.

The target signal extractor 135 may be a beamformer that performs a beamforming operation. In this case, the target signal extractor 135 beamforms the input voice signal and outputs the beamformed voice signal as the target voice signal. Specifically, the target signal extractor 135 may be a fixed beamformer that performs fixed beamforming.

The controller 140 calculates a first power, which is the power of the target voice signal, and a second power, which is the power of the voice signal that has not passed through the target signal extractor 135, calculates the ratio between the second power and the first power, and, based on the calculated ratio, detects the voice section, that is, the section containing the target voice component.

The voice recognition unit 150 recognizes the word or command present in the voice section detected by the voice recognition preprocessor 130.

The detailed configuration and operation of the voice recognition apparatus 100 are described below with reference to FIG. 2.

FIG. 2 is a block diagram showing the voice recognition apparatus of FIG. 1 in more detail.

Referring to FIG. 2, the voice input unit 110 may include a microphone array 210. The microphone array includes two or more microphones and receives a voice signal through each microphone. Accordingly, the voice input unit 110 may receive at least one voice signal.

The voice recognition preprocessor 130 may include a target signal extractor 235 and a controller 240. It may further include a time axis alignment unit 231, an adaptive filter unit 245, a target signal blocking unit 250, and a signal subtractor 255.

When the voice input unit 110 is formed as the microphone array 210, the time axis alignment unit 231 synchronizes the at least one input voice signal and outputs the result. Specifically, the time axis alignment unit 231 compensates for the input delays of the multiple voice signals input through the microphone array 210. That is, when a voice signal from the desired direction is input through the microphone array 210, the time axis alignment unit 231 synchronizes the times at which that voice signal reaches each microphone in the microphone array 210.
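The delay compensation described above can be illustrated with a minimal sketch. It assumes that the arrival delay of the target at each microphone is already known as a whole number of samples; the function name `align_channels` and the integer-delay model are illustrative assumptions, not details given in the source.

```python
from typing import List, Sequence


def align_channels(channels: Sequence[Sequence[float]],
                   delays: Sequence[int]) -> List[List[float]]:
    """Advance each channel by its arrival delay (in samples).

    After the shift, a sound from the desired direction appears at the
    same sample index in every channel.
    """
    if len(channels) != len(delays):
        raise ValueError("one delay per channel is required")
    if any(d < 0 for d in delays):
        raise ValueError("delays must be non-negative sample counts")
    # Keep only the span that every channel still covers after shifting.
    length = min(len(ch) - d for ch, d in zip(channels, delays))
    if length <= 0:
        raise ValueError("delays are longer than the available signal")
    return [list(ch[d:d + length]) for ch, d in zip(channels, delays)]
```

With a pulse that reaches the second microphone two samples later than the first, passing `delays=[0, 2]` yields two channels in which the pulse sits at the same index.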

Here, the voice signal from the desired direction is the target voice component to be recognized by the voice recognition apparatus 100, that is, the voice signal uttered by the user to express an intended word or command. Hereinafter, for convenience, the voice signal from the desired direction that is received by the voice input unit 110 and is to be recognized by the voice recognition apparatus 100 is referred to as the 'target voice component'.

The target signal extractor 235 extracts the target voice component to be recognized from the voice signal. Specifically, the target signal extractor 235 may beamform the target voice component in the voice signal. That is, the target signal extractor 235 enhances and outputs the target voice component, which is the voice signal from the desired direction among the voice signals input to the voice input unit 110. For example, when one sound source (voice signal) is picked up by two or more spatially separated microphones, the voice signal from the desired direction can be enhanced. Hereinafter, the output signal of the target signal extractor 235 is referred to as the target voice signal.
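As a rough illustration of how a fixed beamformer can enhance the desired direction, the sketch below averages channels that have already been time-aligned (delay-and-sum). The source only states that a fixed beamformer may be used; delay-and-sum and the name `fixed_beamformer` are assumptions made for this example.

```python
from typing import List, Sequence


def fixed_beamformer(aligned: Sequence[Sequence[float]]) -> List[float]:
    """Average time-aligned channels sample by sample (delay-and-sum).

    The target is identical in every aligned channel and is preserved,
    while components that differ between channels partly cancel.
    """
    if not aligned:
        raise ValueError("at least one channel is required")
    length = len(aligned[0])
    if any(len(ch) != length for ch in aligned):
        raise ValueError("aligned channels must have equal length")
    count = len(aligned)
    return [sum(ch[i] for ch in aligned) / count for i in range(length)]
```

In the idealized case where the interference has opposite sign at the two microphones, the average returns the target exactly; real noise is only attenuated, not removed.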

When receiving a word or the like from the user, the voice input unit 110 may also receive, together with the target voice component, unwanted noise or a voice signal arriving from an unwanted direction. An example of a voice signal arriving from an unwanted direction is a voice signal produced by someone other than the user. When the voice input unit 110 receives a voice signal containing such noise, the target signal extractor 235 may enhance the target voice component and attenuate the noise and other components that are not the target voice component.

The target signal blocking unit 250 blocks the target voice component, which is the signal from the desired direction, in the voice signal output from the time axis alignment unit 231. The signal output from the target signal blocking unit 250 is therefore the voice signal input to the voice input unit 110 with the target voice component removed.

The adaptive filter unit 245 receives the signal output from the target signal blocking unit 250 and updates the coefficients of the adaptive filter so that the power of the output signal of the target signal blocking unit 250 is minimized. Whether to update the coefficients may be determined under the control of the controller 240. Under the control of the controller 240, the adaptive filter unit 245 then performs adaptive filtering by applying the updated coefficients to the output signal of the target signal blocking unit 250.

The signal subtractor 255 receives the output signal of the target signal extractor 235 and the output signal of the adaptive filter unit 245, subtracts the latter from the former, and outputs the result. The signal output from the signal subtractor 255 may therefore be a signal in which noise and signals from unwanted directions have been removed from the voice signal input to the voice input unit 110 and the target voice component is emphasized.
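The chain of target signal extractor, target signal blocking unit, adaptive filter unit, and signal subtractor can be sketched for two time-aligned microphones as follows. Several choices here are assumptions not stated in the source: the blocking path is a simple channel difference, the adaptation rule is normalized LMS (NLMS) driven by the subtractor output as in a conventional generalized sidelobe canceller, and `gsc_two_mics`, `taps`, `mu`, and `eps` are hypothetical names and parameters. The per-sample `adapt` flag stands in for the controller deciding whether coefficients may be updated.

```python
from typing import List, Sequence


def gsc_two_mics(mic0: Sequence[float], mic1: Sequence[float],
                 adapt: Sequence[bool], taps: int = 4,
                 mu: float = 0.5, eps: float = 1e-8) -> List[float]:
    """Two-microphone noise canceller on time-aligned inputs (sketch)."""
    weights = [0.0] * taps
    history = [0.0] * taps  # recent blocking-path samples, newest first
    output: List[float] = []
    for x0, x1, may_adapt in zip(mic0, mic1, adapt):
        beam = 0.5 * (x0 + x1)   # target signal extractor (fixed beamformer)
        blocked = x0 - x1        # target signal blocking unit: target cancels
        history = [blocked] + history[:-1]
        # Adaptive filter unit: estimate of the noise left in the beam.
        noise_estimate = sum(w * h for w, h in zip(weights, history))
        out = beam - noise_estimate  # signal subtractor
        if may_adapt:
            # NLMS update (assumed rule); skipped while the flag is False.
            norm = sum(h * h for h in history) + eps
            weights = [w + mu * out * h / norm
                       for w, h in zip(weights, history)]
        output.append(out)
    return output
```

Setting `adapt` to `True` only outside detected voice sections keeps the filter from cancelling the target itself, which is the usual motivation for gating the update.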

The controller 240 calculates the power of the target voice signal. This power is referred to as the first power (Py).

The controller 240 also calculates the power of the voice signal that has not passed through the target signal extractor 235, that is, the power of a voice signal from which the target voice component has not been extracted. This power is referred to as the second power (Pt). The voice signal that has not passed through the target signal extractor 235 may be the output signal of the time axis alignment unit 231, that is, the signal observed at the first point N1 shown in FIG. 2, in which case the power of that output signal is calculated as the second power (Pt).

Alternatively, the power of the signal observed at the third point N3 shown in FIG. 2, which is the output signal of the target signal blocking unit 250, or the power of the signal observed at the second point N2 shown in FIG. 2, which is the output signal of the adaptive filter unit 245, may be calculated as the second power (Pt).

In other words, the first power (Py) is the power of the signal in which the target voice component has been enhanced. The second power (Pt) is the power of a signal output from a component other than the target signal extractor 235, that is, a signal in which the target voice component has not been enhanced. It may be the power of any one of the output signal of the time axis alignment unit 231, the output signal of the target signal blocking unit 250, or the output signal of the adaptive filter unit 245.

The controller 240 also calculates the ratio between the second power (Pt) and the first power (Py). Specifically, the controller 240 may calculate the ratio of the first power to the second power, that is, first power / second power = Py/Pt. This ratio may be calculated on a log scale, for example as log(Py/Pt) = log Py - log Pt.
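The log-scale ratio can be computed directly from two frames of samples. The sketch below takes power to be the mean of the squared samples and uses a base-10 logarithm; both are assumptions, as is the small `floor` value that avoids taking the logarithm of zero. The function names are illustrative.

```python
import math
from typing import Sequence


def mean_power(frame: Sequence[float]) -> float:
    """Mean squared amplitude of one frame."""
    return sum(x * x for x in frame) / len(frame)


def log_power_ratio(target_frame: Sequence[float],
                    reference_frame: Sequence[float],
                    floor: float = 1e-12) -> float:
    """Return log10(Py) - log10(Pt), i.e. log10(Py / Pt).

    target_frame:    samples of the target voice signal (gives Py)
    reference_frame: samples of a signal that bypassed the target
                     signal extractor (gives Pt)
    """
    py = max(mean_power(target_frame), floor)
    pt = max(mean_power(reference_frame), floor)
    return math.log10(py) - math.log10(pt)
```

Working with the difference of logarithms avoids a division and compresses the very large ratios that occur when Pt is close to zero.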

The first power (Py), the second power (Pt), and the power ratio (Py/Pt) may be calculated for each frame of the voice signal. The length of one frame may vary with the sampling rate, that is, the rate at which the voice recognition preprocessor 130 processes the voice signal, or with the operating frequency of the voice recognition apparatus 100, and may be, for example, 10 to 20 ms.

The first power (Py), the second power (Pt), and the power ratio (Py/Pt) may also be calculated once per predetermined number of frames, for example in units of three frames.
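Frame-wise power calculation can be sketched as below. The helper `frame_powers` and its parameters are hypothetical; at an assumed sampling rate of 16 kHz, a 10 ms frame would correspond to `frame_len=160`. A trailing partial unit is dropped, which is a choice made for this example.

```python
from typing import List, Sequence


def frame_powers(signal: Sequence[float], frame_len: int,
                 frames_per_unit: int = 1) -> List[float]:
    """Mean power of each consecutive unit of `frames_per_unit` frames."""
    unit = frame_len * frames_per_unit
    if unit <= 0:
        raise ValueError("frame_len and frames_per_unit must be positive")
    powers: List[float] = []
    # Step through the signal one whole unit at a time.
    for start in range(0, len(signal) - unit + 1, unit):
        chunk = signal[start:start + unit]
        powers.append(sum(x * x for x in chunk) / unit)
    return powers
```

Applying the same function to the target voice signal and to the bypass signal gives per-frame Py and Pt sequences of equal length when the inputs have equal length.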

Based on the calculated ratio, the controller 240 detects the voice section, that is, the section containing the target voice component.

For example, when the voice signal input to the voice input unit 110 contains only the target voice component and no noise at all, the first power (Py) > 0 and the second power (Pt) = 0, so the ratio (Py/Pt) is infinite.

When the voice signal input to the voice input unit 110 contains no target voice component and only noise or a signal from an unwanted direction, the first power (Py) = 0 and the second power (Pt) > 0, so the ratio (Py/Pt) is 0.

When the voice signal input to the voice input unit 110 contains both the target voice component and noise or other signals from unwanted directions, the first power (Py) > 0 and the second power (Pt) > 0, so the ratio (Py/Pt) > 0. Even in this mixed case, the ratio (Py/Pt) is close to 0 when the noise is large relative to the target voice component, and is large when the target voice component is large relative to the noise.

When determining whether the input voice signal contains the target voice component, the ratio (Py/Pt) is greater than or equal to (or exceeds) a certain value if the target voice component is present. The minimum value of the ratio (Py/Pt) calculated when the target voice component is present may be set as the threshold (Rth). The threshold (Rth) may vary with the spacing between microphones in the microphone array, the product specifications or settings of the voice input unit 110 (for example, the acoustic sensitivity of the microphones), the noise environment, and so on. Because product specifications and settings differ by model, the threshold (Rth) may be set to an experimentally optimized value for each voice recognition apparatus 100.

The threshold (Rth) may be stored, at the time the voice recognition apparatus 100 is manufactured, in a predetermined storage space in the controller 240 or in a predetermined storage space (not shown) in the voice recognition preprocessor 130, and may be updated by the user or the manufacturer of the voice recognition apparatus 100.

Accordingly, the controller 240 can detect the voice section, that is, the section containing the target voice component, based on the calculated ratio (Py/Pt). As described above, the ratio (Py/Pt) may be calculated for each frame, so the controller 240 may detect the voice section frame by frame.
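The per-frame threshold decision can be sketched as follows. The function name, the use of the "greater than or equal to" variant of the comparison, and the `floor` that stands in for a zero second power are assumptions; the threshold value itself would be tuned experimentally as the text describes.

```python
from typing import List, Sequence


def detect_voice_frames(py_frames: Sequence[float],
                        pt_frames: Sequence[float],
                        r_th: float,
                        floor: float = 1e-12) -> List[bool]:
    """Flag each frame whose ratio Py/Pt is at or above the threshold."""
    # Clamp Pt so that a noise-free frame gives a very large ratio
    # instead of a division by zero.
    return [py / max(pt, floor) >= r_th
            for py, pt in zip(py_frames, pt_frames)]
```

A frame with Py > 0 and Pt = 0 is flagged as voice, and a frame with Py = 0 is not, matching the limiting cases discussed for the ratio.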

A voice signal containing a minimal target voice component, for example one corresponding to a one-syllable word, lasts at least 0.5 seconds. Therefore, for a voice signal containing the target voice component, the ratio (Py/Pt) will be at or above (or will exceed) the threshold (Rth) for at least several tens of consecutive frames.

In addition, a section in which the ratio (Py/Pt) falls below (or to or below) the threshold (Rth) may be determined to be the end point of the voice section corresponding to the target voice component. However, even when the ratio (Py/Pt) falls below the threshold (Rth), the controller may determine that the frame is not the end point of the voice section, taking into account the voice section detection results for the preceding and following frames.

In general, one frame may be 10 to 20 ms long, and a voice signal containing even a minimal target voice component, for example a voice signal containing the voice component of a one-syllable word, is at least 0.5 seconds long (roughly 25 to 50 frames at that frame length). In this case, for a voice signal containing the target voice component, the ratio Py/Pt is at or above (or exceeds) the threshold value Rth for at least several tens of consecutive frames. The period between one utterance and the next, during which no target voice component is input, also lasts at least 0.5 seconds. Therefore, when one voice section ends, the ratio Py/Pt stays below (or at or below) the threshold value Rth for at least several tens of consecutive frames. Accordingly, if the ratio Py/Pt is at or above the threshold value Rth in the preceding and following frames and falls below the threshold value Rth only in the frame in question, that frame is not determined to be the end point of the voice section.
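The rule above, that an isolated sub-threshold frame surrounded by voiced frames is not an end point, can be sketched as a gap-filling pass over per-frame decisions. The function name and the `max_gap` parameter are hypothetical; the text only describes the single-frame case, which corresponds to `max_gap=1`.

```python
def fill_short_gaps(flags, max_gap=1):
    """Treat short sub-threshold runs as voiced when both neighbours are voiced.

    flags: per-frame booleans (True when Py/Pt is at or above Rth).
    max_gap: longest run of False frames that is still bridged (assumed value).
    """
    out = list(flags)
    n = len(out)
    i = 0
    while i < n:
        if out[i]:
            i += 1
            continue
        j = i
        while j < n and not out[j]:
            j += 1
        # The run of False frames is [i, j). It is bridged only if a voiced
        # frame exists on both sides and the run is short enough.
        if i > 0 and j < n and (j - i) <= max_gap:
            for k in range(i, j):
                out[k] = True
        i = j
    return out
```

A trailing run of sub-threshold frames is never bridged, so a genuine end of utterance followed by silence still ends the voice section.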

In addition, the control unit 240 may assign each frame a score indicating the likelihood that the target voice component is present in that frame, according to the magnitude of the ratio Py/Pt. The scores corresponding to the ratio Py/Pt may be stored in the form of a mapping table in a predetermined storage space within the control unit 240 or in a predetermined storage space (not shown) within the voice recognition preprocessing unit 130, and may be updated by the user or manufacturer of the voice recognition apparatus 100.

In this case, a threshold score value at which the target voice component can be judged to be present is set through experimental optimization, and the score of a frame is compared with the threshold score value to determine whether the frame belongs to a voice section containing the target voice component. The threshold score value may be set through experimental optimization in the same way as the threshold value Rth described above, and the control unit 240 may determine that a frame belongs to a voice section containing the target voice component if the score of that frame is at or above (or exceeds) the threshold score value.
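A minimal sketch of such a mapping table and threshold score comparison. The table contents (`RATIO_BOUNDS`, `SCORES`) and the default `score_th` are made-up placeholder values, since the actual values are described only as experimentally optimized.

```python
from bisect import bisect_right

# Hypothetical mapping table: a ratio in [RATIO_BOUNDS[k-1], RATIO_BOUNDS[k])
# receives SCORES[k]; ratios below the first bound receive SCORES[0].
RATIO_BOUNDS = [0.5, 1.0, 2.0, 4.0]
SCORES = [0, 1, 2, 3, 4]  # one more entry than RATIO_BOUNDS


def score_for_ratio(ratio):
    """Look up the score assigned to a frame's Py/Pt ratio."""
    return SCORES[bisect_right(RATIO_BOUNDS, ratio)]


def is_voice_frame(ratio, score_th=2):
    """True when the frame's score is at or above the threshold score."""
    return score_for_ratio(ratio) >= score_th
```

Keeping the table as data rather than code matches the statement that it can be updated after manufacture.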

Alternatively, the control unit 240 may determine the end point of the voice section containing the target voice component by comparing the score of the current frame with the scores of the previous frames. Specifically, if the score of the current frame drops abruptly relative to the scores of the previous frames and the reduced score persists in the subsequent frames, the current frame may be determined to be the end point of the voice section.
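One way to express "drops abruptly and persists" is sketched below. The window lengths `history` and `hold` and the minimum `drop` are hypothetical tuning parameters, and comparing against the average of the previous frames is an assumption made for illustration.

```python
def find_end_point(scores, history=3, hold=3, drop=2):
    """Return the first frame index where the score falls by at least `drop`
    relative to the average of the previous `history` frames and stays that
    low for the next `hold` frames; return None if no such frame exists.
    """
    for i in range(history, len(scores) - hold):
        prev_avg = sum(scores[i - history:i]) / history
        window = scores[i:i + hold + 1]  # current frame plus `hold` later frames
        if all(prev_avg - s >= drop for s in window):
            return i
    return None
```

Because the decision for frame `i` looks `hold` frames ahead, this sketch reports an end point with a delay of `hold` frames.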

As described above, determining whether a frame belongs to a voice section by using the ratio between the powers, Py/Pt, requires less computation than determining it from a correlation computed for each voice signal frame or from the coherence between voice signals. Accordingly, the voice recognition apparatus 100 can detect voice sections faster because of the reduced amount of computation.

In addition, whereas determining the voice section from a correlation is vulnerable to noise, the present application considers both the power Py of the signal in which the target voice component is enhanced and the second power Pt, which includes noise and the like, so that the voice section can be detected accurately regardless of whether the environment is clean and free of noise or heavily noisy.

That is, the voice recognition apparatus 100 can extract the voice section quickly and accurately by calculating the power of the voice signal to determine whether a frame belongs to a voice section. In addition, by accurately extracting the voice section in the voice recognition preprocessing unit 130, which is the stage preceding voice recognition, the voice recognition rate in the subsequent process can be increased.

In addition, when the end point of the voice section is detected, the control unit 240 transmits information on the voice section to the voice recognition unit 150 so that the voice recognition unit 150 can recognize the voice signal contained in the detected voice section.

In addition, when the end point of the voice section is detected, the control unit 240 may control the adaptive filter unit 245 so that the adaptive filter coefficients applied to the voice section and to the sections other than the voice section (that is, sections of the voice signal that do not contain the target voice component) are selectively updated. Specifically, when the end point of the voice section is detected, the control unit 240 may control the adaptive filter unit 245 so that only the adaptive filter coefficients applied to the sections other than the voice section are updated and the adaptive filter coefficients applied to the voice section are not updated.

As described above, because the control unit 240 of the present application does not update the adaptive filter coefficients in the voice section, the voice signal containing the target voice component can be prevented from being distorted. In addition, because the adaptive filter coefficients are updated in the sections other than the voice section, adaptive filtering is performed more effectively and the output power of the adaptive filter unit 245 can be minimized.
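The selective update can be sketched as an adaptive filter step whose coefficient update is gated by the voice section decision. The text does not specify the adaptation rule; the sketch below swaps in a normalized LMS (NLMS) update purely for illustration, and the names `nlms_step`, `x_buf`, `d`, `mu`, and `eps` are assumptions.

```python
def nlms_step(w, x_buf, d, in_voice_section, mu=0.1, eps=1e-8):
    """One adaptive filtering step with a gated coefficient update.

    w: current filter coefficients
    x_buf: most recent reference samples, same length as w
    d: current sample of the signal the filter output is subtracted from
    in_voice_section: when True, filtering is applied but w is frozen
    Returns (error_sample, coefficients_after_this_step).
    """
    y_hat = sum(wi * xi for wi, xi in zip(w, x_buf))
    e = d - y_hat
    if not in_voice_section:
        # NLMS update (illustrative choice): step size normalized by input energy.
        norm = sum(xi * xi for xi in x_buf) + eps
        w = [wi + mu * e * xi / norm for wi, xi in zip(w, x_buf)]
    return e, w
```

Freezing the coefficients while the target voice is present keeps the filter from adapting toward, and thereby cancelling, the target voice component.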

In addition, after detection of the voice section is completed, the control unit 240 may output a control signal to the voice recognition unit 150 so that noise removal is performed differently for the voice section and for the sections other than the voice section. For example, the control unit 240 may output to the voice recognition unit 150 a control signal requesting that noise removal not be performed on the voice section and that noise removal be performed on the sections other than the voice section.

Accordingly, the voice recognition apparatus 100 can increase the voice recognition rate by performing selective noise removal with respect to the extracted voice section.

FIG. 3 is a flowchart illustrating a voice recognition method according to an embodiment of the present invention. Hereinafter, the voice recognition method according to an embodiment of the present invention is described with reference to the components of the voice recognition apparatus 100 of FIG. 2.

In the voice recognition method according to an embodiment of the present invention, at least one voice signal is received (step 310). Step 310 may be performed by the voice input unit 110 of the voice recognition apparatus 100.

The target voice component, which is the recognition target, is extracted from the received voice signal, and a target voice signal is output (step 320). Step 320 may be performed by the target signal extracting unit 235. In addition, when voice signals are received through the microphone array 210, which receives voice signals through a plurality of microphones, the method may further include a step (not shown) of synchronizing the plurality of voice signals, and this step may be performed by the time axis alignment unit 231.

A first power Py, which is the power of the target voice signal, and a second power Pt, which is the power of the voice signal from which the target voice component has not been extracted, are calculated (step 330). Step 330 may be performed by the control unit 240.

Then, the ratio between the second power Pt and the first power Py is calculated (step 340). The ratio may be the ratio of the first power Py to the second power Pt, that is, the value Py/Pt.
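As a compact restatement of steps 330 and 340, the quantities can be written as follows. The mean-square definition of frame power over N samples is an assumption made here for illustration (the symbols y(n), t(n), N, and R are not taken from the text above); the logarithmic form is one optional way of expressing the same ratio.

```latex
P_y = \frac{1}{N}\sum_{n=0}^{N-1} y(n)^2, \qquad
P_t = \frac{1}{N}\sum_{n=0}^{N-1} t(n)^2, \qquad
R = \frac{P_y}{P_t}, \qquad
\log R = \log P_y - \log P_t
```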

Based on the ratio Py/Pt, a voice section, which is a section containing the target voice component, is detected (step 350). Steps 340 and 350 may be performed by the control unit 240.

When detection of the voice section is completed, the voice signal present in the voice section is recognized (step 360). Step 360 may be performed by the voice recognition unit 150.

FIG. 4 is a flowchart illustrating the voice recognition method of FIG. 3 in more detail. Step 350 in FIG. 3 may include steps 410, 420, 430, and 440 shown in FIG. 4. In addition, the voice recognition method according to an embodiment of the present invention may further include step 460. Hereinafter, only steps 410, 420, 430, and 440, which are included in step 350 of FIG. 3, and step 460 are described; the remaining steps are the same as in FIG. 3, and their detailed description is omitted.

Referring to FIG. 4, it is determined whether the power ratio Py/Pt obtained in step 340 described above is at or above (or exceeds) a predetermined threshold value Rth (step 410). The determination of step 410 may be performed by the control unit 240. The predetermined threshold value Rth is a value set through experimental optimization.

If the ratio Py/Pt is determined to be at or above (or to exceed) the predetermined threshold value Rth, the frame is determined to belong to a voice section, that is, a section of the voice signal containing the target voice component (step 420).

If the ratio Py/Pt is not at or above (or does not exceed) the predetermined threshold value Rth, it is determined whether the frame is the end point of the voice section (step 430). Specifically, the values of the ratio Py/Pt in the preceding frames and in the following frames are considered in determining whether the frame in which the ratio Py/Pt does not reach the predetermined threshold value Rth is the end point of the voice section. As in the example described above, if the ratio Py/Pt is at or above the threshold value Rth in the preceding and following frames and falls below the threshold value Rth only in the frame in question, that frame is not determined to be the end point of the voice section.

When the end point of the voice section is detected, detection of the entire voice section is completed (step 440).
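Taken together, steps 410 to 440 turn per-frame decisions into voice sections with a start frame and an end frame. The sketch below assumes the per-frame decisions have already been made (step 410/420) and that any isolated dips have already been resolved (step 430); the function name and the inclusive `(start, end)` tuple format are illustrative choices.

```python
def voice_sections(flags):
    """Collapse per-frame voice decisions into (start, end) frame index pairs.

    flags: per-frame booleans, True when the frame belongs to a voice section.
    The end index is inclusive and marks the last voiced frame of a section.
    """
    sections = []
    start = None
    for i, voiced in enumerate(flags):
        if voiced and start is None:
            start = i                         # first voiced frame of a section
        elif not voiced and start is not None:
            sections.append((start, i - 1))   # end point reached (step 440)
            start = None
    if start is not None:
        # Input ended while still inside a voice section.
        sections.append((start, len(flags) - 1))
    return sections
```

Each returned pair corresponds to the voice section information that would be passed on for recognition once its end point has been detected.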

The coefficients of the adaptive filter applied to the detected voice section and to the sections other than the voice section are selectively updated (step 460).

Since the voice recognition method according to an embodiment of the present invention described with reference to FIGS. 3 and 4 has the same technical idea and operational configuration as the voice recognition apparatus according to an embodiment of the present invention described above with reference to FIGS. 1 and 2, detailed and redundant descriptions are omitted.

Although the above description has focused on the novel features of the present invention as applied to various embodiments, those skilled in the art will understand that various deletions, substitutions, and changes in the form and details of the apparatus and method described above are possible without departing from the scope of the present invention. Accordingly, the scope of the present invention is defined by the appended claims rather than by the above description. All modifications within the range of equivalents of the claims are embraced within the scope of the present invention.

100: voice recognition apparatus
110: voice input unit
130: voice recognition preprocessing unit
150: voice recognition unit
210: microphone array
231: time axis alignment unit
235: target signal extracting unit
240: control unit
245: adaptive filter unit
250: target signal blocking unit
255: signal subtraction unit

Claims

A voice recognition apparatus comprising:
a voice input unit configured to receive at least one voice signal;
a target signal extracting unit configured to extract a target voice component, which is a recognition target, from the voice signal and to output a target voice signal;
a control unit configured to calculate a first power, which is the power of the target voice signal, and a second power, which is the power of the voice signal that has not passed through the target signal extracting unit, to calculate a ratio between the second power and the first power, and to detect, based on the ratio, a voice section that is a section including the target voice component; and
a voice recognition unit configured to recognize the voice signal present in the voice section.

The apparatus of claim 1, wherein the target signal extracting unit
beamforms the target voice component and outputs the beamformed target voice component as the target voice signal.

The apparatus of claim 1, wherein the control unit
calculates the ratio of the first power to the second power (first power / second power) and determines that a section is the voice section when the ratio is equal to or greater than, or exceeds, a predetermined threshold value.

The apparatus of claim 3, wherein the control unit
calculates the first power and the second power in units of at least one frame and determines the voice section in units of the at least one frame.

The apparatus of claim 3, wherein the control unit
determines the end point of the voice section when the ratio of the first power to the second power (first power / second power) is less than, or less than or equal to, the predetermined threshold value.

The apparatus of claim 3, wherein the ratio of the first power to the second power is calculated on a logarithmic scale.

The apparatus of claim 1, wherein the voice input unit
comprises a microphone array that includes at least one microphone and receives the at least one voice signal through the at least one microphone.

The apparatus of claim 7, further comprising:
a time axis alignment unit configured to synchronize and output the at least one voice signal;
a target signal blocking unit configured to block the target voice component from the voice signal output from the time axis alignment unit; and
an adaptive filter unit configured to update coefficients of an adaptive filter so as to minimize the power of the signal output from the target signal blocking unit, and to perform adaptive filtering by applying the coefficients of the adaptive filter to the output signal of the target signal blocking unit.

The apparatus of claim 8, further comprising
a signal subtraction unit configured to subtract the output signal of the adaptive filter unit from the target voice signal and to output the result.

The apparatus of claim 8, wherein the control unit
calculates, as the second power, the power of one of the output signal of the time axis alignment unit, the output signal of the target signal blocking unit, and the output signal of the adaptive filter unit.

The apparatus of claim 1, wherein the control unit,
when an end point of the voice section is detected, performs control so that the adaptive filter coefficients applied to the sections other than the voice section are updated.

The apparatus of claim 1, wherein the control unit,
when detection of the voice section is completed, outputs to the voice recognition unit a control signal requesting that noise removal be performed differently for the voice section and for the sections other than the voice section.

A method of recognizing a voice signal input to a voice recognition apparatus, the method comprising:
receiving at least one voice signal;
extracting a target voice component, which is a recognition target, from the voice signal and outputting a target voice signal;
calculating a first power, which is the power of the target voice signal, and a second power, which is the power of the voice signal;
calculating a ratio between the second power and the first power and detecting, based on the ratio, a voice section that is a section including the target voice component; and
recognizing the voice signal present in the voice section.

The method of claim 13, wherein the outputting of the target voice signal comprises
beamforming the target voice component and outputting the beamformed target voice component as the target voice signal.

The method of claim 13, wherein the detecting of the voice section comprises:
calculating the ratio of the first power to the second power (first power / second power); and
determining that a section is the voice section when the ratio is equal to or greater than, or exceeds, a predetermined threshold value.

The method of claim 15, wherein the detecting of the voice section comprises:
determining, in units of at least one frame, whether a frame belongs to the voice section; and
determining the end point of the voice section when the ratio of the first power to the second power (first power / second power) is less than, or less than or equal to, the predetermined threshold value.