KR100855592B1

KR100855592B1 - Apparatus and method for speech recognition robust to speaker distance characteristics

Info

Publication number: KR100855592B1
Application number: KR1020070003187A
Authority: KR
Inventors: 김경선
Original assignee: (주)에이치씨아이랩
Priority date: 2007-01-11
Filing date: 2007-01-11
Publication date: 2008-09-01
Also published as: KR20080066129A

Abstract

The present invention provides a speech recognition apparatus, and a corresponding method, robust to speaker distance characteristics. The apparatus comprises: a distance-specific voice recording unit that simultaneously receives and records the voice input through a near-field voice recording unit and a far-field voice recording unit; an external noise removing unit that receives the distance-specific voices output from the distance-specific voice recording unit, estimates the external noise, and removes it from the recorded voice; an input voice selecting unit that receives the recorded voices from which the external noise has been removed and determines and selects which of the input voices, each reflecting far-field or near-field distance characteristics, can yield higher speech recognition performance; and a speech recognition unit that receives the voice selected by the input voice selecting unit and performs speech recognition. With this configuration, far-field and near-field speech recognition performance are both high, and the apparatus is robust to external noise.

Speaker, distance characteristics, speech recognition, voice recording, external noise

Description

Apparatus and method for speech recognition robust to speaker distance characteristics

FIG. 1 is a block diagram of a speech recognition apparatus robust to speaker distance characteristics according to an embodiment of the present invention.

FIG. 2 is a detailed block diagram of the distance-specific voice recording unit of FIG. 1.

FIG. 3 is a detailed block diagram of the external noise removing unit of FIG. 1.

FIG. 4 is a detailed block diagram of the input voice selecting unit of FIG. 1.

FIG. 5 is a detailed block diagram of the speech recognition unit of FIG. 1.

FIG. 6 is a flowchart of a speech recognition method robust to speaker distance characteristics according to an embodiment of the present invention.

FIG. 7 is a detailed flowchart of the distance-specific voice recording operation of FIG. 6.

FIG. 8 is a detailed flowchart of the external noise removing operation of FIG. 6.

FIG. 9 is a detailed flowchart of the input voice selecting operation of FIG. 6.

FIG. 10 is a detailed flowchart of the speech recognition operation of FIG. 6.

* Explanation of reference numerals for the main parts of the drawings *

100: distance-specific voice recording unit
101: recorded-voice level classification unit
102: sample-speaker input gain calculation unit
103: actual-user input gain calculation unit
104: input gain designation unit
110: near-field voice recording unit
120: far-field voice recording unit
200: external noise removing unit
201: time-axis feature extraction unit
202: time-axis voice detection unit
203: frequency conversion unit
204: voice frequency feature extraction unit
205: noise frequency feature extraction unit
206: frequency noise removing unit
207: frequency-axis voice detection unit
208: voice detection result selection unit
210: first external noise removing unit
220: second external noise removing unit
300: input voice selecting unit
301: first phonetic clarity confidence value selection unit
302: first sound quality distortion value selection unit
303: first SNR value selection unit
304: second phonetic clarity confidence value selection unit
305: second sound quality distortion value selection unit
306: second SNR value selection unit
307: voice selection unit
400: speech recognition unit
401: feature extraction unit
402: decoder
403: post-processing unit

The present invention relates to speech recognition, and more particularly to a speech recognition apparatus and method robust to speaker distance characteristics, designed so that far-field and near-field speech recognition performance are both high and the system is robust to external noise.

In general, speech recognition means identifying speech by machine. The most commonly attempted approach is to frequency-analyze the speech wave and to extract and separate the frequency ranges that characterize vowels, or features equivalent to them. A time-continuous record of the frequency analysis results is called a sonagram; with sufficient training, a person can recognize speech visually from the patterns recorded in it. However, errors can arise when there are many speakers or a large number of words.

Conventional speech recognition devices have generally recognized only speech uttered at close range because of external noise, but recently microphone array technology has been applied so that speech recognition can also be used at a distance. That is, the prior art includes techniques that enable far-field speech recognition using a far-field microphone, and microphone array techniques that use a plurality of microphones.

However, far-field speech recognition technology suffers from poor near-field recognition performance and from vulnerability to external noise. Microphone array technology has the advantage of being robust to noise, but it too is weak at near-field speech recognition, requires expensive hardware, and poses installation problems because the system configuration must be complex and precise.

Thus, the distance at which a conventional speech recognition apparatus can recognize speech is strongly affected by the voice input level.

In addition, with a far-field recognizer, near-field input is so loud that the sound can be distorted and the recognition rate drops markedly; with a near-field recognizer, far-field input is so quiet that sounds cannot be distinguished and recognition becomes impossible.

Accordingly, the present invention has been proposed to solve the above problems of the prior art. Its object is to provide a speech recognition apparatus and method robust to speaker distance characteristics, in which far-field and near-field speech recognition performance are both high and which is robust to external noise.

To achieve the above object, a speech recognition apparatus robust to speaker distance characteristics according to an embodiment of the present invention is technically characterized by comprising: a distance-specific voice recording unit that simultaneously receives and records the voice input through a near-field voice recording unit and a far-field voice recording unit; an external noise removing unit that receives the distance-specific voices output from the distance-specific voice recording unit, estimates the external noise, and removes it from the recorded voice; an input voice selecting unit that receives the recorded voices from which the external noise has been removed and determines and selects which of the input voices, each reflecting far-field or near-field distance characteristics, can yield higher speech recognition performance; and a speech recognition unit that receives the voice selected by the input voice selecting unit and performs speech recognition.

To achieve the above object, a speech recognition method robust to speaker distance characteristics according to an embodiment of the present invention is technically characterized by comprising: a first step in which the distance-specific voice recording unit simultaneously receives and records voice input at near and far distances; a second step in which, after the first step, the external noise removing unit estimates the external noise in the distance-specific voices and removes it from the recorded voice; a third step in which, after the second step, the input voice selecting unit receives the recorded voices from which the external noise has been removed and determines and selects which of the input voices, each reflecting far-field or near-field distance characteristics, can yield higher speech recognition performance; and a fourth step in which, after the third step, the speech recognition unit receives the selected voice and performs speech recognition.

Hereinafter, an embodiment according to the technical idea of the present invention, a speech recognition apparatus and method robust to speaker distance characteristics, is described with reference to the drawings.

As shown in FIG. 1, the apparatus comprises: a distance-specific voice recording unit (100) that simultaneously receives and records the voice input through a near-field voice recording unit (110) and a far-field voice recording unit (120); an external noise removing unit (200) that receives the distance-specific voices output from the distance-specific voice recording unit (100), estimates the external noise, and removes it from the recorded voice; an input voice selecting unit (300) that receives the recorded voices from which the external noise has been removed by the external noise removing unit (200) and determines and selects which of the input voices, each reflecting far-field or near-field distance characteristics, can yield higher speech recognition performance; and a speech recognition unit (400) that receives the voice selected by the input voice selecting unit (300) and performs speech recognition.

As shown in FIG. 2, the near-field voice recording unit (110) and the far-field voice recording unit (120) of the distance-specific voice recording unit (100) each comprise: a recorded-voice level classification unit (101) that receives the voices of sample speakers and classifies the recorded voices by level; a sample-speaker input gain calculation unit (102) that receives the output of the recorded-voice level classification unit (101), calculates the input gain for the sample speakers, and outputs input gain values for each distance-specific mode, thereby performing preset parameter setting; an actual-user input gain calculation unit (103) that receives the voice of the actual user, calculates the actual user's input gain, and sets the actual user's parameters; and an input gain designation unit (104) that receives the distance-specific mode input gain values from the sample-speaker input gain calculation unit (102) and the actual user's input gain value from the actual-user input gain calculation unit (103), designates the input gain, and outputs the distance-specific voice.

As shown in FIG. 3, the external noise removing unit (200) comprises: a time-axis feature extraction unit (201) that receives the near-field or far-field distance-specific voice output from the distance-specific voice recording unit (100) and extracts time-axis features; a time-axis voice detection unit (202) that receives the output of the time-axis feature extraction unit (201) and detects voice on the time axis; a frequency conversion unit (203) that receives the near-field or far-field distance-specific voice output from the distance-specific voice recording unit (100) and converts it to the frequency domain; a voice frequency feature extraction unit (204) that receives the output of the frequency conversion unit (203) and extracts the features of the voice frequencies; a noise frequency feature extraction unit (205) that receives the output of the frequency conversion unit (203) and extracts the features of the noise frequencies; a frequency noise removing unit (206) that receives the outputs of the voice frequency feature extraction unit (204) and the noise frequency feature extraction unit (205) and removes the noise characteristics on the frequency side; a frequency-axis voice detection unit (207) that receives the output of the frequency noise removing unit (206) and detects voice on the frequency axis; and a voice detection result selection unit (208) that receives the outputs of the time-axis voice detection unit (202) and the frequency-axis voice detection unit (207), selects a voice detection result that reflects the noise characteristics, and outputs the voice.
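The frequency-side noise removal performed by units 203 to 206 can be illustrated with magnitude spectral subtraction. This is only a sketch under assumptions: the patent does not name a subtraction rule, so spectral subtraction is a swapped-in, commonly used technique, and the function names, the naive DFT, and the `floor` parameter are all hypothetical.

```python
import cmath
import math


def dft(frame):
    """Naive discrete Fourier transform (adequate for short illustrative frames)."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Inverse of dft(); returns the real part of each reconstructed sample."""
    n = len(spectrum)
    return [
        (sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]


def spectral_subtract(frame, noise_magnitude, floor=0.0):
    """Subtract an estimated noise magnitude from each frequency bin.

    The phase of the noisy signal is kept, and each magnitude is clamped so
    that it never falls below floor * (original magnitude).
    """
    cleaned = []
    for value, noise in zip(dft(frame), noise_magnitude):
        magnitude = max(abs(value) - noise, floor * abs(value))
        cleaned.append(cmath.rect(magnitude, cmath.phase(value)))
    return idft(cleaned)
```

In this sketch `noise_magnitude` stands in for the output of the noise frequency feature extraction unit (205); how that estimate is obtained is left open, as it is in the text.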

As shown in FIG. 4, the input voice selecting unit (300) comprises: a first phonetic clarity confidence value selection unit (301) that receives the recorded voice from which the external noise has been removed by the external noise removing unit (200) and selects a phonetic clarity confidence value for the near-field voice output, where the phonetic clarity of an utterance is calculated from the level of the recorded volume and the clarity of the pitch information; a first sound quality distortion value selection unit (302) that receives the output of the first phonetic clarity confidence value selection unit (301) and selects a sound quality distortion value according to the degree to which discontinuous frequency characteristics arise within adjacent voice segments when the frequency characteristics are analyzed; a first SNR value selection unit (303) that receives the output of the first sound quality distortion value selection unit (302) and selects an SNR (signal-to-noise ratio) value, which indicates the relative magnitude of the noise; a second phonetic clarity confidence value selection unit (304) that receives the recorded voice from which the external noise has been removed by the external noise removing unit (200) and selects a phonetic clarity confidence value for the far-field voice output, calculated in the same way from the level of the recorded volume and the clarity of the pitch information; a second sound quality distortion value selection unit (305) that receives the output of the second phonetic clarity confidence value selection unit (304) and selects a sound quality distortion value according to the degree to which discontinuous frequency characteristics arise within adjacent voice segments when the frequency characteristics are analyzed; a second SNR value selection unit (306) that receives the output of the second sound quality distortion value selection unit (305) and selects an SNR value, which indicates the relative magnitude of the noise; and a voice selection unit (307) that receives the outputs of the first and second SNR value selection units (303, 306), selects the voice input with the higher recognition rate, and outputs the selected voice.
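The role of the SNR value in choosing between the near-field and far-field inputs can be sketched as follows. This is a simplified illustration, not the patent's selection rule: it assumes that speech and noise-only samples are already separated for each input, it ignores the phonetic clarity and distortion values, and all names are hypothetical.

```python
import math


def power(samples):
    """Mean squared amplitude of a list of samples."""
    return sum(s * s for s in samples) / len(samples)


def snr_db(speech, noise):
    """Signal-to-noise ratio in decibels from speech and noise-only samples."""
    speech_power = power(speech)
    noise_power = power(noise)
    if noise_power == 0:
        return math.inf
    if speech_power == 0:
        return -math.inf
    return 10 * math.log10(speech_power / noise_power)


def select_input(candidates):
    """Return the name of the candidate with the highest SNR.

    candidates maps a name (for example "near" or "far") to a
    (speech_samples, noise_samples) pair.
    """
    return max(candidates, key=lambda name: snr_db(*candidates[name]))
```

A fuller version would combine all three measures named in the text; how they are weighted is not stated there, so it is left out here.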

As shown in FIG. 5, the speech recognition unit (400) comprises: a feature extraction unit (401) that receives the voice selected by the input voice selecting unit (300) and extracts the features unique to each phonetic value; a decoder (402) that receives the output of the feature extraction unit (401) and performs decoding using the utterance grammar and the acoustic model; and a post-processing unit (403) that receives the output of the decoder (402), performs post-processing in consideration of linguistic characteristics and the clarity of the utterance timing, and outputs the recognition result.

As shown in FIG. 6, the method comprises: a first step (ST1) in which the distance-specific voice recording unit (100) simultaneously receives and records voice input at near and far distances; a second step (ST2) in which, after the first step, the external noise removing unit (200) estimates the external noise in the distance-specific voices and removes it from the recorded voice; a third step (ST3) in which, after the second step, the input voice selecting unit (300) receives the recorded voices from which the external noise has been removed and determines and selects which of the input voices, each reflecting far-field or near-field distance characteristics, can yield higher speech recognition performance; and a fourth step (ST4) in which, after the third step, the speech recognition unit (400) receives the selected voice and performs speech recognition.

As shown in FIG. 7, the first step comprises: step 11 (ST11), in which the distance-specific voice recording unit (100) receives the voices of sample speakers and classifies the recorded voices by level; step 12 (ST12), in which, after step 11, the input gain for the sample speakers is calculated and input gain values for each distance-specific mode are output, thereby performing preset parameter setting; step 13 (ST13), in which the voice of the actual user is received, the actual user's input gain is calculated, and the actual user's parameters are set; and step 14 (ST14), in which the distance-specific mode input gain values from step 12 and the actual user's input gain value from step 13 are received, the input gain is designated, and the distance-specific voice is output.

As shown in FIG. 8, the second step comprises: step 21 (ST21), in which the near-field or far-field distance-specific voice output from the distance-specific voice recording unit (100) is received and time-axis features are extracted; step 22 (ST22), in which, after step 21, voice is detected on the time axis; step 23 (ST23), in which the near-field or far-field distance-specific voice output from the distance-specific voice recording unit (100) is received and converted to the frequency domain; step 24 (ST24), in which, after step 23, the features of the voice frequencies and the noise frequencies are extracted; step 25 (ST25), in which, after step 24, the noise characteristics on the frequency side are removed; step 26 (ST26), in which, after step 25, voice is detected on the frequency axis; and step 27 (ST27), in which, after steps 22 and 26, a voice detection result that reflects the noise characteristics is selected and the voice is output.
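Steps 21 and 22 (time-axis feature extraction and voice detection) can be illustrated with a frame-energy detector. This is a minimal sketch under assumptions: the patent does not say which time-axis feature is used, so short-time energy is a stand-in, and the frame length, the use of the leading frames as the noise estimate, and the threshold factor are all hypothetical.

```python
def frame_energies(samples, frame_len):
    """Mean squared amplitude of each complete, non-overlapping frame."""
    return [
        sum(s * s for s in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def detect_speech(samples, frame_len=160, noise_frames=5, factor=4.0):
    """Flag frames whose energy exceeds factor times the estimated noise energy.

    The noise energy is estimated from the first noise_frames frames, which
    are assumed (hypothetically) to contain no speech.
    """
    energies = frame_energies(samples, frame_len)
    if not energies:
        return []
    lead = energies[:noise_frames]
    noise = sum(lead) / len(lead)
    threshold = factor * max(noise, 1e-9)
    return [energy > threshold for energy in energies]
```

The per-frame flags play the role of the time-axis detection result that step 27 later compares with the frequency-axis result.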

As shown in FIG. 9, the third step comprises: step 31 (ST31), in which the recorded voice from which the external noise has been removed by the external noise removing unit (200) is received and a phonetic clarity confidence value is selected for the near-field voice output, where the phonetic clarity of an utterance is calculated from the level of the recorded volume and the clarity of the pitch information; step 32 (ST32), in which, after step 31, a sound quality distortion value is selected according to the degree to which discontinuous frequency characteristics arise within adjacent voice segments when the frequency characteristics are analyzed; step 33 (ST33), in which, after step 32, an SNR value, which indicates the relative magnitude of the noise, is selected; step 34 (ST34), in which the recorded voice from which the external noise has been removed by the external noise removing unit (200) is received and a phonetic clarity confidence value is selected for the far-field voice output, calculated in the same way from the level of the recorded volume and the clarity of the pitch information; step 35 (ST35), in which, after step 34, a sound quality distortion value is selected according to the degree to which discontinuous frequency characteristics arise within adjacent voice segments when the frequency characteristics are analyzed; step 36 (ST36), in which, after step 35, an SNR value, which indicates the relative magnitude of the noise, is selected; and step 37 (ST37), in which, after steps 33 and 36, the voice input with the higher recognition rate is selected and the selected voice is output.

As shown in FIG. 10, the fourth step comprises: step 41 (ST41), in which the voice selected by the input voice selecting unit (300) is received and the features unique to each phonetic value are extracted; step 42 (ST42), in which, after step 41, decoding is performed using the utterance grammar and the acoustic model; and step 43 (ST43), in which, after step 42, post-processing is performed in consideration of linguistic characteristics and the clarity of the utterance timing, and the recognition result is output.

A preferred embodiment of the speech recognition apparatus and method robust to speaker distance characteristics according to the present invention, configured as above, is described in detail below with reference to the accompanying drawings. In the following description, detailed explanations of related well-known functions or configurations are omitted where they would unnecessarily obscure the gist of the present invention. The terms used below are defined in consideration of their functions in the present invention and may vary according to the intention of a user or operator, or according to precedent; the meaning of each term should therefore be interpreted on the basis of the content of this specification as a whole.

First, the present invention is intended to make far-field and near-field speech recognition performance both high and to be robust to external noise. To this end, the present invention ensures high speech recognition performance for the user regardless of distance.

Accordingly, to address the drop in recognition rate with distance, the present invention connects two or more microphones and performs a plurality of voice input stages, and it automatically selects, from among the plurality of voice inputs, the one that can secure stable recognition performance.

For example, when home appliances are controlled with the built-in speech recognizer of a wall pad mounted on a wall in an ordinary home, the invention handles both the case where a voice command is given from a living room sofa about 3 meters away and the case where it is given from a distance close enough to view the LCD (liquid crystal display), about 0.5 meters, so that the recognition rate does not degrade in either case.

FIG. 1 shows the overall configuration of the present invention, and FIG. 6 shows its operation flow.

In the present invention, voice recording is performed in two separate groups. That is, voice input is received simultaneously by the near-field voice recording unit (110) and the far-field voice recording unit (120), and then passes through the external noise removing unit (200), which automatically estimates the noise portion of the input voice and removes it from the recorded voice.

Next, the input voice selecting unit (300) determines which of the input voices, each reflecting its own distance characteristics, can yield higher speech recognition performance, and the speech recognition unit (400) then performs speech recognition and outputs the recognition result.
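The overall flow from the two recordings to a recognition result can be sketched as a small pipeline. This is a hypothetical illustration only: `remove_noise`, `score`, and `decode` are placeholder callables standing in for units 200, 300, and 400, and the rule "keep the input with the highest score" is an assumption about how the selection is made.

```python
def recognize(near_audio, far_audio, remove_noise, score, decode):
    """Denoise both recordings, keep the higher-scoring one, and decode it.

    remove_noise, score and decode are caller-supplied callables; they are
    placeholders for the noise removing, selecting and recognition units.
    """
    cleaned = {
        "near": remove_noise(near_audio),
        "far": remove_noise(far_audio),
    }
    chosen = max(cleaned, key=lambda name: score(cleaned[name]))
    return chosen, decode(cleaned[chosen])
```

Passing the stages in as functions keeps the sketch independent of any particular noise estimator or recognizer, neither of which the text specifies.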

The main application of this distance-robust speech recognition is home appliance control devices for home networks. For example, the distance from a living room sofa to the wall pad is around 3 meters, and the far-field voice recording device handles voice input at this distance. Ordinary users, meanwhile, want to experience various information services directly near the wall pad, so voice input from a close distance of around 0.5 meters, from which the LCD screen can be viewed, is handled by the near-field voice recording device. With a distance difference of 2 meters or more, if the recorded voice is not handled separately for each distance, the sound may be recorded too loudly and become distorted, or too quietly and become phonetically unclear, so the usefulness of speech recognition cannot be guaranteed.

FIG. 2 is a detailed block diagram of the distance-specific voice recording unit (100), and FIG. 7 shows its operation flow.

When a voice is recorded at the designated distance while speaking in a normal, natural voice, the recognition rate can reach optimal performance if the amplitude of the digitized recording is roughly 5,000 to 20,000 at 16-bit resolution. It is therefore necessary to adjust the input gain so that each recording made by the distance-specific voice recording unit (100) has an input amplitude between 5,000 and 20,000.
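As a minimal sketch of this adjustment, the following Python computes the gain needed to bring one recording into the 5,000 to 20,000 range. It assumes that "amplitude" means the peak absolute sample value, which the text does not specify; the function names are illustrative only.

```python
from typing import List, Sequence

TARGET_LOW = 5_000    # lower bound of the range stated in the text
TARGET_HIGH = 20_000  # upper bound of the range stated in the text
INT16_MAX = 32_767    # largest positive 16-bit sample value


def peak_level(samples: Sequence[int]) -> int:
    """Largest absolute sample value (assumed meaning of 'amplitude')."""
    return max((abs(s) for s in samples), default=0)


def gain_into_range(peak: int) -> float:
    """Smallest gain change that moves `peak` into the target range."""
    if peak <= 0:
        raise ValueError("silent recording: no gain can be derived")
    if peak < TARGET_LOW:
        return TARGET_LOW / peak
    if peak > TARGET_HIGH:
        return TARGET_HIGH / peak
    return 1.0


def apply_gain(samples: Sequence[int], gain: float) -> List[int]:
    """Scale the samples and saturate them to the 16-bit range."""
    return [max(-INT16_MAX - 1, min(INT16_MAX, round(s * gain)))
            for s in samples]
```

For example, a recording whose peak is 2,500 gets a gain of 2.0, while a recording already peaking at 10,000 is left unchanged.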

First, in (a) preliminary parameter setting, the input gain parameters are set in advance, before the final voice input device is completed, according to the statistical characteristics of utterances from many speakers. The recorded voice level classification unit (101) receives three recordings (a normal utterance, a loud utterance, and a quiet utterance) at the designated distance from a plurality of sample speakers chosen to cover men, women, the elderly, and children. It classifies all of the recorded voices by level and determines the minimum and maximum recording levels. Using these values, the sample speaker input gain calculation unit (102) calculates gain values such that all recorded inputs fall within roughly 5,000 to 20,000. These values are stored as parameter values for three modes (high/medium/low) per distance.

The parameter values obtained in advance from the sample speakers are also applied to real users, as in (b): one of the three per-distance mode parameters is set so that the input gain designation unit (104) can specify an appropriate input gain value.
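Steps (a) and (b) can be illustrated with the following sketch. The text does not state how the three mode gains are derived or how a mode is chosen for a real user, so everything here is an assumption: the gains are taken from the median peak of each utterance style, the aiming point is 10,000 (the geometric mean of 5,000 and 20,000), and the names `preset_gains` and `choose_mode` are hypothetical.

```python
from statistics import median
from typing import Dict, Sequence

TARGET_CENTER = 10_000  # geometric mean of 5,000 and 20,000 (assumed aim)


def preset_gains(peaks_by_style: Dict[str, Sequence[int]]) -> Dict[str, float]:
    """Step (a): derive one stored gain per mode from sample-speaker peaks.

    Keys are utterance styles such as "quiet", "normal", "loud"; the
    quiet group ends up with the largest gain.
    """
    return {style: TARGET_CENTER / median(peaks)
            for style, peaks in peaks_by_style.items()}


def choose_mode(user_peak: int, gains: Dict[str, float]) -> str:
    """Step (b): pick the stored mode that best centers a real user's peak.

    `user_peak` must be positive (a non-silent trial utterance).
    """
    def miss(style: str) -> float:
        scaled = user_peak * gains[style]
        # Ratio >= 1 measuring how far the scaled peak is from the center.
        return max(scaled / TARGET_CENTER, TARGET_CENTER / scaled)
    return min(gains, key=miss)
```

With sample peaks of about 1,250, 5,000 and 20,000 for the quiet, normal and loud styles, the stored gains would be 8.0, 2.0 and 0.5, and a real user whose trial utterance peaks at 6,000 would be assigned the "normal" mode.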

FIGS. 3 and 7 show the detailed configuration and operation of the external noise removing unit (200).

An external noise removing device is therefore required for each of the near-field and far-field inputs, but the two devices have the same structure.

Even assuming that voice is captured for each distance with its phonetic values faithfully preserved, the probability of misrecognition or rejection becomes very high if a large amount of noise enters the recording. Removing external noise as far as possible and extracting the speech portion is therefore an important factor in improving recognition performance.

External noise removal consists of the frequency noise removing unit (206), which removes noise characteristics in the frequency domain, together with the time-axis voice detection unit (202) and the frequency-axis voice detection unit (207), which detect the speech portion on the time axis and on the frequency axis in parallel. From the speech segments detected in parallel, the voice detection result selection unit (208) selects the final voice detection result, taking the noise characteristics into account.
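The structure above can be sketched as follows. The text does not name the algorithms used, so this example substitutes common textbook techniques and should be read as an assumption throughout: short-time energy for time-axis detection, magnitude spectral subtraction for frequency noise removal, a noise estimate taken from leading frames assumed to be speech-free, and a simple rule that prefers the frequency-axis result when the noise is strong. All names and thresholds are hypothetical.

```python
import cmath
import math
from typing import List, Sequence


def frames(x: Sequence[float], size: int) -> List[List[float]]:
    """Split into non-overlapping frames, dropping a trailing partial frame."""
    return [list(x[i:i + size]) for i in range(0, len(x) - size + 1, size)]


def energy(frame: Sequence[float]) -> float:
    """Mean squared amplitude of one frame."""
    return sum(v * v for v in frame) / len(frame)


def magnitudes(frame: Sequence[float]) -> List[float]:
    """Magnitudes of a naive DFT for bins 0..N/2 (adequate for short frames)."""
    n = len(frame)
    return [abs(sum(v * cmath.exp(-2j * math.pi * k * t / n)
                    for t, v in enumerate(frame)))
            for k in range(n // 2 + 1)]


def detect(x: Sequence[float], size: int = 32, noise_frames: int = 2,
           ratio: float = 4.0, noisy_energy: float = 1e4) -> List[bool]:
    """Return one speech/non-speech flag per frame.

    Assumes the recording holds at least `noise_frames` full frames and
    that those leading frames contain noise only.
    """
    fs = frames(x, size)
    lead = fs[:noise_frames]
    # Noise characteristics estimated from the leading frames.
    noise_energy = sum(energy(f) for f in lead) / len(lead)
    noise_mag = [sum(col) / len(lead)
                 for col in zip(*(magnitudes(f) for f in lead))]
    noise_power = sum(m * m for m in noise_mag)

    # Time-axis detection: frame energy well above the noise energy.
    time_flags = [energy(f) > ratio * noise_energy for f in fs]

    # Frequency noise removal (spectral subtraction, floored at zero),
    # then frequency-axis detection on the residual spectrum.
    freq_flags = []
    for f in fs:
        residual = [max(m - n, 0.0) for m, n in zip(magnitudes(f), noise_mag)]
        freq_flags.append(sum(r * r for r in residual) > ratio * noise_power)

    # Result selection: trust the frequency axis when noise is strong.
    return freq_flags if noise_energy > noisy_energy else time_flags
```

The naive DFT keeps the example free of third-party dependencies; a real implementation would normally use an FFT and overlapping, windowed frames.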

FIGS. 4 and 9 show the detailed configuration and operation of the input voice selection unit (300), which selects the input voice.

The input voice selection unit (300) accordingly selects, from the two detected speech segments, the voice that yields higher recognition performance.

Because an utterance with clear phonetic values is determined from the recording loudness and the clarity of the pitch information, the first and second phonetic clarity confidence value selection units (301, 304) determine a phonetic clarity confidence value for the near-field and far-field voice outputs, respectively.

The sound quality distortion value is calculated from the degree to which discontinuous frequency characteristics arise within adjacent speech segments when the frequency characteristics are analyzed, as happens when the recording amplitude exceeds the 16-bit range. The first and second sound quality distortion value selection units (302, 305) determine a sound quality distortion value for the near-field and far-field voice outputs, respectively.

The SNR value determined by the first and second SNR value selection units (303, 306) is the signal-to-noise ratio, which expresses the relative magnitude of the noise. Using these three values (the phonetic clarity confidence value, the sound quality distortion value, and the SNR value), the voice selection unit (307) selects the voice input that gives the higher recognition rate.
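A minimal sketch of such a selection is given below. The text does not say how the three values are computed or combined, so the following are assumptions made for illustration: the clarity value is supplied already scaled to 0..1, the share of full-scale samples is used as a simple stand-in for the distortion value, SNR is expressed in decibels, and the unweighted sum in `score` is hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ChannelMetrics:
    clarity: float     # phonetic clarity confidence, assumed scaled to 0..1
    distortion: float  # sound quality distortion, assumed 0..1 (higher is worse)
    snr_db: float      # signal-to-noise ratio in decibels


def compute_snr_db(speech: Sequence[float], noise: Sequence[float]) -> float:
    """10 * log10 of mean speech power over mean noise power."""
    p_speech = sum(v * v for v in speech) / len(speech)
    p_noise = sum(v * v for v in noise) / len(noise)
    return 10.0 * math.log10(p_speech / p_noise)


def clipped_fraction(samples: Sequence[int], full_scale: int = 32_767) -> float:
    """Share of samples at 16-bit full scale (stand-in for distortion)."""
    return sum(abs(s) >= full_scale for s in samples) / len(samples)


def score(m: ChannelMetrics, snr_cap_db: float = 30.0) -> float:
    """Hypothetical combination: reward clarity and SNR, penalize distortion."""
    snr_term = max(0.0, min(m.snr_db, snr_cap_db)) / snr_cap_db
    return m.clarity + snr_term - m.distortion


def select_voice(near: ChannelMetrics, far: ChannelMetrics) -> str:
    """Return "near" or "far", whichever channel scores higher."""
    return "near" if score(near) >= score(far) else "far"
```

Under these assumptions, a near-field recording that is clear but heavily clipped can lose to a cleaner far-field recording, which is the situation the two-channel design is meant to handle.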

FIGS. 5 and 10 show the configuration and operation of the voice recognition unit (400).

The feature extraction unit (401) extracts the characteristics unique to each phonetic value.

The decoder (402) performs recognition using a pre-trained acoustic model and the utterance-related rules of the vocabulary to be input.

The post-processing unit (403) post-processes the recognition result, taking into account linguistic characteristics and the clarity of the utterance timing.

In this way, the present invention achieves high far-field and near-field voice recognition performance at the same time and is robust to external noise.

As described above, the voice recognition apparatus and method robust to speaker distance characteristics according to the present invention achieve high far-field and near-field voice recognition performance at the same time and are robust to external noise.

The present invention also makes it possible to build a device capable of voice recognition regardless of distance.

Furthermore, because the noise removing device can remove the noise that arises at long distances, the present invention enables scenarios with excellent service performance. The device analyzes the signals input through two or more microphones and automatically selects, internally, the voice input with the better recognition performance, so installation poses no difficulty for the user. It can therefore be applied directly to the voice interface of a general home network system or a mobile robot.

Although the above description is limited to preferred embodiments of the present invention, the invention is not limited thereto, and various changes, modifications, and equivalents may be used. The invention may therefore be applied with appropriate modifications of the above embodiments, and such applications naturally fall within the scope of the invention as long as they are based on the technical idea set forth in the claims below.

Claims

A distance-specific voice recording unit in which a near-field voice recording unit and a far-field voice recording unit simultaneously receive and record input voice;

An external noise removing unit that receives the distance-specific voices output from the distance-specific voice recording unit, estimates external noise, and removes it from the recorded voice;

An input voice selection unit that receives the recorded voices from which external noise has been removed by the external noise removing unit, and determines and selects which of the input voices, reflecting far-field and near-field distance characteristics, can improve voice recognition performance; and

A voice recognition unit that receives the voice selected by the input voice selection unit and performs voice recognition;

A voice recognition apparatus robust to speaker distance characteristics, characterized by comprising the units recited above.

The apparatus according to claim 1,

wherein the near-field voice recording unit and the far-field voice recording unit of the distance-specific voice recording unit each comprise:

A recorded voice level classification unit that receives the voices of sample speakers and classifies the recorded voices by level;

A sample speaker input gain calculation unit that receives the output of the recorded voice level classification unit, calculates the input gain for the sample speakers, and outputs input gain values for the distance-specific modes, thereby performing preliminary parameter setting;

A real user input gain calculation unit that receives the voice of a real user, calculates the input gain for the real user, and performs parameter setting for the real user; and

An input gain designation unit that receives the input gain values of the distance-specific modes from the sample speaker input gain calculation unit and the input gain value of the real user output from the real user input gain calculation unit, designates the input gain, and outputs the distance-specific voice.

The apparatus according to claim 1,

wherein the external noise removing unit comprises:

A time-axis feature extraction unit that receives the near-field or far-field distance-specific voice output from the distance-specific voice recording unit and extracts time-axis features;

A time-axis voice detection unit that receives the output of the time-axis feature extraction unit and detects voice on the time axis;

A frequency conversion unit that receives the near-field or far-field distance-specific voice output from the distance-specific voice recording unit and converts it to the frequency domain;

A voice frequency characteristic extraction unit that receives the output of the frequency conversion unit and extracts voice frequency characteristics;

A noise frequency characteristic extraction unit that receives the output of the frequency conversion unit and extracts noise frequency characteristics;

A frequency noise removing unit that receives the outputs of the voice frequency characteristic extraction unit and the noise frequency characteristic extraction unit and removes noise characteristics in the frequency domain;

A frequency-axis voice detection unit that receives the output of the frequency noise removing unit and detects voice on the frequency axis; and

A voice detection result selection unit that receives the outputs of the time-axis voice detection unit and the frequency-axis voice detection unit, selects a voice detection result based on the noise characteristics, and outputs the voice.

The apparatus according to claim 1,

wherein the input voice selection unit comprises:

A first phonetic clarity confidence value selection unit that receives the recorded voice from which external noise has been removed by the external noise removing unit and determines a phonetic clarity confidence value for the near-field voice output, the phonetic clarity being calculated from the recording loudness and the clarity of the pitch information;

A first sound quality distortion value selection unit that receives the output of the first phonetic clarity confidence value selection unit and determines a sound quality distortion value from the degree to which discontinuous frequency characteristics arise within adjacent speech segments when the frequency characteristics are analyzed;

A first SNR value selection unit that receives the output of the first sound quality distortion value selection unit and determines an SNR value, which is the relative magnitude of the noise;

A second phonetic clarity confidence value selection unit that receives the recorded voice from which external noise has been removed by the external noise removing unit and determines a phonetic clarity confidence value for the far-field voice output, the phonetic clarity being calculated from the recording loudness and the clarity of the pitch information;

A second sound quality distortion value selection unit that receives the output of the second phonetic clarity confidence value selection unit and determines a sound quality distortion value from the degree to which discontinuous frequency characteristics arise within adjacent speech segments when the frequency characteristics are analyzed;

A second SNR value selection unit that receives the output of the second sound quality distortion value selection unit and determines an SNR value, which is the relative magnitude of the noise; and

A voice selection unit that receives the outputs of the first and second SNR value selection units, selects the voice input having the higher recognition rate, and outputs the selected voice.

The apparatus according to any one of claims 1 to 4,

wherein the voice recognition unit comprises:

A feature extraction unit that receives the voice selected by the input voice selection unit and extracts the characteristics unique to each phonetic value;

A decoder that receives the output of the feature extraction unit and performs decoding using a speech grammar and an acoustic model; and

A post-processing unit that receives the output of the decoder, performs post-processing in consideration of linguistic characteristics and the clarity of the utterance timing, and outputs a recognition result.

A first step in which a distance-specific voice recording unit simultaneously receives and records voice input at near and far distances;

A second step of estimating external noise in the distance-specific voices after the first step and removing it from the recorded voice;

A third step in which, after the second step, an input voice selection unit receives the recorded voices from which external noise has been removed, and determines and selects which of the input voices, reflecting far-field and near-field distance characteristics, can improve voice recognition performance; and

A fourth step in which, after the third step, a voice recognition unit receives the selected voice and performs voice recognition;

A voice recognition method robust to speaker distance characteristics, characterized by comprising the steps recited above.

The method according to claim 6,

wherein the first step comprises:

An eleventh step of receiving the voices of sample speakers and classifying the recorded voices by level;

A twelfth step of calculating, after the eleventh step, the input gain for the sample speakers and outputting input gain values for the distance-specific modes, thereby performing preliminary parameter setting;

A thirteenth step of receiving the voice of a real user, calculating the input gain for the real user, and performing parameter setting for the real user; and

A fourteenth step of receiving the input gain values of the distance-specific modes from the twelfth step and the input gain value of the real user from the thirteenth step, designating the input gain, and outputting the distance-specific voice.

The method according to claim 6,

wherein the second step comprises:

A twenty-first step of receiving the near-field or far-field distance-specific voice output from the distance-specific voice recording unit and extracting time-axis features;

A twenty-second step of detecting voice on the time axis after the twenty-first step;

A twenty-third step of receiving the near-field or far-field distance-specific voice output from the distance-specific voice recording unit and converting it to the frequency domain;

A twenty-fourth step of extracting voice frequency characteristics and noise frequency characteristics after the twenty-third step;

A twenty-fifth step of removing noise characteristics in the frequency domain after the twenty-fourth step;

A twenty-sixth step of detecting voice on the frequency axis after the twenty-fifth step; and

A twenty-seventh step of selecting, after the twenty-second step and the twenty-sixth step, a voice detection result based on the noise characteristics and outputting the voice.

The method according to claim 6,

wherein the third step comprises:

A thirty-first step of receiving the recorded voice from which external noise has been removed by the external noise removing unit and determining a phonetic clarity confidence value for the near-field voice output, the phonetic clarity being calculated from the recording loudness and the clarity of the pitch information;

A thirty-second step of determining, after the thirty-first step, a sound quality distortion value from the degree to which discontinuous frequency characteristics arise within adjacent speech segments when the frequency characteristics are analyzed;

A thirty-third step of determining, after the thirty-second step, an SNR value, which is the relative magnitude of the noise;

A thirty-fourth step of receiving the recorded voice from which external noise has been removed by the external noise removing unit and determining a phonetic clarity confidence value for the far-field voice output, the phonetic clarity being calculated from the recording loudness and the clarity of the pitch information;

A thirty-fifth step of determining, after the thirty-fourth step, a sound quality distortion value from the degree to which discontinuous frequency characteristics arise within adjacent speech segments when the frequency characteristics are analyzed;

A thirty-sixth step of determining, after the thirty-fifth step, an SNR value, which is the relative magnitude of the noise; and

A thirty-seventh step of receiving the outputs of the thirty-third step and the thirty-sixth step, selecting the voice input having the higher recognition rate, and outputting the selected voice.

The method according to any one of claims 6 to 9,

wherein the fourth step comprises:

A forty-first step of receiving the voice selected by the input voice selection unit and extracting the characteristics unique to each phonetic value;

A forty-second step of performing decoding using a speech grammar and an acoustic model after the forty-first step; and

A forty-third step of performing, after the forty-second step, post-processing in consideration of linguistic characteristics and the clarity of the utterance timing, and outputting a recognition result;

A voice recognition method robust to speaker distance characteristics, characterized by performing the steps recited above.