KR101042499B1

KR101042499B1 - Apparatus and method for processing speech recognition to improve speech recognition performance

Info

Publication number: KR101042499B1
Application number: KR1020080103028A
Authority: KR
Inventors: 류창선; 구명완; 김재인
Original assignee: 주식회사 케이티
Priority date: 2008-10-21
Filing date: 2008-10-21
Publication date: 2011-06-16
Also published as: KR20100043822A

Abstract

The present invention relates to a speech recognition processing apparatus and method for improving speech recognition performance. By re-performing speech recognition with the range of the pronunciation dictionary limited according to the user's evaluation of a speech recognition result, the invention aims to significantly improve speech recognition performance (the speed and success rate of recognition) at low cost.

To this end, the speech recognition processing apparatus for improving speech recognition performance comprises: user interface means for receiving, from a user, a user voice for speech recognition or a user evaluation of a speech recognition result; and speech recognition means for repeatedly performing speech recognition on the input user voice while redefining the pronunciation dictionary by limiting its range according to the user evaluation, and for providing the speech recognition result to the user.

Speech Recognition, Speech Recognition Performance, Pronunciation Dictionary, Corrected Pronunciation Dictionary, Interaction, DTMF

Description

Speech recognition apparatus and method for improving speech recognition performance {APPARATUS AND METHOD FOR PROCESSING SPEECH RECOGNITION TO IMPROVE SPEECH RECOGNITION PERFORMANCE}

The present invention relates to the field of speech recognition processing, and more particularly to a speech recognition processing apparatus and method for improving speech recognition performance, which can significantly improve speech recognition performance at low cost by performing speech recognition processing through interaction with the user.

Many people agree on the convenience and necessity of speech recognition. It is regularly cited as one of the ten key technologies of the future, and much effort is being devoted to improving its performance. In particular, speech recognition technology for speech recognition services, which provide a speech recognition function through a user interface, continues to advance.

However, current speech recognition technology has considerable difficulty delivering the level of service that consumers expect. The reason is that speech recognition fundamentally recognizes its targets on the basis of a pronunciation dictionary.

In detail, when the service scenario calls for speech recognition, the analog voice input by the user is digitized inside the speech recognition module and converted into a probabilistic representation. The probability values for the input voice are then compared with the pre-trained phoneme-based probability values and with the probability values for the recognition target words based on the given pronunciation dictionary.

As a result of the comparison, the target word that is probabilistically closest is found in the pronunciation dictionary and output as the speech recognition result. Because speech recognition is a probabilistic approach, there is always some probability of error, even if only 0.0001%, and this means that service quality varies. For this reason, speech recognition still has limits even today, when the technology is advancing day by day.
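The selection step described above amounts to taking the best-scoring entry of the pronunciation dictionary. The following is a minimal Python sketch, assuming the acoustic comparison has already produced one score per dictionary word; the function name `pick_best` and the score values are illustrative and do not come from the patent.

```python
def pick_best(scores: dict[str, float]) -> str:
    """Return the dictionary word with the highest match probability.

    `scores` maps each word in the pronunciation dictionary to a
    hypothetical probability that it matches the input voice.
    """
    if not scores:
        raise ValueError("pronunciation dictionary is empty")
    return max(scores, key=lambda word: scores[word])


# Illustrative scores (assumed values): a wrong word narrowly wins, which is
# the failure mode the description attributes to probabilistic matching.
example_scores = {"홍길동": 0.41, "홍길서": 0.43, "김철수": 0.02}
best = pick_best(example_scores)
```

With these assumed scores the recognizer returns 홍길서 even though the speaker said 홍길동, because the decision rests only on which candidate scores highest.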

This problem must be solved for word-based speech recognition services. While the limits of speech recognition technology remain, a new service built on the existing speech recognition method will ultimately fail because it cannot satisfy users' needs.

FIG. 1 is a flowchart of a conventional speech recognition service method, showing a method performed by a speech recognition processing apparatus (which may also be called a speech recognition server).

As shown in FIG. 1, the conventional speech recognition service first plays a brief description of the service (a service overview announcement) (102) to a user who has connected to the service by telephone (100), and then asks the user for a voice input to proceed with a particular step of the service scenario (104).

When a user voice for speech recognition is received from the user (106), speech recognition is performed on it (108). The speech recognition result is then played back to the user (110), and the user is asked to indicate whether it is correct (112).

If the check (112) shows that the result is correct, the service scenario corresponding to the result continues (114). If it is incorrect and the number of recognition failures so far has reached, for example, three, a service termination announcement is played to the user (116) and the service ends; if the number of failures is two or fewer, voice input is requested again (104) and the subsequent steps are repeated.

When recognition is repeated through voice re-entry in this way, if the pronunciation dictionary that serves as the recognition domain contains 10,000 words, recognition is performed over the same 10,000 words on every repetition. Consequently, even if recognition is repeated on retry, the second and third attempts are about as likely to fail as the first.
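The conventional flow of FIG. 1 can be sketched as a simple retry loop. This is an illustration only: the callbacks `get_voice`, `recognize`, and `confirm` are hypothetical stand-ins for the telephony prompts and the recognition engine, and the numbers in the comments refer to the steps of FIG. 1.

```python
from typing import Callable, Optional, Sequence


def conventional_flow(
    get_voice: Callable[[], str],
    recognize: Callable[[str, Sequence[str]], str],
    confirm: Callable[[str], bool],
    dictionary: Sequence[str],
    max_failures: int = 3,
) -> Optional[str]:
    """Retry recognition with new voice input and an unchanged dictionary."""
    failures = 0
    while failures < max_failures:
        voice = get_voice()                    # 104, 106: ask for and receive voice
        result = recognize(voice, dictionary)  # 108: always the full dictionary
        if confirm(result):                    # 110, 112: user says right or wrong
            return result                      # 114: continue the scenario
        failures += 1
    return None                                # 116: termination announcement
```

Every pass hands the recognizer the same full dictionary, so nothing learned from a rejected result carries over to the next attempt.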

Because the prior art merely checks whether the result is right or wrong and then performs recognition again from the beginning, the outcome, however many times recognition is repeated, depends on the computing power of the speech recognition processing server, the size of the pronunciation dictionary, and similar factors.

That is, unless the computing power of the speech recognition processing server or the size of the pronunciation dictionary is improved, repeating recognition will not change the result much.

The computing power of the server and the size of the pronunciation dictionary greatly affect speech recognition performance, but once the system has been built, improving them is enormously expensive.

Therefore, telecommunication carriers that provide nationwide services (for example, KT Corporation) in particular urgently need a way to improve speech recognition performance without expanding computing power or the pronunciation dictionary, in order to reduce system construction costs.

Accordingly, an object of the present invention is to provide a speech recognition processing apparatus and method for improving speech recognition performance, which can improve speech recognition performance without expanding the computing power of the speech recognition processing server or the size of the pronunciation dictionary.

The objects of the present invention are not limited to the object mentioned above. Other objects and advantages not mentioned can be understood from the following description and will become clearer from the embodiments of the present invention. It will also be readily apparent that the objects and advantages of the present invention can be realized by the means recited in the claims and combinations thereof.

To achieve the above object, the present invention is characterized by performing speech recognition processing through interaction with the user.

That is, the present invention is characterized by re-performing speech recognition with the range of the pronunciation dictionary limited according to the user's evaluation of the speech recognition result.

More specifically, the present invention provides a speech recognition processing apparatus for improving speech recognition performance, comprising: user interface means for receiving, from a user, a user voice for speech recognition or a user evaluation of a speech recognition result; and speech recognition means for repeatedly performing speech recognition on the input user voice while redefining the pronunciation dictionary by limiting its range according to the user evaluation, and for providing the speech recognition result to the user.

The present invention also provides a speech recognition processing method for improving speech recognition performance, comprising: a first speech recognition step of receiving a user voice for speech recognition from a user and performing speech recognition; a user evaluation step of providing the speech recognition result for the user voice to the user and receiving a user evaluation; and a second speech recognition step of redefining the pronunciation dictionary by limiting its range according to the user evaluation and then re-performing speech recognition on the user voice.

The invention described above significantly improves speech recognition performance at low cost by performing speech recognition processing through interaction with the user. That is, by re-performing speech recognition with the range of the pronunciation dictionary limited according to the user's evaluation of the recognition result, the invention can significantly improve the speed and accuracy (success rate) of speech recognition, and can thereby provide consumers with a better-quality speech recognition service.

In addition, the present invention can significantly increase speech recognition performance with only minor modification of an existing speech recognition system, without expanding the pronunciation dictionary or the computing power. It can therefore improve the performance of existing speech recognition systems and also stimulate the development of new services that provide a speech recognition interface.

The above objects, features, and advantages will become more apparent from the following detailed description taken in conjunction with the accompanying drawings, so that those of ordinary skill in the art to which the present invention pertains can readily practice its technical idea. In describing the present invention, detailed descriptions of related known techniques are omitted where they are judged to obscure the gist of the invention unnecessarily. Hereinafter, a preferred embodiment of the present invention is described in detail with reference to the accompanying drawings.

FIG. 2 is a block diagram of an embodiment of a speech recognition processing apparatus (server) for improving speech recognition performance according to the present invention.

As shown in FIG. 2, the speech recognition processing apparatus (server) according to the present invention comprises a user interface unit 221, a user voice DB 222, a speech recognition unit 223, an initial pronunciation dictionary DB 224, and a corrected pronunciation dictionary DB 225. Each component is described in detail below.

The user interface unit 221 receives a user voice for speech recognition from a user terminal 20 connected through a telephone network 21, or provides various announcements to the user terminal 20 and receives the resulting evaluation of the speech recognition result. Here, the user terminal is any terminal capable of conveying the user's voice, such as a wired telephone, a mobile communication terminal, or an Internet telephone. The telephone network 21 is a network that is connected to the user terminal 20 and can deliver the user's voice to the speech recognition processing apparatus (server) 22; it includes mobile communication networks, the public switched telephone network, IP networks, and the like.

The user voice database (DB) 222 stores the user voice received through the user interface unit 221. The voice is stored so that the initially input user voice can be reused, without asking for new input, when the pronunciation dictionary is redefined and speech recognition is performed again. That is, once the user voice has been input, the speech recognition processing apparatus (server) repeats the recognition process on that voice several times while redefining the pronunciation dictionary.

The speech recognition unit 223 performs speech recognition on the user voice input through the user interface unit 221 and provides the result (the speech recognition result). It also redefines the pronunciation dictionary according to the evaluation input by the user (the evaluation of the speech recognition result) and re-performs speech recognition.

That is, the speech recognition unit 223 performs speech recognition on the user voice using the initial pronunciation dictionary 224 and provides the result to the user terminal 20. When the user terminal responds with an evaluation of the speech recognition result (the user evaluation), the speech recognition unit 223 redefines the pronunciation dictionary on the basis of that evaluation and re-performs speech recognition using the redefined dictionary (hereinafter, the "corrected pronunciation dictionary") 225.

Next, the method of evaluating the speech recognition result (user evaluation) is described. The user evaluates the result by pressing the button on the user terminal whose number equals the number of correctly recognized syllables in the result. For example, when the user presses the 3 button, the key press is converted into a DTMF (Dual-Tone Multi-Frequency) signal and delivered to the speech recognition processing apparatus (server) 22, and the speech recognition unit 223 determines from the DTMF signal that three syllables of the result were recognized correctly.

The evaluation method is not limited to button input as above. The user may also speak the number of correctly recognized syllables aloud (evaluation by voice input).

The process of generating the corrected pronunciation dictionary is as follows. The speech recognition unit 223 excludes the speech recognition result that was the subject of the user evaluation and redefines the range of the pronunciation dictionary as the words that contain the characters up to the syllable count indicated by the user evaluation (the number of correctly recognized syllables).

For example, when the user says "홍길동" (Hong Gil-dong), the speech recognition unit 223 performs speech recognition against the initial pronunciation dictionary and plays back the result "홍길서" (Hong Gil-seo). When the user, having heard the result, presses the 2 button (because the result is correct up to "홍길"), the speech recognition unit 223 excludes the evaluated result ("홍길서") and takes as its targets all words that begin with "홍길" (for example, 홍길남, 홍길동, and 홍길북). The pronunciation dictionary restricted in this way is the corrected pronunciation dictionary.
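The dictionary restriction in this example can be sketched in Python as a prefix filter. The function name `build_corrected_dictionary` and the sample word list are illustrative, and the sketch assumes that each Hangul syllable is stored as one precomposed character, so that string slicing counts syllables.

```python
def build_corrected_dictionary(dictionary, rejected_result, correct_syllables):
    """Restrict `dictionary` to words sharing the confirmed leading syllables.

    Assumes each Hangul syllable is a single precomposed character, so that
    slicing the string counts syllables.
    """
    if not 0 < correct_syllables <= len(rejected_result):
        raise ValueError("syllable count must fall within the rejected result")
    prefix = rejected_result[:correct_syllables]
    return [
        word
        for word in dictionary
        if word.startswith(prefix) and word != rejected_result
    ]


initial_dictionary = ["김철수", "홍길남", "홍길동", "홍길북", "홍길서"]
# The user heard "홍길서" and pressed 2: the first two syllables are correct.
corrected = build_corrected_dictionary(initial_dictionary, "홍길서", 2)
```

Here `corrected` holds 홍길남, 홍길동, and 홍길북: the rejected result and every word with a different beginning are gone.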

FIG. 3 is a flowchart of an embodiment of a speech recognition processing method for improving speech recognition performance according to the present invention, showing a method performed by the speech recognition processing apparatus (server) shown in FIG. 2.

First, the speech recognition processing method according to the present invention is outlined.

Speech recognition is performed on the user voice input for recognition, and the user is then asked up to which syllable the result is correct.

If the result is correct through the second syllable and the user accordingly presses button 2, the pronunciation dictionary is searched for words containing the result up to its second syllable, and the words found are redefined as a temporary pronunciation dictionary (the corrected pronunciation dictionary). Speech recognition is then performed on the previously stored user voice using the redefined dictionary.

After recognition has been re-performed, the result is again played back to the user to check up to which syllable it is correct. If the whole result is correct, the flow jumps to the corresponding menu; if it is incorrect, the flow jumps to the corresponding service routine.

The present invention is described with a concrete example below.

Suppose the user utters "미래기술연구소" (Future Technology Research Institute). If the speech recognition result is "미래기술" (only the first four syllables), the user presses button 4.

The speech recognition processing apparatus then searches the 10,000-word pronunciation dictionary for words that begin with "미래기술" and defines a new pronunciation dictionary from the recognition target words found.

The speech recognition engine is then initialized with the redefined pronunciation dictionary as input, and speech recognition is re-performed with the previously stored user voice as input. The user does not need to speak again, and because the number of recognition target words drops sharply, the recognition rate can improve markedly.
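The re-recognition step in this example can be illustrated with a toy engine. Everything here is an assumption made for the sketch: `ToyEngine` is a stand-in for a real recognition engine, the stored voice is a placeholder byte string, and the scores are invented so that the truncated word wins the first pass.

```python
class ToyEngine:
    """Hypothetical stand-in for a recognition engine bound to one dictionary."""

    def __init__(self, dictionary, scores):
        self.dictionary = list(dictionary)
        self._scores = scores  # invented per-utterance word scores

    def recognize(self, voice):
        word_scores = self._scores[voice]
        return max(self.dictionary, key=lambda w: word_scores.get(w, 0.0))


full_dictionary = ["미래기술", "미래기술연구소", "미래기술전략팀", "기술지원센터"]
stored_voice = b"stored-utterance"  # placeholder for the saved user voice
scores = {
    stored_voice: {
        "미래기술": 0.40,
        "미래기술연구소": 0.38,
        "미래기술전략팀": 0.12,
        "기술지원센터": 0.10,
    }
}

# First pass over the full dictionary.
first = ToyEngine(full_dictionary, scores).recognize(stored_voice)

# The user presses 4: keep words that start with the first four syllables,
# dropping the rejected result itself.
prefix = first[:4]
corrected = [w for w in full_dictionary if w.startswith(prefix) and w != first]

# Re-initialize the engine with the corrected dictionary and reuse the
# stored voice; no new utterance is requested.
second = ToyEngine(corrected, scores).recognize(stored_voice)
```

Under these invented scores the first pass yields 미래기술 and the second pass yields 미래기술연구소, since the runner-up becomes the best candidate once the rejected word is removed.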

Hereinafter, the speech recognition processing method according to the present invention is described in detail with reference to FIG. 3.

The speech recognition processing apparatus 22 plays a brief description of the service (a service overview announcement) (302) to a user terminal that has connected by telephone (300), and then asks the user for a voice input for speech recognition (304).

When a user voice for speech recognition is received from the user, the speech recognition processing apparatus 22 stores the input user voice (306) and performs speech recognition on it (308).

After performing speech recognition (308), the speech recognition processing apparatus 22 plays the result back to the user and checks whether it is correct (310). Specifically, it provides an announcement such as "Press # if the speech recognition result is correct, or press star (*) if it is incorrect."

From the user's response, the apparatus checks whether the result is correct (312). If it is correct, the corresponding service scenario is performed (314); if it is incorrect, the apparatus checks whether the number of speech recognition errors so far has reached a reference value (for example, four) (316).

If the error count is less than four, the apparatus provides an announcement that prompts an evaluation (detailed evaluation) of the recognition result ("Check up to which syllable the announcement you heard is correct, and press the button corresponding to the number of correct syllables") (318). Depending on the embodiment, the step of checking whether the result is correct (312) and the step of providing the evaluation prompt (318) may be implemented as a single step. That is, the apparatus may play an announcement such as "Press # if the speech recognition result is correct; if it is incorrect, check up to which syllable it is correct and press the button corresponding to the number of correct syllables", and then act on the input received from the user terminal. If the DTMF signal input from the user terminal indicates "0", the result is entirely wrong, so the flow returns to step 304 to receive new voice input.
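For the combined prompt described above, the handling of a single key press can be sketched as follows. The `Action` enum and `interpret_key` function are hypothetical names, the enum values simply echo the step numbers of FIG. 3, and rejecting any other key with an error is an assumption, since the description does not say how unexpected keys are handled.

```python
from enum import Enum


class Action(Enum):
    CONTINUE_SCENARIO = "314"   # result confirmed
    REQUEST_NEW_VOICE = "304"   # nothing was right: ask the user to speak again
    REBUILD_DICTIONARY = "322"  # some leading syllables were right


def interpret_key(key: str) -> tuple[Action, int]:
    """Map one DTMF key to an action and a count of correct syllables."""
    if key == "#":
        return Action.CONTINUE_SCENARIO, 0
    if key == "0":
        return Action.REQUEST_NEW_VOICE, 0
    if len(key) == 1 and key in "123456789":
        return Action.REBUILD_DICTIONARY, int(key)
    # Assumption: any other key is treated as invalid input.
    raise ValueError(f"unexpected key: {key!r}")
```

A caller would typically replay the prompt on `ValueError`, but that choice is outside what the description specifies.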

Having heard the evaluation prompt, the user evaluates the speech recognition result. Evaluation methods include evaluation by button input and evaluation by voice input; the description of FIG. 3 uses button input as the example.

The speech recognition processing apparatus 22 receives the DTMF signal from the user terminal 20 and determines the number of correctly recognized syllables (320). It then searches the pronunciation dictionary for words containing the result up to that syllable count (the number indicated by the DTMF signal) and redefines them as a new pronunciation dictionary (the corrected pronunciation dictionary) (322).

The speech recognition processing apparatus 22 re-initializes the speech recognition engine on the basis of the redefined pronunciation dictionary (324), re-performs speech recognition with the user voice stored in step 306 as input (326), and returns to step 310, so that step 310 and the subsequent steps are repeated.

If the check of the error count (316) shows that the count has reached four, the apparatus asks the user whether he or she wishes to re-enter the voice (328). If so, the flow returns to step 304; if not, a service termination announcement is played (330). To confirm this intention, the speech recognition processing apparatus 22 provides an announcement such as "The number of speech recognition errors has exceeded the reference count. Press # to re-enter the word to be recognized, or press star (*) to end."
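Steps 304 to 330 of FIG. 3 can be put together in one control loop. This is a sketch under several assumptions that the description leaves open: the callbacks are hypothetical stand-ins for the prompts and the engine, the error counter restarts whenever new voice input is requested, each narrowing filters the current candidate list (so earlier rejected results stay excluded), and an empty candidate list sends the flow back to step 304.

```python
from typing import Callable, Optional, Sequence


def improved_flow(
    get_voice: Callable[[], str],
    recognize: Callable[[str, Sequence[str]], str],
    confirm: Callable[[str], bool],
    count_correct: Callable[[str], int],
    wants_retry: Callable[[], bool],
    dictionary: Sequence[str],
    error_limit: int = 4,
) -> Optional[str]:
    """Re-recognize one stored voice against a progressively narrowed dictionary."""
    while True:
        voice = get_voice()                    # 304, 306: request and store the voice
        active = list(dictionary)              # start from the initial dictionary
        errors = 0                             # assumption: reset per new voice
        while True:
            result = recognize(voice, active)  # 308 on first pass, 326 afterwards
            if confirm(result):                # 310, 312
                return result                  # 314: continue the service scenario
            errors += 1
            if errors >= error_limit:          # 316
                if wants_retry():              # 328
                    break                      # back to 304
                return None                    # 330: termination announcement
            n = count_correct(result)          # 318, 320: DTMF digit
            if n == 0:
                break                          # all wrong: back to 304
            prefix = result[:n]
            active = [                         # 322, 324: corrected dictionary
                w for w in active if w.startswith(prefix) and w != result
            ]
            if not active:
                break                          # assumption: nothing left, back to 304
```

Unlike the conventional loop, the inner loop never calls `get_voice` again: the stored voice is reused while the candidate list shrinks.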

The present invention as described above can be applied to various services that use speech recognition, such as services in which the service menu of each step is selected by speech recognition and automatic telephone connection services that use speech recognition.

As shown in FIGS. 2 and 3, the speech recognition processing apparatus and method according to the present invention may be implemented in a "server-client" arrangement, in which the user voice and the user evaluation are received from a user terminal 20 connected through a network such as a telephone network. Depending on the embodiment, however, the invention is not limited to this and may also be implemented as a "single speech recognition processing apparatus".

That is, when the invention is implemented as a "single system", such as a user terminal with a speech recognition function (a mobile phone, PDA, electronic dictionary, and so on) or another information processing device, the user voice is input by the user or an operator through the input unit of that system and speech recognition is performed on it. In this case, the user evaluation of the speech recognition result is made through the system's own inputs, such as button input, touch input (on a touchpad or touchscreen), or voice input.

Meanwhile, the method of the present invention described above can be written as a computer program. The code and code segments constituting the program can be readily inferred by a computer programmer skilled in the art. The program is stored in a computer-readable recording medium (information storage medium) and implements the method of the present invention when read and executed by a computer. The recording medium includes all types of computer-readable recording media.

That is, for speech recognition processing that improves speech recognition performance, the present invention provides a computer-readable recording medium on which is recorded a program for causing a speech recognition processing apparatus having a processor to realize: a first speech recognition function of receiving, from a user, a user voice for speech recognition and performing speech recognition; a user evaluation function of providing the speech recognition result for the user voice to the user and receiving a user evaluation; and a second speech recognition function of redefining the pronunciation dictionary by limiting its range according to the user evaluation and then re-performing the speech recognition process on the user voice.
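The three functions listed above can be sketched as one loop. This is a hedged illustration under stated assumptions, not the patented implementation. Words are modeled as tuples of syllables. The recognizer and the user evaluation are passed in as callables (`recognize`, `evaluate`). The evaluation is assumed to return `None` when the user accepts the result and otherwise the number of correctly recognized leading syllables. Restricting the dictionary to words that share that leading portion is one reading of the claims. All names here are hypothetical.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Word = Tuple[str, ...]  # a word as a tuple of syllables (modeling assumption)


def restrict_dictionary(dictionary: Sequence[Word], previous: Word,
                        correct_syllables: int) -> List[Word]:
    """Keep words sharing the confirmed leading syllables, minus the rejected result."""
    prefix = previous[:correct_syllables]
    return [w for w in dictionary
            if w[:correct_syllables] == prefix and w != previous]


def recognize_with_feedback(utterance: object,
                            dictionary: Sequence[Word],
                            recognize: Callable[[object, List[Word]], Word],
                            evaluate: Callable[[Word], Optional[int]],
                            max_attempts: int = 4) -> Optional[Word]:
    """Re-run recognition on the same utterance over a shrinking dictionary."""
    candidates = list(dictionary)
    for _ in range(max_attempts):
        if not candidates:
            return None  # nothing left to try
        result = recognize(utterance, candidates)  # first or repeated recognition
        correct = evaluate(result)                 # user evaluation
        if correct is None:
            return result                          # user accepted the result
        candidates = restrict_dictionary(candidates, result, correct)
    return None  # attempt limit reached without success
```

Because the stored utterance is reused on every pass, the user is asked only for a short evaluation on each retry rather than for the spoken word again.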

The present invention described above is not limited by the foregoing embodiments and the accompanying drawings, since those of ordinary skill in the art to which the invention pertains can make various substitutions, modifications, and changes without departing from the technical spirit of the invention.

FIG. 1 is a flowchart of a conventional speech recognition service method;

FIG. 2 is a configuration diagram of an embodiment of a speech recognition processing apparatus (server) for improving speech recognition performance according to the present invention; and

FIG. 3 is a flowchart of an embodiment of a speech recognition processing method for improving speech recognition performance according to the present invention.

* Explanation of reference numerals for the main parts of the drawings

20: user terminal

21: telephone network

22: speech recognition processing apparatus

221: user interface unit

222: user voice DB

223: speech recognition unit

224: initial pronunciation dictionary DB

225: corrected pronunciation dictionary DB

Claims

A speech recognition processing apparatus for improving speech recognition performance, comprising:

user interface means for receiving, from a user, a user voice for speech recognition or a user evaluation of a speech recognition result; and

speech recognition means for repeating speech recognition on the input user voice while limiting the range of a pronunciation dictionary according to the user evaluation, and for providing the speech recognition result to the user.

The apparatus of claim 1, wherein the user evaluation indicates the number of syllables correctly recognized in the speech recognition result.

The apparatus of claim 2, wherein the speech recognition means redefines the range of the pronunciation dictionary to words that include the characters up to the number of syllables indicated by the user evaluation, while excluding the speech recognition result that was the subject of the user evaluation.

The apparatus of claim 1, wherein the speech recognition means repeats the speech recognition using the user voice initially input through the user interface means.

The apparatus of claim 2, wherein the user evaluation is input through a DTMF signal or the user's voice.

The apparatus of claim 5, wherein the user is a user of a telephone terminal connected through a telephone network.

The apparatus of claim 1, wherein the user interface means prompts the input of the user voice or the user evaluation through announcements.

A speech recognition processing method for improving speech recognition performance, comprising:

a first speech recognition step of receiving, from a user, a user voice for speech recognition and performing speech recognition;

a user evaluation step of providing the speech recognition result for the user voice to the user and receiving a user evaluation; and

a second speech recognition step of redefining a pronunciation dictionary by limiting its range according to the user evaluation and then re-performing the speech recognition process on the user voice.

The method of claim 8, wherein the user evaluation indicates the number of syllables correctly recognized in the speech recognition result.

The method of claim 9, wherein the second speech recognition step redefines the range of the pronunciation dictionary to words that include the characters up to the number of syllables indicated by the user evaluation, while excluding the speech recognition result that was the subject of the user evaluation.

The method of claim 8, wherein the second speech recognition step re-performs the speech recognition process and then returns to the user evaluation step.

The method of claim 11, wherein the second speech recognition step is repeated, within a preset number of times, until the speech recognition succeeds.

The method of claim 9, wherein the user evaluation is input through a DTMF signal or the user's voice.