KR101207435B1

KR101207435B1 - Interactive speech recognition server, interactive speech recognition client and interactive speech recognition method thereof

Info

Publication number: KR101207435B1
Application number: KR1020120074364A
Authority: KR
Inventors: 이상호; 강국진; 구동욱; 김훈
Original assignee: 다이알로이드(주)
Priority date: 2012-07-09
Filing date: 2012-07-09
Publication date: 2012-12-04

Abstract

PURPOSE: An interactive speech recognition server, an interactive speech recognition client, and an interactive speech recognition method are provided that enable a client to designate a conversation structure specialized for its service and thereby obtain speech recognition results suited to that service. CONSTITUTION: An interactive speech recognition server receives conversation type information from a client. The server receives at least one recognition target speech from the client based on the conversation type information. The server performs speech recognition on the recognition target speech and transmits the speech recognition result for the recognition target speech to the client. [Reference numerals] (200) Input module; (210) Information transmitting module; (220) Voice transmitting module; (230) Result receiving module; (240) TTS module

Description

Interactive speech recognition server, interactive speech recognition client and interactive speech recognition method

The present invention relates to an interactive speech recognition server, an interactive speech recognition client, and an interactive speech recognition method. More specifically, it relates to an interactive speech recognition method that, based on conversation type information provided by a client, can output a guide voice to a user or receive speech from the user and perform speech recognition on it, and to an interactive speech recognition server and an interactive speech recognition client that perform the method.

Speech recognition technology refers to technology in which a computer analyzes an acoustic speech signal, such as an utterance spoken by a human, and recognizes or understands it. It exploits the fact that speech has characteristic frequencies that depend on the shape of the mouth and the position of the tongue during pronunciation: the uttered speech is converted into an electrical signal, and the frequency characteristics of the speech signal are then extracted to recognize the pronunciation.

Meanwhile, speech is the most effective and natural means of communication in everyday life, and the processing of users' voice commands has been proposed and developed as a new interface between humans and machines as speech recognition technology has advanced. Voice input can be used easily even by people who are not adept with conventional input devices such as keyboards and mice. It allows information to be entered quickly, and it leaves the hands free for other tasks while information is being entered. In particular, on smartphones with touch screens, which have spread rapidly in recent years, operations such as text entry are performed through a virtual keyboard displayed on the touch screen. Voice input has emerged as a strong alternative that can relieve the inconvenience of virtual-keyboard input, and there is a steadily growing demand for ways to let those who wish to develop a speech recognition service implement it easily.

A conventional speech recognition method generally extracts feature data from the speech uttered by the user and uses the extracted feature data to select, from among the words registered in a vocabulary database, a word with high similarity. When the vocabulary database contains too many words, many mutually similar words exist, and the recognition rate decreases as a result.

The technical problem to be solved by the present invention is to address the conventional problems described above by providing a speech recognition API that can be readily used by those who wish to develop a given speech recognition service operating on a speech recognition client.

A further object is to provide a speech recognition system and a speech recognition method that improve the recognition rate and enable fast speech recognition by allowing the client that requests speech recognition to determine the vocabulary set to be used for the recognition.

According to one aspect of the present invention, there is provided an interactive speech recognition method comprising: (a) receiving, by a speech recognition server from a client, conversation type information including recognition target speech information corresponding to each of at least one recognition target speech to be uttered sequentially by a user; (b) sequentially receiving, by the speech recognition server, each of the at least one recognition target speech from the client based on the conversation type information; (c) performing, by the speech recognition server, speech recognition on each of the received at least one recognition target speech; and (d) transmitting, by the speech recognition server, a speech recognition result for each of the at least one recognition target speech to the client.

In one embodiment, each piece of recognition target speech information includes item guide text, and step (b) may include, sequentially for each of the at least one recognition target speech, the speech recognition server converting the item guide text included in the recognition target speech information corresponding to that recognition target speech into speech, transmitting it to the client, and receiving the recognition target speech from the client.

In one embodiment, each of the at least one piece of recognition target speech information includes a speech recognition result variable, and step (d) may include, for each of the at least one recognition target speech, the speech recognition server storing the result of recognizing the recognition target speech in the speech recognition result variable corresponding to that recognition target speech and transmitting it to the client.
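As a rough illustration of steps (a) through (d) together with the item guide text and result variables, the following Python sketch walks through one session. It is a sketch under assumptions rather than the patented implementation: `TargetSpeechInfo`, `run_session`, and the `recognize` and `send_prompt` callables are hypothetical names introduced here, and the recognition engine and TTS are reduced to stand-in functions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass
class TargetSpeechInfo:
    """One entry of recognition target speech information (hypothetical shape)."""
    item_guide_text: str   # prompt spoken to the user before the utterance
    result_variable: str   # name under which the recognition result is returned


def run_session(
    targets: List[TargetSpeechInfo],      # (a) conversation type information already received
    utterances: Iterable[bytes],          # audio arriving from the client, in order
    recognize: Callable[[bytes], str],    # stand-in for the recognition engine
    send_prompt: Callable[[str], None],   # stand-in for TTS plus transmission to the client
) -> Dict[str, str]:
    results: Dict[str, str] = {}
    audio_stream = iter(utterances)
    for info in targets:
        send_prompt(info.item_guide_text)                  # (b) prompt the user
        audio = next(audio_stream)                         # (b) receive one utterance
        results[info.result_variable] = recognize(audio)   # (c) recognize it
    return results                                         # (d) returned to the client
```

Keying the results by result variable lets the client map each recognized value back to the input field it belongs to without relying on position.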

In one embodiment, the conversation type information further includes result guide text containing the speech recognition result variables included in the at least one piece of recognition target speech information, and the interactive speech recognition method may further include generating a result guide voice based on the result guide text and transmitting the generated result guide voice to the client.
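One way to picture result guide text is as a template whose placeholders are the result variables. The source does not state how variables are written inside the text, so the `$name` placeholder syntax below is an assumption, as are the function name `render_result_guide` and the sample sentence. The sketch uses Python's standard `string.Template`.

```python
from string import Template
from typing import Dict


def render_result_guide(result_guide_text: str, results: Dict[str, str]) -> str:
    """Fill result variables into the result guide text before it is sent to TTS.

    Assumes variables appear as $name placeholders (hypothetical syntax).
    Placeholders with no recognized value are left untouched.
    """
    return Template(result_guide_text).safe_substitute(results)


text = render_result_guide(
    "Searching for trains from $departure to $destination.",
    {"departure": "Seoul", "destination": "Busan"},
)
```

The rendered string would then be converted to speech by the TTS stage and played back as the result guide voice.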

In one embodiment, the interactive speech recognition method further includes (e) determining, by the speech recognition server before step (c), for each piece of recognition target speech information and based on that information, the recognition target vocabulary pool of the recognition target speech corresponding to that information. Step (c) may then include the speech recognition server performing speech recognition on each of the received at least one recognition target speech using the recognition target vocabulary pool of that recognition target speech.

In one embodiment, each of the at least one piece of recognition target speech information includes vocabulary pool identification information for identifying the recognition target vocabulary pool of the corresponding recognition target speech, the conversation type information further includes at least one vocabulary set, and each of the at least one vocabulary set may be identified by one of the pieces of vocabulary pool identification information.
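The relationship between recognition target speech information, vocabulary pool identification information, and vocabulary sets can be sketched as a small data model. All class and field names below are hypothetical, chosen only to mirror the terms in the text; the actual wire format of the conversation type information is not specified here.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TargetSpeechInfo:
    item_guide_text: str      # prompt for this item
    result_variable: str      # where the recognition result is stored
    vocabulary_pool_id: str   # identifies the vocabulary pool for this utterance


@dataclass
class ConversationTypeInfo:
    targets: List[TargetSpeechInfo]
    # Vocabulary sets shipped with the request, keyed by pool identifier.
    vocabulary_sets: Dict[str, List[str]] = field(default_factory=dict)
    result_guide_text: str = ""


def pool_ids_without_inline_set(info: ConversationTypeInfo) -> List[str]:
    """Pool identifiers referenced by a target but not supplied inline."""
    return [t.vocabulary_pool_id for t in info.targets
            if t.vocabulary_pool_id not in info.vocabulary_sets]
```

In this model a client may send a vocabulary set once and refer to it by identifier afterwards; identifiers returned by `pool_ids_without_inline_set` are the ones the server would have to look up elsewhere.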

In one embodiment, step (e) may further include, for each piece of recognition target speech information, when the vocabulary set identified by the vocabulary pool identification information included in that information is included in the conversation type information, the speech recognition server determining that vocabulary set to be the recognition target vocabulary pool of the corresponding recognition target speech.

In one embodiment, the speech recognition method may further include, for each piece of recognition target speech information, when the vocabulary set identified by the vocabulary pool identification information included in that information is included in the conversation type information, the speech recognition server storing that vocabulary set in a database.

In one embodiment, step (e) may include, for each piece of recognition target speech information, when the vocabulary set identified by the vocabulary pool identification information included in that information is not included in the conversation type information, the speech recognition server determining, from among the vocabulary sets stored in the database, the vocabulary set identified by the vocabulary pool identification information to be the recognition target vocabulary pool of the corresponding recognition target speech.
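The three embodiments above (use an inline vocabulary set, store it, and fall back to the stored copy) can be sketched as a single lookup. This is an illustrative sketch only: `resolve_vocabulary_pool` is a hypothetical name, and a plain dictionary stands in for the database.

```python
from typing import Dict, List, Optional


def resolve_vocabulary_pool(
    pool_id: str,
    inline_sets: Dict[str, List[str]],   # vocabulary sets in the conversation type information
    database: Dict[str, List[str]],      # stand-in for the server-side database
) -> Optional[List[str]]:
    """Determine the recognition target vocabulary pool for one utterance."""
    if pool_id in inline_sets:
        # The set came with the request: store it for later sessions and use it now.
        database[pool_id] = list(inline_sets[pool_id])
        return database[pool_id]
    # Otherwise fall back to a previously stored set; None if it was never supplied.
    return database.get(pool_id)
```

How the server should react when the identifier is found in neither place is not covered by the text above, so the sketch simply returns `None` in that case.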

In one embodiment, the speech recognition method further includes the speech recognition server generating, for each of the at least one recognition target speech, a lexical tree for the vocabulary included in the recognition target vocabulary pool of that recognition target speech. Step (c) may then include the speech recognition server performing speech recognition on each of the at least one recognition target speech using the lexical tree for the recognition target vocabulary pool of that recognition target speech.
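A lexical tree is essentially a prefix tree over the vocabulary, so that words sharing a beginning share a path. The sketch below builds one over characters purely for illustration; recognizers commonly build such trees over pronunciation units rather than letters, and the source does not say which unit is used, so the character-level choice and the function names are assumptions.

```python
from typing import Dict, Iterable, List

END = ""  # marker key for "a word ends here"; never collides with a one-character unit


def build_lexical_tree(vocabulary: Iterable[str]) -> Dict:
    """Build a prefix tree in which words with a common prefix share nodes."""
    root: Dict = {}
    for word in vocabulary:
        node = root
        for unit in word:
            node = node.setdefault(unit, {})
        node[END] = word
    return root


def words_with_prefix(tree: Dict, prefix: str) -> List[str]:
    """Return, in sorted order, every vocabulary word that starts with the prefix."""
    node = tree
    for unit in prefix:
        if unit not in node:
            return []
        node = node[unit]
    found: List[str] = []
    stack = [node]
    while stack:
        current = stack.pop()
        for key, child in current.items():
            if key == END:
                found.append(child)
            else:
                stack.append(child)
    return sorted(found)
```

Because the tree is built from a small, client-supplied vocabulary pool, the number of branches to explore at each step stays small compared with a tree over a general dictionary.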

According to another aspect of the present invention, there is provided an interactive speech recognition method comprising: (a) transmitting, by a speech recognition client to a speech recognition server, conversation type information including recognition target speech information corresponding to each of at least one recognition target speech to be uttered sequentially by a user; (b) sequentially transmitting, by the speech recognition client, each of the at least one recognition target speech to the speech recognition server based on the conversation type information; and (c) receiving, by the speech recognition client from the speech recognition server, a speech recognition result for each of the at least one recognition target speech.

In one embodiment, each piece of recognition target speech information includes item guide text, and step (b) may include, sequentially for each of the at least one recognition target speech, the speech recognition client converting the item guide text included in the recognition target speech information corresponding to that recognition target speech into speech, outputting it, and transmitting the recognition target speech to the speech recognition server.

In one embodiment, each of the at least one piece of recognition target speech information includes a speech recognition result variable, and step (c) may include, for each of the at least one recognition target speech, the speech recognition client receiving the speech recognition result variable that corresponds to that recognition target speech and in which the result of recognizing it is stored.

In one embodiment, the conversation type information further includes result guide text containing the speech recognition result variables included in the at least one piece of recognition target speech information, and the interactive speech recognition method may further include the speech recognition client generating a result guide voice based on the result guide text and outputting the generated result guide voice.

According to another aspect of the present invention, there is provided a computer-readable recording medium on which a computer program for performing the speech recognition method described above is recorded.

According to another aspect of the present invention, there is provided an interactive speech recognition server comprising: an information receiving module that receives, from a client, conversation type information including recognition target speech information corresponding to each of at least one recognition target speech to be uttered sequentially by a user; a voice receiving module that sequentially receives each of the at least one recognition target speech from the client based on the conversation type information; a speech recognition module that performs speech recognition on each of the received at least one recognition target speech; and a result transmitting module that transmits a speech recognition result for each of the at least one recognition target speech to the client.

In one embodiment, each piece of recognition target speech information includes item guide text, and the interactive speech recognition server may further include a TTS (Text-To-Speech) module that, before each of the at least one recognition target speech is received, converts the item guide text included in the recognition target speech information corresponding to that recognition target speech into speech and transmits it to the client.

In one embodiment, each of the at least one piece of recognition target speech information includes a speech recognition result variable, and the result transmitting module may, for each of the at least one recognition target speech, store the result of recognizing the recognition target speech in the speech recognition result variable corresponding to that recognition target speech and transmit it to the client.

In one embodiment, the conversation type information further includes result guide text containing the speech recognition result variables included in the at least one piece of recognition target speech information, and the interactive speech recognition server may further include a TTS module that generates a result guide voice based on the result guide text and transmits the generated result guide voice to the client.

In one embodiment, the interactive speech recognition server further includes a vocabulary pool determination module that determines, for each piece of recognition target speech information and based on that information, the recognition target vocabulary pool of the corresponding recognition target speech, and the speech recognition module may perform speech recognition on each of the received at least one recognition target speech using the recognition target vocabulary pool of that recognition target speech.

In one embodiment, for each piece of recognition target speech information, when the vocabulary set identified by the vocabulary pool identification information included in that information is included in the conversation type information, the vocabulary pool determination module may determine that vocabulary set to be the recognition target vocabulary pool of the corresponding recognition target speech.

In one embodiment, the speech recognition server further includes a storage module, and for each piece of recognition target speech information, when the vocabulary set identified by the vocabulary pool identification information included in that information is included in the conversation type information, the storage module may store that vocabulary set in a database.

In one embodiment, for each piece of recognition target speech information, when the vocabulary set identified by the vocabulary pool identification information included in that information is not included in the conversation type information, the vocabulary pool determination module may determine, from among the vocabulary sets stored in the database, the vocabulary set identified by the vocabulary pool identification information to be the recognition target vocabulary pool of the corresponding recognition target speech.

In one embodiment, the speech recognition server further includes a lexical tree generation module that generates, for each of the at least one recognition target speech, a lexical tree for the vocabulary included in the recognition target vocabulary pool of that recognition target speech, and the speech recognition module may perform speech recognition on each of the at least one recognition target speech using the lexical tree for the recognition target vocabulary pool of that recognition target speech.

According to another aspect of the present invention, there is provided an interactive speech recognition client comprising: an information transmitting module that transmits, to a speech recognition server, conversation type information including recognition target speech information corresponding to each of at least one recognition target speech to be uttered sequentially by a user; a voice transmitting module that sequentially transmits each of the at least one recognition target speech to the speech recognition server based on the conversation type information; and a result receiving module that receives, from the speech recognition server, a speech recognition result for each of the at least one recognition target speech.

In one embodiment, each piece of recognition target speech information includes item guide text, and the interactive speech recognition client may further include a TTS module that, before each of the at least one recognition target speech is transmitted, converts the item guide text included in the recognition target speech information corresponding to that recognition target speech into speech and outputs it.

In one embodiment, each of the at least one piece of recognition target speech information includes a speech recognition result variable, and the result receiving module may, for each of the at least one recognition target speech, receive the speech recognition result variable that corresponds to that recognition target speech and in which the result of recognizing it is stored.

In one embodiment, the conversation type information further includes result guide text containing the speech recognition result variables included in the at least one piece of recognition target speech information, and the interactive speech recognition client may further include a TTS module that generates a result guide voice based on the result guide text and outputs the generated result guide voice.

According to an embodiment of the present invention, the speech recognition client designates a conversation structure specialized for the service it provides to the user and has the interactive speech recognition server perform speech recognition accordingly, so that speech recognition results suited to the service of the interactive speech recognition client can be obtained. In other words, an embodiment of the present invention can provide a speech recognition API that can be readily used by those who wish to develop a given speech recognition service operating on the interactive speech recognition client.

In addition, according to an embodiment of the present invention, the vocabulary pool used to recognize the recognition target speech is not a general dictionary but a vocabulary set limited by the client, which makes fast and accurate speech recognition possible. In ordinary speech recognition, the vocabulary used for recognition is vast, so recognition demands high processing power, and because many words resemble one another, the recognition rate drops accordingly. In the present embodiment, by contrast, the speech recognition server performs recognition using the limited vocabulary set provided by the client, so recognition is both fast and accurate.
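The effect of a restricted vocabulary can be pictured with a toy matcher. The sketch below uses string similarity from Python's standard `difflib` as a stand-in for acoustic similarity, which is a deliberate simplification; the function names, the 0.6 threshold, and the word lists are illustrative assumptions and not part of the described system.

```python
from difflib import SequenceMatcher
from typing import List


def similarity(heard: str, word: str) -> float:
    """String similarity in [0, 1], used here in place of an acoustic score."""
    return SequenceMatcher(None, heard, word).ratio()


def confusable_candidates(heard: str, vocabulary: List[str], threshold: float = 0.6) -> List[str]:
    """Words similar enough to the input to compete for the recognition result."""
    return [w for w in vocabulary if similarity(heard, w) >= threshold]


def best_match(heard: str, vocabulary: List[str]) -> str:
    """Pick the single most similar word from the vocabulary."""
    return max(vocabulary, key=lambda w: similarity(heard, w))


general_dictionary = ["seoul", "soul", "sole", "solo", "busan"]
client_vocabulary = ["seoul", "busan"]  # limited set supplied by the client
```

With a noisy input such as `"sol"`, the general list yields several competing candidates, while the client-supplied list leaves only one plausible word to choose from.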

A brief description of each drawing is provided so that the drawings cited in the detailed description of the invention may be more fully understood.
FIG. 1 is a diagram showing the schematic configuration and operating method of an interactive speech recognition system according to an embodiment of the present invention.
FIG. 2 is a block diagram showing the configuration of an interactive speech recognition server according to an embodiment of the present invention.
FIG. 3 is a block diagram showing the configuration of a speech recognition client according to an embodiment of the present invention.
FIG. 4A is a diagram showing an example of an application operating on a speech recognition client according to an embodiment of the present invention.
FIG. 4B is a diagram showing an example of conversation type information that corresponds to the application shown in FIG. 4A and includes a vocabulary set.
FIG. 4C is a diagram showing the state after a speech recognition client according to an embodiment of the present invention has filled in all the input widgets of the application shown in FIG. 4A.
FIG. 5 is a diagram showing an example of conversation type information that corresponds to the application shown in FIG. 4A and does not include a vocabulary set.
FIG. 6 is a diagram showing how a speech recognition client and server according to an embodiment of the present invention operate when the conversation type information is as shown in FIG. 4B.
FIG. 7 is a diagram showing how a speech recognition client and server according to an embodiment of the present invention operate when the conversation type information is as shown in FIG. 5.

The present invention may be modified in various ways and may have various embodiments; specific embodiments are illustrated in the drawings and described in detail in the detailed description. This is not intended to limit the present invention to particular embodiments, however, and the invention should be understood to include all modifications, equivalents, and substitutes that fall within its spirit and technical scope. In describing the present invention, detailed descriptions of related known art are omitted where they are judged likely to obscure the gist of the invention.

Terms such as "first" and "second" may be used to describe various components, but the components should not be limited by these terms. The terms are used only to distinguish one component from another.

The terms used in this application are used only to describe particular embodiments and are not intended to limit the present invention. A singular expression includes the plural unless the context clearly indicates otherwise.

In this specification, terms such as "include" or "have" are intended to indicate that a feature, number, step, operation, component, part, or combination thereof described in the specification exists, and should be understood not to exclude in advance the existence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

In addition, in this specification, when one component "transmits" data to another component, the component may transmit the data directly to the other component or may transmit it to the other component through at least one further component. Conversely, when one component "directly transmits" data to another component, the data is transmitted from the component to the other component without passing through any further component.

Hereinafter, the present invention will be described in detail with reference to the accompanying drawings, focusing on embodiments of the present invention. Like reference numerals in the drawings denote like members.

FIG. 1 is a diagram showing a schematic configuration and an operating method of an interactive voice recognition system according to an embodiment of the present invention.

Referring to FIG. 1, an interactive voice recognition system according to an embodiment of the present invention may include an interactive voice recognition server 10 and an interactive voice recognition client 20.

The interactive voice recognition server 10 and the interactive voice recognition client 20 may be connected by wire or wirelessly, and may exchange predetermined information to realize the technical idea of the present invention.

The interactive voice recognition server 10 may be a computing device that performs voice recognition on recognition target voices transmitted sequentially from the interactive voice recognition client 20 (see S2-1 to S2-n and S3-1 to S3-n of FIG. 1) and transmits the voice recognition results to the voice recognition client (see S4 of FIG. 1).

Meanwhile, the interactive voice recognition server 10 may receive predetermined conversation type information from the interactive voice recognition client 20 before performing voice recognition (see S1 of FIG. 1). The conversation type information may be information about the conversation structure between the interactive voice recognition server 10 and the interactive voice recognition client 20. Here, the conversation structure may mean information structured to define the order of a conversation, and the content or format of the voices constituting it, where the conversation consists of voices (for example, guide voices) that the voice recognition server 10 outputs to the voice recognition client 20 and the speaker's voices that the voice recognition client 20 outputs to the voice recognition server 10. Accordingly, as shown for example in FIG. 5, the conversation type information may list, in conversation order, the content of the guide voices that the voice recognition server 10 outputs to the user through the voice recognition client 20 and the format of the recognition target voices that the user utters and that are transmitted to the voice recognition server 10, so that the order and content in which the conversation between the voice recognition server 10 and the voice recognition client 20 unfolds can be determined. Using the conversation type information, the interactive voice recognition server 10 can therefore determine the structure of the conversation between the user and the server, including the guide voices to be output to the user and the recognition target voices to be input by the user.
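As an illustration only, the conversation structure described above can be modeled as an ordered list of slots, each pairing a guide prompt with the utterance expected from the user. The following Python sketch is not taken from the patent: the names `SlotInfo`, `DialogFormat`, and `turns`, and all prompt texts other than the opening announcement, are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Optional, Tuple


@dataclass
class SlotInfo:
    """One recognition target voice expected from the user (hypothetical model)."""
    guide_text: str   # prompt spoken to the user before the utterance
    pool_id: str      # identifier of the lexical pool used for recognition
    result_var: str   # variable that receives the recognition result


@dataclass
class DialogFormat:
    """Ordered conversation structure sent by the client before recognition (S1)."""
    slots: List[SlotInfo]
    opening_text: Optional[str] = None
    closing_text: Optional[str] = None
    word_sets: Dict[str, List[str]] = field(default_factory=dict)

    def turns(self) -> Iterator[Tuple[str, str]]:
        """Yield (speaker, payload) pairs in conversation order."""
        if self.opening_text:
            yield ("server", self.opening_text)
        for slot in self.slots:
            yield ("server", slot.guide_text)   # guide voice output
            yield ("user", slot.result_var)     # recognition target voice
        if self.closing_text:
            yield ("server", self.closing_text)


# Illustrative data; the slot prompts and closing text are made up.
schedule = DialogFormat(
    opening_text="Registering a new schedule.",
    slots=[
        SlotInfo("Say the month.", "Month", "month"),
        SlotInfo("Say the place.", "Place", "place"),
    ],
    closing_text="The schedule has been registered.",
)
turn_list = list(schedule.turns())
```

Iterating over `turns()` gives the server a single ordered view of when to speak and when to listen, which is the role the description assigns to the conversation type information.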

The conversation type information may be set in advance by, for example, the developer of a service (or application) so as to suit the service (or application) that the voice recognition client 20 provides to the user. That is, according to an embodiment of the present invention, instead of developers using a conversation format predetermined by the server, the voice recognition server 10 provides, in a form such as an API, individual unit functions of interactive voice recognition that can be used generally (for example, outputting a predetermined guide voice, or recognizing one unit of a word or sentence uttered by the user), and a developer who wishes to develop a service provided on the voice recognition client 20 can use these functions to compose conversation type information tailored to that service.

Meanwhile, the interactive voice recognition server 10 may include a predetermined database 30 or communicate with the database 30, and may store various data included in the conversation type information in the database 30. In this specification, a database (DB) may be implemented as at least one table, and the term may also be used to include a separate database management system (DBMS) for searching, storing, and managing the information stored in the database. The database may also be implemented in various forms such as a linked list, a tree, or a relational DB, and the term may cover any data storage medium and data structure capable of storing the information to be stored in the database 30.

The interactive voice recognition client 20 may be a computing device that receives a user's voice and runs a computer program providing a predetermined voice recognition service. The interactive voice recognition client 20 may be a computing device including a computer, a laptop, a desktop, a tablet PC, or a PDA (Personal Digital Assistant), and may also be a wireless computing device including a mobile phone, a satellite phone, a cordless phone, SIP (Session Initiation Protocol), a WLL (Wireless Local Loop) station, a smartphone, or another handheld device with wireless connectivity, or a processing device connected to another wireless modem.

FIG. 2 is a block diagram showing the configuration of the interactive voice recognition server 10 according to an embodiment of the present invention.

Referring to FIG. 2, the interactive voice recognition server 10 may include an information receiving module 110, a lexical pool determination module 120, a lexical tree generation module 130, a voice receiving module 140, a voice recognition module 150, a result transmission module 160, a TTS (Text-To-Speech) module 170, and a storage module 180.

The information receiving module 110 may receive predetermined conversation type information from the interactive voice recognition client 20.

The conversation type information may include recognition target voice information corresponding to each of at least one recognition target voice that the user of the interactive voice recognition client 20 utters sequentially through an input module (for example, a microphone) of the interactive voice recognition client. For example, a predetermined application in which the user must fill in a plurality of input fields (for example, an address book application or a schedule management application) may run on the interactive voice recognition client 20, and to fill in the input fields in order, the user may sequentially utter the recognition target voice corresponding to each input field. The voice recognition client 20 may then transmit the user's recognition target voices one after another to the interactive voice recognition server 10 so that the server performs voice recognition. The conversation type information may include recognition target voice information corresponding to each of the recognition target voices transmitted in turn to the interactive voice recognition server 10. The recognition target voice information may include item guide text, a voice recognition result variable, and/or lexical pool identification information, which are described in detail later.

The storage module 180 may store or update, in the database 30, various information included in the conversation type information and/or the recognition target voice information.

The voice receiving module 140 may receive the recognition target voice transmitted by the voice recognition client 20.

Meanwhile, the voice recognition module 150 may perform voice recognition on the recognition target voice received by the voice receiving module 140. In one embodiment, the voice recognition module 150 may perform voice recognition using a predetermined lexical pool regardless of the kind of recognition target voice.

In another embodiment, the voice recognition module 150 may perform voice recognition on each recognition target voice using a lexical pool corresponding to that voice. In this case, the lexical pool determination module 120 may determine the lexical pool corresponding to each recognition target voice. In particular, based on the recognition target voice information included in the conversation type information, the lexical pool determination module 120 may determine the lexical pool (the recognition target lexical pool) used to perform voice recognition on the recognition target voice corresponding to that recognition target voice information. The way in which the interactive voice recognition server 10 determines the recognition target lexical pool may vary according to the information included in the conversation type information or the recognition target voice information. Various examples of the conversation type information or the recognition target voice information, and the corresponding methods of determining the recognition target lexical pool, are described in detail later.

Meanwhile, in one embodiment, the interactive voice recognition server 10 may include a lexical tree generation module 130 that generates a lexical tree for the recognition target lexical pool used to recognize the recognition target voice, and the voice recognition module 150 may perform voice recognition using the lexical tree generated by the lexical tree generation module 130. Alternatively, in another embodiment, the voice recognition server 10 may perform voice recognition on the recognition target voice using a lexical pool that can be used generally for all voice recognition requests and, when a word similar to a word in the recognition target lexical pool is recognized during recognition, may be implemented to preferentially use the word included in the recognition target lexical pool. In addition, the interactive voice recognition server 10 may use various well-known voice recognition techniques in performing voice recognition. One of the technical features of the present invention concerns the way in which the voice recognition server 10 and the voice recognition client 20 perform interactive voice recognition in the manner determined by predetermined conversation type information, and the technical idea of the present invention is not limited by the voice recognition technique used by the interactive voice recognition server 10. To keep the gist of the present invention clear, a detailed description of known voice recognition techniques is therefore omitted.
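The patent does not specify how the lexical tree is built. A lexical tree is commonly implemented as a prefix tree (trie), so the following Python sketch shows that generic structure under stated assumptions: the class name `LexicalTree` is hypothetical, the tree is keyed on characters rather than on the phoneme units a real recognizer would use, and the small month pool is illustrative data.

```python
_END = object()  # sentinel key marking the end of a complete word


class LexicalTree:
    """Prefix tree over a lexical pool (units are characters, an assumption)."""

    def __init__(self, words):
        self.root = {}
        for word in words:
            self.add(word)

    def add(self, word):
        """Insert one word, sharing nodes with words that have a common prefix."""
        node = self.root
        for unit in word:
            node = node.setdefault(unit, {})
        node[_END] = word

    def _walk(self, units):
        """Follow units from the root; return the reached node or None."""
        node = self.root
        for unit in units:
            node = node.get(unit)
            if node is None:
                return None
        return node

    def contains(self, word):
        node = self._walk(word)
        return node is not None and _END in node

    def completions(self, prefix):
        """Return every pool word that starts with prefix, sorted."""
        node = self._walk(prefix)
        if node is None:
            return []
        found, stack = [], [node]
        while stack:
            current = stack.pop()
            for key, child in current.items():
                if key is _END:
                    found.append(child)
                else:
                    stack.append(child)
        return sorted(found)


# Illustrative pool; a real pool would come from the conversation type information.
tree = LexicalTree(["january", "june", "july", "march", "may"])
ju_words = tree.completions("ju")
```

Because words with a common prefix share nodes, a decoder that walks the tree only has to consider the continuations that the recognition target lexical pool actually allows.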

The result transmission module 160 may transmit to the interactive voice recognition client 20 the results that the voice recognition module obtains by performing voice recognition on the series of sequentially transmitted recognition target voices, either individually or combined all at once.
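The two delivery modes mentioned above (one result at a time, or all results combined) can be sketched as follows. This is a hypothetical illustration, not the patent's implementation: `ResultSender`, its `batch` flag, the `send` callback, and the sample values are all assumptions.

```python
class ResultSender:
    """Sends recognition results individually or as one combined message."""

    def __init__(self, send, batch=False):
        self.send = send      # callback that delivers a dict to the client
        self.batch = batch    # True: hold results until finish() is called
        self.pending = {}

    def on_recognized(self, result_var, text):
        """Record one recognition result keyed by its result variable."""
        if self.batch:
            self.pending[result_var] = text
        else:
            self.send({result_var: text})

    def finish(self):
        """Flush held results as a single message (batch mode only)."""
        if self.batch and self.pending:
            self.send(dict(self.pending))
            self.pending.clear()


# Individual delivery: one message per recognized utterance.
individual_log = []
sender = ResultSender(individual_log.append)
sender.on_recognized("month", "July")
sender.on_recognized("place", "Seoul")
sender.finish()

# Combined delivery: one message after the whole series.
batch_log = []
sender = ResultSender(batch_log.append, batch=True)
sender.on_recognized("month", "July")
sender.on_recognized("place", "Seoul")
sender.finish()
```

Individual delivery lets the client fill each input field as soon as it is recognized, while combined delivery reduces the exchange to a single message at the end of the conversation.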

Meanwhile, the conversation type information and/or the recognition target voice information may include conversion text to be converted into voice, and the TTS module 170 may perform the function of converting this conversion text into voice. In one embodiment, the conversion text may be included in the recognition target voice information. In that case the conversion text may be item guide text that tells the user which voice information to input, before the voice information corresponding to the recognition target voice information is received. In another embodiment, the conversion text may be included in the conversation type information and may be guide text about, for example, the service in which the voice the user is about to input will be used.

FIG. 3 is a block diagram showing the configuration of the interactive voice recognition client 20 according to an embodiment of the present invention.

Referring to FIG. 3, the voice recognition client may include an input module 200, an information transmission module 210, a voice transmission module 220, and a result receiving module 230, and depending on the implementation may further include a TTS module 240.

The information transmission module 210 may transmit the conversation type information to the interactive voice recognition server 10. The input module 200 may be a predetermined device for receiving the voice uttered by the user. For example, the input module 200 may consist of at least one microphone. The voice transmission module 220 may transmit the user voice received through the input module to the interactive voice recognition server 10 so that voice recognition is performed. The result receiving module 230 may receive, from the interactive voice recognition server 10, the result of the voice recognition performed on the user's voice. Meanwhile, the interactive voice recognition client 20 may further include a control module (not shown) that outputs the voice recognition result to the user or performs a predetermined control operation controlled by the voice recognition result.

Meanwhile, the conversion of conversion text into voice, such as the item guide text or the guide text about a service described above, does not necessarily have to be performed on the server side. That is, the interactive voice recognition client 20 may itself convert the conversion text into voice and output the converted voice to the user, and in such an embodiment the interactive voice recognition client 20 may further include a TTS module 240.

Depending on the embodiment of the present invention, some of the components described above may not be essential to implementing the present invention, and the interactive voice recognition server 10 and/or the interactive voice recognition client 20 may of course include more components than these.

The interactive voice recognition server 10 and/or the interactive voice recognition client 20 may be provided with the hardware resources and/or software needed to implement the technical idea of the present invention, and do not necessarily mean a single physical component or a single device. That is, the interactive voice recognition server 10 and/or the interactive voice recognition client 20 may mean a logical combination of hardware and/or software provided to implement the technical idea of the present invention and, where necessary, may be implemented as a set of logical components that are installed in devices separated from one another and each perform their own function. In addition, the interactive voice recognition server 10 and the interactive voice recognition client 20 may mean a set of components implemented separately for each function or role needed to implement the technical idea of the present invention. For example, in the case of the interactive voice recognition server 10, the information receiving module 110, the lexical pool determination module 120, the lexical tree generation module 130, the voice receiving module 140, the voice recognition module 150, the result transmission module 160, the TTS module 170, and the storage module 180 may be located in different physical devices or in the same physical device. Depending on the implementation, the software and/or hardware constituting each of these modules may also be located in different physical devices, with the components located in the different physical devices combined organically to realize the function performed by each module.

In addition, in this specification, a module may mean a functional and structural combination of hardware for carrying out the technical idea of the present invention and software for driving that hardware. For example, a module may mean a logical unit of predetermined code and the hardware resources on which the code runs, and a person skilled in the art can readily infer that it does not necessarily mean physically connected code or a single kind of hardware.

FIG. 4A is a diagram showing the UI of a schedule registration service provided by the interactive voice recognition client 20 according to an embodiment of the present invention. The schedule registration service shown in FIG. 4A may be provided by an application running on the interactive voice recognition client 20. Referring to FIG. 4A, to register a new schedule the user may fill in a date field (a21), a time field (a22), a place field (a23), and a schedule content field (a24) by voice.

FIG. 4B is a diagram showing an example of conversation type information corresponding to the schedule registration service shown in FIG. 4A. The conversation type information (DF1) shown in FIG. 4B may include a service guide text section (OP1) indicating that the service is a schedule service, a section (SF1) containing the recognition target voice information (SI11 to SI14) corresponding to each input field of the schedule registration service shown in FIG. 4A, an end guide text section (CL1) to be output to the user after all voice recognition has been performed, and a word set (WS) to be used for voice recognition of the recognition target voice corresponding to the "month" input field. Depending on the implementation, the service guide text section (OP1) and the end guide text section (CL1) are not items that must be included in the conversation type information. In addition, the conversation type information need not include a word set for every recognition target voice it covers, and some conversation type information may include no word set at all (see FIG. 5).

Meanwhile, one set of conversation consisting of the service guide text section (OP1), the end guide text section (CL1), and the section (SF1) of recognition target voice information (SI11 to SI14) may be delimited by a predetermined delimiting statement ("RecognitionType:Dialog" in the example of FIG. 4B), and each word set may likewise be delimited by its own delimiting statement ("RecognitionType:Month" in the example of FIG. 4B). The service guide text section (OP1), the end guide text section (CL1), and the section (SF1) containing the recognition target voice information (SI11 to SI14) may also be delimited by delimiter marks such as [OPENING], [CLOSING], and [SLOT-FILLING], respectively. The interactive voice recognition server 10 may use these delimiting statements and delimiter marks to analyze (parse) the conversation type information, determine the conversation structure, and take the corresponding action (text-to-speech conversion or voice recognition).
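The parsing step described above can be sketched in Python as follows. The delimiting statements ("RecognitionType:Dialog", "RecognitionType:Month") and the delimiter marks ([OPENING], [SLOT-FILLING], [CLOSING]) come from the description, and the "Speak:" and "Pool: <variable>" line forms follow the list box example given elsewhere in this description. Everything else is an assumption made for illustration: the exact script layout of FIG. 4B is not reproduced here, word sets are assumed to be listed one word per line, and `parse_dialog_format` and the sample script are hypothetical.

```python
import re

# A slot line such as "Month: <month>": lexical pool id, then result variable.
SLOT_LINE = re.compile(r"^(\w+):\s*<(\w+)>$")


def parse_dialog_format(script):
    """Split a conversation type script into sections, slots, and word sets."""
    dialog = {"opening": [], "slots": [], "closing": [], "word_sets": {}}
    block = None     # name following "RecognitionType:"
    section = None   # current [SECTION] inside the Dialog block
    prompt = None    # latest "Speak:" text waiting for its slot line
    for raw in script.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("RecognitionType:"):
            block = line.split(":", 1)[1].strip()
            section = None
            if block != "Dialog":
                dialog["word_sets"][block] = []
            continue
        if block is None:
            continue  # ignore anything before the first delimiting statement
        if block != "Dialog":
            dialog["word_sets"][block].append(line)  # one word per line (assumed)
            continue
        if line.startswith("[") and line.endswith("]"):
            section, prompt = line[1:-1], None
            continue
        if line.startswith("Speak:"):
            text = line.split(":", 1)[1].strip()
            if section == "OPENING":
                dialog["opening"].append(text)
            elif section == "CLOSING":
                dialog["closing"].append(text)
            else:
                prompt = text
            continue
        match = SLOT_LINE.match(line)
        if match and section == "SLOT-FILLING":
            dialog["slots"].append({"prompt": prompt,
                                    "pool_id": match.group(1),
                                    "result_var": match.group(2)})
            prompt = None
    return dialog


# Hypothetical script; prompts and words are illustrative data.
SAMPLE = """
RecognitionType:Dialog
[OPENING]
Speak: Registering a new schedule.
[SLOT-FILLING]
Speak: Say the month.
Month: <month>
Speak: Say the place.
Place: <place>
[CLOSING]
Speak: The schedule has been registered.
RecognitionType:Month
January
February
"""
parsed = parse_dialog_format(SAMPLE)
```

After parsing, the server knows which texts to convert to speech, in which order to expect utterances, and which word sets arrived inline with the script.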

Meanwhile, each item of recognition target voice information (SI11 to SI14) may include a result variable in which the voice recognition result for the corresponding recognition target voice is to be stored, and/or identification information of the lexical pool to be used for voice recognition of that recognition target voice. For example, the recognition target voice information corresponding to the "month" input field may include the lexical pool identification information Month and the result variable <month>.

The conversation type information (DF1) may include the word set (WS) identified by the lexical pool identification information "Month" ("일월" (January), "이월" (February), …, "십이월" (December)). In this case, the interactive voice recognition server 10 that has received the conversation type information (DF1) may use the word set (WS) to perform voice recognition on the recognition target voice corresponding to "month".

In the present invention, the conversation type information may take the form of a script as in FIG. 4B, but is not necessarily limited to this. For example, the voice recognition client 20 may encode predetermined information for achieving the technical idea of the present invention and transmit the encoded information to the voice recognition server 10 as the conversation type information.

FIG. 4C is a diagram showing the state after the voice recognition client 20 according to an embodiment of the present invention has received the voice recognition results from the voice recognition server 10 and filled in all the input fields of the application shown in FIG. 4A.

The schedule registration service shown in FIG. 4A contains only text input fields. A person of ordinary skill in the art to which the present invention pertains will readily understand that, depending on the service, various other kinds of input field such as radio boxes, check boxes, confirmation buttons, and list boxes can also be filled in by voice, and that conversation type information corresponding to such a service can exist. For example, if the UI of the service provided by the voice recognition client 20 includes a list box in which one of "novel", "poetry collection", "self-help book", and "textbook" must be selected, the conversation type information corresponding to the service may include recognition target voice information of the following form:

『Speak: Please select one of novel, poetry collection, self-help book, or textbook.

Book: <book>』

Meanwhile, the conversation type information may include no word set at all, as shown in FIG. 5. The conversation type information shown in FIG. 5 is the same as that shown in FIG. 4B except that it includes no word set, so a detailed description is omitted.

FIG. 6 is a diagram showing how the voice recognition client and server according to an embodiment of the present invention operate when the conversation type information is as shown in FIG. 4B.

Referring to FIG. 6, when the information receiving module 110 of the interactive voice recognition server 10 receives the conversation type information (DF1) transmitted by the information transmission module 210 of the interactive voice recognition client 20 (S61), a parsing module (not shown) of the voice recognition server 10 may analyze (parse) the conversation type information (DF1) and extract the service guide text section (OP1), the end guide text section (CL1), the recognition target voice information (SI11 to SI15), and the word set (WS) (S62).

Thereafter, the lexical pool determination module 120 of the interactive voice recognition server 10 may determine that the word set (WS) is identified by the lexical pool identification information "Month", and may set the word set (WS) as the recognition target lexical pool for the recognition target voice corresponding to the recognition target voice information (SI11) that contains "Month" (S63).

Depending on the embodiment, the storage module 120 of the interactive voice recognition server 10 may also store the lexical set WS in the database 30 so that it is identified by the lexical pool identification information "Month" (S64).

Meanwhile, when the lexical set identified by the lexical pool identification information included in a piece of recognition target voice information is not included in the conversation type information DF1 (for example, "Date", "Time", "Place", and "Memo"), the lexical pool determination module 120 may check whether a lexical set identified by that lexical pool identification information exists in the database 30 and, if it does, determine the lexical set stored in the database as the recognition target lexical pool of the recognition target voice corresponding to the recognition target voice information. However, the invention is not limited thereto. When the lexical set identified by the lexical pool identification information is not included in the conversation type information DF1, the lexical pool determination module 120 may instead determine a general vocabulary dictionary as the recognition target lexical pool of the recognition target voice corresponding to the recognition target voice information.
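The pool selection described above (steps S63 and S64 plus the database and general-dictionary fallbacks) can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `determine_pool`, the use of plain dictionaries for the conversation type information and the database 30, and the placeholder `GENERAL_DICTIONARY` are all assumptions introduced here. The sketch also tries the database before the general dictionary, which is only one of the orderings the description allows.

```python
from typing import Dict, List

# Hypothetical stand-in for the general vocabulary dictionary.
GENERAL_DICTIONARY: List[str] = ["<general vocabulary>"]


def determine_pool(pool_id: str,
                   inline_sets: Dict[str, List[str]],
                   database: Dict[str, List[str]],
                   persist: bool = True) -> List[str]:
    """Pick the recognition target lexical pool for one input field."""
    # Case 1: the conversation type information carries the set itself (S63).
    if pool_id in inline_sets:
        words = inline_sets[pool_id]
        if persist:
            # Optional step S64: store the set under its identifier.
            database[pool_id] = words
        return words
    # Case 2: a set with this identifier was stored in the database earlier.
    if pool_id in database:
        return database[pool_id]
    # Case 3: fall back to the general dictionary.
    return GENERAL_DICTIONARY


# Example: "Month" is supplied inline, "Date" exists only in the database,
# and "Memo" is unknown, so it falls back to the general dictionary.
inline = {"Month": ["January", "February", "March"]}
db = {"Date": ["1st", "12th"]}
print(determine_pool("Month", inline, db))
print(determine_pool("Date", inline, db))
print(determine_pool("Memo", inline, db))
```

Because the inline set is persisted under its identifier, a later conversation that names "Month" without supplying the set would resolve it through the second case.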

Since the conversation type information DF1 includes the service guide text section OP1, the TTS module 170 of the interactive voice recognition server 10 may convert "Registering a new schedule" into voice (S65) and transmit it to the interactive voice recognition client 20 (S66). The interactive voice recognition client 20 may then output it through a predetermined output module (for example, a speaker or earphones).

Thereafter, the text-to-speech conversion and voice recognition processes corresponding to "month", "day", "time", "place", and "schedule content" included in the conversation type information DF1 may be performed sequentially.

For example, the TTS module 170 of the interactive voice recognition server 10 may convert "Please say the month" into voice (S67) and transmit it to the interactive voice recognition client 20 (S68). The interactive voice recognition client 20 may then output it through a predetermined output module (for example, a speaker or earphones). Thereafter, when the user inputs the voice "February" (Korean "이월") for the "month" input field, the voice transmission module 220 of the voice recognition client 20 may transmit the voice "February" to the voice receiving module 140 of the voice recognition server 10 (S69), and the voice recognition module 150 of the voice recognition server 10 may perform voice recognition on it. Here, since the lexical set WS to be used for recognizing the recognition target voice corresponding to the "month" input field is included in the conversation type information, the voice recognition module 150 may recognize "February" using the lexical set WS and store the voice recognition result in the result variable <month> corresponding to the "month" input field (S71).

Meanwhile, depending on the implementation, the interactive voice recognition server 10 may have the user re-enter the voice when the recognition confidence of the voice uttered by the user is at or below a predetermined threshold. For example, when the recognition confidence of the result produced by the voice recognition module 150 is at or below the predetermined threshold, the interactive voice recognition server 10 may transmit a predetermined re-entry guide message (for example, "Please enter it again") to the voice recognition client through the TTS module 170. The re-entry guide message may be preset in the interactive voice recognition server 10 or may be included in the conversation type information. Depending on the implementation, the re-entry guide message may instead be generated and output on the voice recognition client side. Thereafter, the voice receiving module 140 may receive the recognition target voice uttered again by the user, and the voice recognition module 150 may perform voice recognition on it again.
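The re-entry behavior can be sketched as a small retry loop. The description only states that a result at or below a threshold triggers a re-entry prompt. The threshold value, the cap on attempts (`max_attempts`), the function name `recognize_with_retry`, and the recognizer signature are assumptions made for this illustration.

```python
from typing import Callable, Iterator, List, Optional, Tuple

# Hypothetical recognizer signature: (audio, lexical pool) -> (text, confidence).
Recognizer = Callable[[bytes, List[str]], Tuple[str, float]]


def recognize_with_retry(utterances: Iterator[bytes],
                         recognize: Recognizer,
                         pool: List[str],
                         send_prompt: Callable[[str], None],
                         threshold: float = 0.6,
                         max_attempts: int = 3,
                         retry_message: str = "Please enter it again") -> Optional[str]:
    """Return the first result whose confidence exceeds the threshold."""
    for attempt in range(max_attempts):
        audio = next(utterances, None)
        if audio is None:
            return None  # the user stopped speaking
        text, confidence = recognize(audio, pool)
        # A confidence at or below the threshold counts as a failed attempt.
        if confidence > threshold:
            return text
        if attempt < max_attempts - 1:
            send_prompt(retry_message)  # re-entry guide message
    return None


def fake_recognize(audio: bytes, pool: List[str]) -> Tuple[str, float]:
    # Stand-in recognizer: pretend very short audio is too noisy to trust.
    return ("February", 0.9) if len(audio) > 2 else ("?", 0.1)


prompts: List[str] = []
result = recognize_with_retry(iter([b"x", b"clear"]), fake_recognize,
                              ["February"], prompts.append)
print(result, prompts)
```

Passing `send_prompt` as a callable keeps the sketch neutral on whether the message is synthesized on the server or generated on the client, both of which the description permits.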

Thereafter, the interactive voice recognition server 10 may likewise convert the item guide text for each of "day", "time", "place", and "schedule content" included in the conversation type information DF1 into voice and transmit it to the interactive voice recognition client 20, and may receive the user's recognition target voice from the interactive voice recognition client 20 (for example, "the twelfth" for "day", "nine o'clock" for "time", "Hongik University Station" for "place", and "meeting" for "content") and perform voice recognition. However, the conversation type information DF1 does not include lexical sets for "day", "time", "place", and "schedule content"; that is, the lexical sets identified by Date, Time, Place, and Memo are not included in the conversation type information DF1. The interactive voice recognition server 10 may therefore perform voice recognition using the general dictionary stored in the database 30, or may check whether lexical pools identified by Date, Time, Place, and Memo are stored in the database 30 and use the stored lexical pools. In addition, the interactive voice recognition server 10 may store the voice recognition results in <date>, <time>, <place>, and <memo>, respectively (see S72 to S74).

Thereafter, since voice recognition has been performed on all recognition target voices corresponding to the recognition target voice information SI11 to SI15 included in the conversation type information DF1, the result transmission module 160 of the interactive voice recognition server 10 may transmit the voice recognition results (<month=February>, <date=12th>, <time=9 a.m.>, <place=Hongik University Station>, <memo=meeting>) to the interactive voice recognition client 20.
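The field-by-field exchange of FIG. 6 can be summarized as a loop that speaks each item guide text, takes the next utterance, and stores the recognized value in the field's result variable. The sketch below is illustrative only: the `FieldSpec` layout and the `speak`, `listen`, and `recognize` callables are assumptions standing in for the TTS module, the voice receiving module, and the voice recognition module.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass
class FieldSpec:
    """One piece of recognition target voice information (hypothetical layout)."""
    guide_text: str   # item guide text spoken to the user
    pool_id: str      # lexical pool identification information
    variable: str     # voice recognition result variable


def run_dialogue(fields: Iterable[FieldSpec],
                 speak: Callable[[str], None],
                 listen: Callable[[], bytes],
                 recognize: Callable[[bytes, str], str]) -> Dict[str, str]:
    """Prompt for each field in order and collect the recognized values."""
    results: Dict[str, str] = {}
    for spec in fields:
        speak(spec.guide_text)   # TTS output of the item guide text
        audio = listen()         # next recognition target voice
        results[spec.variable] = recognize(audio, spec.pool_id)
    return results


# Scripted stand-ins: the "audio" is UTF-8 text and recognition just decodes it.
fields = [
    FieldSpec("Please say the month", "Month", "month"),
    FieldSpec("Please say the day", "Date", "date"),
    FieldSpec("Please say the time", "Time", "time"),
]
spoken: List[str] = []
answers = iter([b"February", b"12", b"9 a.m."])
results = run_dialogue(fields, spoken.append, lambda: next(answers),
                       lambda audio, pool_id: audio.decode("utf-8"))
print(results)
```

The returned dictionary plays the role of the result variables that the result transmission module sends back once every field has been recognized.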

In addition, since the conversation type information DF1 contains the result guide text section CL1, the TTS module may generate the voice "A meeting schedule has been registered for February 12 at 9 a.m. at Hongik University Station" based on the result guide text ("A <memo> schedule has been registered for <month> <date> at <time> at <place>") and transmit it to the voice recognition client 20.
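Filling the result guide text amounts to replacing each result variable with its recognized value before the text is passed to TTS. A minimal sketch, assuming the variables are written as `<name>` as in the description. The function name `fill_result_text` and the English template are illustrative, and unknown variables are simply left in place.

```python
import re
from typing import Dict


def fill_result_text(template: str, results: Dict[str, str]) -> str:
    """Replace each <variable> in the template with its recognized value."""
    def substitute(match: "re.Match[str]") -> str:
        name = match.group(1)
        # Leave the placeholder untouched if no result exists for it.
        return results.get(name, match.group(0))

    # Tolerate stray spaces inside the brackets, e.g. "<memo >".
    return re.sub(r"<\s*(\w+)\s*>", substitute, template)


template = ("A <memo > schedule has been registered for "
            "<month> <date> at <time> at <place>.")
results = {
    "month": "February",
    "date": "12",
    "time": "9 a.m.",
    "place": "Hongik University Station",
    "memo": "meeting",
}
print(fill_result_text(template, results))
```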

FIG. 7 is a diagram illustrating how the voice recognition client and server operate according to an embodiment of the present invention when the conversation type information is as shown in FIG. 5.

When the information receiving module 110 of the interactive voice recognition server 10 receives the conversation type information DF2 transmitted by the information transmission module 210 of the interactive voice recognition client 20 (S81), a parsing module (not shown) of the voice recognition server 10 may analyze (parse) the conversation type information DF2 to extract the service guide text section OP1, the end guide text section CL1, and the recognition target voice information SI21 to SI25 (S82).

Thereafter, the lexical pool determination module 120 of the interactive voice recognition server 10 may determine that the lexical set WS previously stored in the database 30 (see S64 of FIG. 6) is identified by the lexical pool identification information "Month" (S84), and may set the lexical set WS as the recognition target lexical pool of the recognition target voice corresponding to the recognition target voice information SI21, which contains "Month" (S63).

For the remaining identifiers Date, Time, Place, and Memo, if a lexical set identified by the identifier exists in the database, the lexical set stored in the database is set as the recognition target lexical pool; otherwise, the general dictionary may be set as the recognition target lexical pool.

The subsequent text-to-speech conversion and voice recognition processes corresponding to "month", "day", "time", "place", and "schedule content" included in the conversation type information DF2 are similar to those described with reference to FIG. 6, so a detailed description is omitted.

According to an embodiment of the present invention, the conversation type information used for interactive voice recognition is generated by the interactive voice recognition client 20, and the interactive voice recognition server 10 performs voice recognition according to the conversation structure specified by the client 20. Therefore, according to this embodiment, the voice recognition client 20 can specify a conversation structure specialized for the service it provides to the user and have the interactive voice recognition server 10 perform voice recognition accordingly, thereby obtaining voice recognition results suited to the service of the interactive voice recognition client 20. In other words, an embodiment of the present invention can provide a voice recognition API that anyone wishing to develop a voice recognition service running on the interactive voice recognition client 20 can readily use.

In addition, according to an embodiment of the present invention, the lexical pool used to recognize a recognition target voice is not a general dictionary but a lexical set restricted by the client 20, which enables fast and accurate voice recognition. In conventional voice recognition, the vocabulary used for recognition is vast, so recognition requires high processing power, and the recognition rate drops because many words resemble one another. In general, the difficulty of voice recognition is known to increase logarithmically with the size of the lexical set that must be searched. According to this embodiment, however, the voice recognition module 150 performs voice recognition using the limited lexical set provided by the client, and can therefore recognize voice quickly and accurately.

In addition, with the interactive voice recognition method according to an embodiment of the present invention, the user of the voice recognition client 20 does not need to look at the fields to be filled in; the user can listen to information about each field and enter its value by voice. Accordingly, the interactive voice recognition method according to an embodiment of the present invention can be particularly useful in services (or applications) for the visually impaired.

Meanwhile, the interactive voice recognition method according to an embodiment of the present invention may be implemented in the form of computer-readable program instructions and stored in a computer-readable recording medium. A computer-readable recording medium includes every kind of recording device that stores data readable by a computer system.

The program instructions recorded on the recording medium may be specially designed and constructed for the present invention, or may be known and available to those skilled in the software art.

Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. The medium may also be a transmission medium, such as an optical or metal line or a waveguide, that includes a carrier wave transmitting signals specifying program instructions, data structures, and the like. A computer-readable recording medium may also be distributed across networked computer systems so that computer-readable code is stored and executed in a distributed manner.

Examples of program instructions include not only machine code such as that produced by a compiler, but also high-level language code that can be executed, using an interpreter or the like, by a device that processes information electronically, for example a computer.

The hardware devices described above may be configured to operate as one or more software modules to perform the operations of the present invention, and vice versa.

The foregoing description of the present invention is given by way of illustration, and those of ordinary skill in the art to which the present invention pertains will understand that it can readily be modified into other specific forms without changing the technical spirit or essential features of the present invention.

The embodiments described above should therefore be understood as illustrative in all respects and not restrictive. For example, each component described as a single unit may be implemented in a distributed manner, and components described as distributed may likewise be implemented in a combined form.

The scope of the present invention is defined by the appended claims rather than by the detailed description above, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be construed as falling within the scope of the present invention.

Claims

An interactive voice recognition method comprising:
(a) receiving, by a voice recognition server, conversation type information from a client, the conversation type information being information on a conversation structure between the voice recognition server and the client and including recognition target voice information corresponding to each of at least one recognition target voice to be uttered sequentially by a user;
(b) receiving, by the voice recognition server, each of the at least one recognition target voice from the client based on the conversation type information, and performing voice recognition on it; and
(c) transmitting, by the voice recognition server, a voice recognition result for each of the at least one recognition target voice to the client.

The method of claim 1,
wherein each of the recognition target voice information includes an item guide text, and
step (b) comprises:
before receiving each of the at least one recognition target voice, converting, by the voice recognition server, the item guide text included in the recognition target voice information corresponding to that recognition target voice into voice and transmitting it to the client, and then receiving the recognition target voice.

The method of claim 1,
wherein each of the at least one recognition target voice information includes a voice recognition result variable, and
step (c) comprises:
storing, by the voice recognition server, for each of the at least one recognition target voice, the voice recognition result for the recognition target voice in the voice recognition result variable corresponding to the recognition target voice, and transmitting it to the client.

The method of claim 3,
wherein the conversation type information further includes a result guide text containing the voice recognition result variables included in the at least one recognition target voice information, and
the interactive voice recognition method further comprises:
generating a result guide voice based on the result guide text, and transmitting the generated result guide voice to the client.

The method of claim 1,
further comprising:
(d) before step (b), determining, by the voice recognition server, for each of the recognition target voice information and based on that recognition target voice information, a recognition target lexical pool of the recognition target voice corresponding to the recognition target voice information,
wherein step (b) comprises:
performing, by the voice recognition server, voice recognition on each received recognition target voice using the recognition target lexical pool of that recognition target voice.

The method of claim 5,
wherein each of the at least one recognition target voice information includes lexical pool identification information identifying the recognition target lexical pool of the recognition target voice corresponding to the recognition target voice information,
the conversation type information further includes at least one lexical set, and
each of the at least one lexical set is identified by one of the lexical pool identification information.

The method of claim 6,
wherein step (d) comprises:
determining, by the voice recognition server, for each of the recognition target voice information, when the lexical set identified by the lexical pool identification information included in the recognition target voice information is included in the conversation type information, that lexical set as the recognition target lexical pool of the recognition target voice corresponding to the recognition target voice information.

The method of claim 6,
further comprising:
storing, by the voice recognition server, for each of the recognition target voice information, the lexical set in a database when the lexical set identified by the lexical pool identification information included in the recognition target voice information is included in the conversation type information,
wherein step (d) comprises:
determining, by the voice recognition server, for each of the recognition target voice information, when the lexical set identified by the lexical pool identification information included in the recognition target voice information is not included in the conversation type information, the lexical set identified by that lexical pool identification information as the recognition target lexical pool of the recognition target voice corresponding to the recognition target voice information.

An interactive voice recognition method comprising:
(a) transmitting, by a voice recognition client, conversation type information to a voice recognition server, the conversation type information being information on a conversation structure between the voice recognition server and the client and including recognition target voice information corresponding to each of at least one recognition target voice to be uttered sequentially by a user;
(b) sequentially transmitting, by the voice recognition client, each of the at least one recognition target voice to the voice recognition server based on the conversation type information; and
(c) receiving, by the voice recognition client, a voice recognition result for each of the at least one recognition target voice from the voice recognition server.

The method of claim 9,
wherein each of the recognition target voice information includes an item guide text, and
step (b) comprises:
for each of the at least one recognition target voice, converting, by the voice recognition client, the item guide text included in the recognition target voice information corresponding to that recognition target voice into voice and outputting it, and then transmitting the recognition target voice to the voice recognition server.

The method of claim 9,
wherein each of the at least one recognition target voice information includes a voice recognition result variable, and
step (c) comprises:
receiving, by the voice recognition client, for each of the at least one recognition target voice, the voice recognition result variable that corresponds to the recognition target voice and in which the voice recognition result for the recognition target voice is stored.

A computer-readable recording medium having recorded thereon a computer program for performing the method according to any one of claims 1 to 11.

An interactive voice recognition server comprising:
an information receiving module that receives conversation type information from a client, the conversation type information being information on a conversation structure between the interactive voice recognition server and the client and including recognition target voice information corresponding to each of at least one recognition target voice to be uttered sequentially by a user;
a voice receiving module that sequentially receives each of the at least one recognition target voice from the client based on the conversation type information;
a voice recognition module that performs voice recognition on each of the at least one recognition target voice; and
a result transmission module that transmits a voice recognition result for each of the at least one recognition target voice to the client.

The interactive voice recognition server of claim 13,
wherein each of the recognition target voice information includes an item guide text, and
the interactive voice recognition server
further comprises a TTS module that, before each of the at least one recognition target voice is received, converts the item guide text included in the recognition target voice information corresponding to that recognition target voice into voice and transmits it to the client.

The interactive voice recognition server of claim 13,
wherein each of the at least one recognition target voice information includes a voice recognition result variable, and
the result transmission module,
for each of the at least one recognition target voice, stores the voice recognition result for the recognition target voice in the voice recognition result variable corresponding to the recognition target voice and transmits it to the client.

The interactive voice recognition server of claim 15,
wherein the conversation type information further includes a result guide text containing the voice recognition result variables included in the at least one recognition target voice information, and
the interactive voice recognition server
further comprises a TTS module that generates a result guide voice based on the result guide text and transmits the generated result guide voice to the client.

The interactive voice recognition server of claim 13,
wherein the interactive voice recognition server
further comprises a lexical pool determination module that determines, for each of the recognition target voice information, a recognition target lexical pool of the recognition target voice corresponding to the recognition target voice information, and
the voice recognition module
performs voice recognition on each received recognition target voice using the recognition target lexical pool of that recognition target voice.

The interactive voice recognition server of claim 17,
wherein each of the at least one recognition target voice information includes lexical pool identification information identifying the recognition target lexical pool of the recognition target voice corresponding to the recognition target voice information,
the conversation type information further includes at least one lexical set, and
each of the at least one lexical set is identified by one of the lexical pool identification information.

The interactive voice recognition server of claim 18,
wherein the lexical pool determination module,
for each of the recognition target voice information, when the lexical set identified by the lexical pool identification information included in the recognition target voice information is included in the conversation type information, determines that lexical set as the recognition target lexical pool of the recognition target voice corresponding to the recognition target voice information.

The interactive voice recognition server of claim 18,
further comprising a storage module,
wherein the storage module,
for each of the recognition target voice information, stores the lexical set in a database when the lexical set identified by the lexical pool identification information included in the recognition target voice information is included in the conversation type information, and
the lexical pool determination module,
for each of the recognition target voice information, when the lexical set identified by the lexical pool identification information included in the recognition target voice information is not included in the conversation type information, determines the lexical set identified by that lexical pool identification information among the lexical sets stored in the database as the recognition target lexical pool of the recognition target voice corresponding to the recognition target voice information.

An interactive voice recognition client comprising:
an information transmission module that transmits conversation type information to a voice recognition server, the conversation type information being information on a conversation structure between the voice recognition server and the interactive voice recognition client and including recognition target voice information corresponding to each of at least one recognition target voice to be uttered sequentially by a user;
a voice transmission module that sequentially transmits each of the at least one recognition target voice to the voice recognition server based on the conversation type information; and
a result receiving module that receives a voice recognition result for each of the at least one recognition target voice from the voice recognition server.

The interactive voice recognition client of claim 21,
wherein each of the recognition target voice information includes an item guide text, and
the interactive voice recognition client
further comprises a TTS module that, before each of the at least one recognition target voice is transmitted, converts the item guide text included in the recognition target voice information corresponding to that recognition target voice into voice and outputs it.

The interactive voice recognition client of claim 21,
wherein each of the at least one recognition target voice information includes a voice recognition result variable, and
the result receiving module,
for each of the at least one recognition target voice, receives the voice recognition result variable that corresponds to the recognition target voice and in which the voice recognition result for the recognition target voice is stored.