KR100777569B1

KR100777569B1 - Speech recognition method and apparatus using multimodal input

Info

Publication number: KR100777569B1
Application number: KR1020060091429A
Authority: KR
Inventors: 김원우; 박성준
Original assignee: 주식회사 케이티
Priority date: 2006-09-20
Filing date: 2006-09-20
Publication date: 2007-11-20

Abstract

A multimodal speech recognition method and apparatus are provided that ensure a high speech recognition success rate not only on high-performance computing devices but also on low-spec portable terminals, by performing recognition based on a consonant and a user's utterance that have been multimodally synchronized. The method comprises: receiving a consonant from a user together with the user's utterance, and synchronizing the consonant with the utterance (300, 301); extracting vocabulary items that begin with the synchronized consonant from a pronunciation dictionary (302); assembling the extracted vocabulary items into a speech recognition target dictionary (303); and searching the generated dictionary for the matching vocabulary item to recognize the word or sentence corresponding to the user's utterance (304).

Description

Speech Recognition Method and Apparatus Using Multimodal Input

FIG. 1 is a block diagram of an embodiment of a terminal to which the present invention is applied.

FIG. 2 is an explanatory diagram of an embodiment showing the speech recognition process used in the present invention.

FIG. 3 is a flowchart of an embodiment of the multimodal speech recognition method according to the present invention.

* Explanation of reference numerals for the main parts of the drawings

10: key button    20: microphone

30: character processing unit    40: speech processing unit

50: multimodal synchronization unit    60: speech recognition unit

70: recognition DB

The present invention relates to a speech recognition method and apparatus, and more particularly to a multimodal speech recognition method and apparatus in which, when a user enters a character through a key button or the like and utters speech through a microphone or the like, the speech recognition targets are narrowed to vocabulary items beginning with that character in order to recognize the speech uttered by the user.

Multimodal means that several kinds of input/output modes are used in the interface between a person and a machine; for multimodal input, speech recognition, a keyboard, a pen, sensors, and the like can be used together. In the present invention, the user enters specific information with both hand and mouth, for example by pressing a key button to enter a character and speaking into a microphone. In particular, for speech recognition in the present invention, the user enters the starting character of the vocabulary item corresponding to the intended utterance with a key button and, at the same time, speaks the utterance into the microphone.

Recently, in line with the digital convergence trend, terminals equipped with a variety of functions have been released, and to use a particular function the user must enter some information through the terminal's interface.

For example, on a navigation terminal, the user presses key buttons to enter a word such as a building name or place name and is provided with traffic information from the starting point to the destination.

As another example, to look up the schedule for a specific date on a PDA or the like, the user selects an icon on the LCD screen with a pen tool to obtain the desired information.

As yet another example, because the small size of portable terminals makes the key button and pen tool input methods above cumbersome for the user, there is also a technique that displays a virtual keyboard on the LCD screen so that the user can enter characters conveniently.

Recently, as terminal performance has improved, terminals equipped with a speech recognition function (hereinafter "speech recognition devices") have been actively studied in order to overcome the limitations of the text-based input methods described above [e.g., several key presses must be combined to complete a single word; clicking the LCD screen with a pen tool is not easy; and text entry is cumbersome and inaccurate, particularly in environments where the eyes and hands are not free, such as while moving or driving]. Such a speech recognition device recognizes the speech uttered by the user and processes information according to the recognition result.

The most important issue for such a speech recognition device is how well it can recognize the user's speech [the speech recognition success rate]. With current technology, a 100% success rate cannot be guaranteed even on high-performance computing devices, and portable terminals show an even lower success rate.

Meanwhile, as prior art for raising the speech recognition success rate, there is Korean Patent No. 474253 (title of invention: "Speech recognition method using utterance of the first consonant of a word, and recording medium storing the same").

This prior art discloses a scheme in which the user is asked to first utter the first consonant of the word corresponding to the intended utterance and then utter the actual word; the speech recognition targets are narrowed to vocabulary items beginning with the uttered first consonant, and the actual speech uttered afterward is then recognized.

However, although this prior art can shorten the speech recognition processing time, because narrowing the targets to vocabulary items beginning with the first consonant reduces the number of word lookups in the recognition DB, it falls short of ensuring the speech recognition success rate.

For instance, in speech recognition, finding the starting point of the word to be recognized largely determines the success rate. Considering that, owing to the characteristics of Korean, the syllable onset is in most cases a consonant, and often an unvoiced one, the first consonant of a word cannot be reliably recognized from speech.

In more detail, speech recognition technology finds the starting point of the speech to be recognized through speech energy detection and then extracts feature vectors from the user's utterance to perform recognition. Whereas voiced sounds carry the vibration energy of the vocal cords, unvoiced sounds are produced as air from the lungs passes through the articulatory organs (lips, teeth, tongue, nasal cavity, oral cavity, and so on). Detecting low-energy unvoiced sounds, particularly "ㄱ", "ㅅ", and "ㅎ", by speech recognition is therefore very difficult in practice.
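The energy-based start-point search described above can be illustrated with a minimal Python sketch. The frame length, threshold, and sample values are illustrative assumptions, not values from the patent, and the function names are hypothetical. The sketch only shows why a low-energy onset can fall below a fixed energy threshold.

```python
def frame_energies(samples, frame_len):
    """Sum of squared amplitudes for each non-overlapping frame."""
    return [
        sum(s * s for s in samples[i:i + frame_len])
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def find_start_frame(samples, frame_len, threshold):
    """Index of the first frame whose energy reaches the threshold, or None."""
    for index, energy in enumerate(frame_energies(samples, frame_len)):
        if energy >= threshold:
            return index
    return None


# Toy signal (hypothetical values): silence, a weak unvoiced onset, a strong vowel.
signal = [0.0] * 4 + [0.1] * 4 + [1.0] * 4
start = find_start_frame(signal, frame_len=4, threshold=1.0)
```

In this toy signal the weak onset occupies frame 1, but the detector reports frame 2, so the onset is cut off before feature extraction begins.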

The present invention has been proposed to solve the above problems and meet the above needs. Its object is to provide a multimodal speech recognition method and apparatus in which, when a user enters a character through a key button or the like and utters speech through a microphone or the like, the apparatus prepares to find the starting point of the user's speech with reference to the time of the character input, narrows the speech recognition targets to vocabulary items beginning with that character, and recognizes the speech uttered by the user.

To achieve the above object, the method of the present invention is a speech recognition method using multimodal input, comprising: when a consonant is received from a user and the user's utterance is received at the same time, synchronizing the consonant with the utterance; extracting vocabulary items beginning with the synchronized consonant from a pronunciation dictionary; grouping the extracted vocabulary items to generate a speech recognition target dictionary; and, with the speech recognition target dictionary generated, searching it for the vocabulary item matching the synchronized utterance to recognize the word or sentence corresponding to the utterance.

Meanwhile, the apparatus of the present invention is a speech recognition apparatus using multimodal input, comprising: a character processing unit that detects a text signal entered by the user, processes the corresponding consonant as a character, and passes it to a multimodal synchronization unit; a speech processing unit that detects a speech signal entered by the user, performs signal processing on it, and passes it to the multimodal synchronization unit; the multimodal synchronization unit, which synchronizes the consonant received from the character processing unit with the user's utterance received from the speech processing unit; a recognition DB in which a pronunciation dictionary is stored; and a speech recognition unit that extracts vocabulary items beginning with the synchronized consonant from the pronunciation dictionary, groups them to generate a speech recognition target dictionary, and then searches that dictionary for the matching vocabulary item to recognize the word or sentence corresponding to the synchronized utterance.

The above objects, features, and advantages will become more apparent from the following detailed description taken together with the accompanying drawings, so that a person of ordinary skill in the art to which the present invention pertains will be able to practice its technical idea without difficulty. In describing the present invention, detailed descriptions of related known technology are omitted where they would unnecessarily obscure the gist of the invention. A preferred embodiment of the present invention is described in detail below with reference to the accompanying drawings.

FIG. 1 is a block diagram of an embodiment of a terminal to which the present invention is applied.

As shown in FIG. 1, the terminal to which the present invention is applied, that is, the speech recognition device, includes a key button (10), a microphone (20), a character processing unit (30), a speech processing unit (40), a multimodal synchronization unit (50), a speech recognition unit (60), and a recognition DB (70).

In describing the present invention, the key button (10) is used as the example of a text-based information input means (interface), but any text input means, such as a pen tool (11), a virtual keyboard (12), a digitizer (not shown), or a haptic device (not shown), can be used. Text input through a digitizer or haptic device may be limited in recognizing a single character or word; however, as described below, the digitizer or haptic device used in the present invention only needs to recognize the first consonant of a word, so a person skilled in the art will readily understand that text recognition poses no great difficulty.

The present invention allows the user to enter information into a terminal [that is, a speech recognition device; note that this may be any terminal equipped with a speech recognition function, such as a mobile phone, PDA, navigation terminal, portable media player, laptop, or desktop computer] without inconvenience while raising the speech recognition success rate.

In particular, the present invention was devised to properly ensure the speech recognition success rate for words, sentences, and the like by addressing two representative difficulties of speech recognition: recognizing syllable onsets, that is, unvoiced sounds, and finding the starting point of the speech to be recognized. Using a multimodal scheme, it receives and processes a character and speech from the user at the same time, narrows the speech recognition targets to vocabulary items beginning with that character, and recognizes the speech uttered by the user.

The components of the speech recognition device proposed in the present invention are described next.

The character processing unit (30) detects a key signal entered by the user through the key button (10), processes the corresponding first consonant of a word as a character, and passes it to the multimodal synchronization unit (50).

For example, the user presses the key button corresponding to the first consonant of a word to enter a character and, simultaneously or sequentially, speaks that word, that is, the speech corresponding to the information to be conveyed, into the microphone.

The present invention is most preferably applied to a speech recognition device for Korean, but it is also applicable to one for English, in which case the user enters a character by pressing the key button for the first letter of the word.

For convenience, the description below uses a single word as the example, but the present invention is also applicable to sentences made up of several words, as described in detail later.

The speech processing unit (40) detects the speech signal entered by the user through the microphone (20), processes it, and passes it to the multimodal synchronization unit (50). Here the speech processing unit (40) performs only simple signal processing, like a DSP; the substantive processing for speech recognition is performed by the speech recognition unit (60) described below.

The multimodal synchronization unit (50) performs text/speech synchronization so that the order of the first consonant of the word received from the character processing unit (30) and the speech received from the speech processing unit (40) is not reversed. For example, if the user enters "ㄱ" as a character and utters "감자" (potato), the character "ㄱ" and the utterance "감자" are bundled together and synchronized, then handed to the speech recognition unit (60).

A typical speech recognition unit extracts feature vectors from the speech uttered by the user and performs recognition using an HMM (Hidden Markov Model) and a recognition DB; the speech recognition unit (60) of the present invention also performs this known speech recognition process.

In particular, the speech recognition unit (60) proposed in the present invention performs recognition using, as parameters, the first consonant of the word and the user's utterance synchronized by the multimodal synchronization unit (50). In doing so, it narrows (restricts) the recognition targets to only those vocabulary items in the pronunciation dictionary stored in the recognition DB that begin with that first consonant. This recognition process is described in detail below with reference to FIGS. 2 and 3.

The recognition result from the speech recognition unit (60) is output as text through a display unit or as speech through a speaker, so that the user can check whether the utterance was recognized successfully. Of course, a person skilled in the art will readily understand that when the user enters origin/destination information on a navigation terminal in this multimodal manner, a map from the origin to the destination is provided through the display unit and traffic guidance announcements are provided through the speaker.

FIG. 2 is an explanatory diagram of an embodiment showing the speech recognition process used in the present invention, and FIG. 3 is a flowchart of an embodiment of the multimodal speech recognition method according to the present invention.

As shown in FIG. 2, in a typical speech recognition process the speech recognition unit (60) uses an HMM (Hidden Markov Model): it extracts speech feature vectors from the user's utterance (61) and, for the extracted feature vectors, searches the pronunciation dictionary (71), language model (72), and acoustic model (73) in the recognition DB (70), recognizing the word or sentence corresponding to the speech through pattern matching (62).

The pronunciation dictionary (71) stores the vocabulary subject to speech recognition. For Korean, the pronunciation dictionary provided with SiTEC's "CleanSent01" may be used; for English, the "CMU dictionary V.0.6" provided by Carnegie Mellon University may be used; any other pronunciation dictionary may also be used.

In the present invention, before the speech recognition unit (60) performs the actual recognition, it extracts from the pronunciation dictionary (71) stored in the recognition DB (70) only the vocabulary items beginning with the consonant received from the user, and groups them to generate a single speech recognition target dictionary.

The speech recognition target dictionary generated in this way has the data structure of a reduced pronunciation dictionary; the actual recognition is performed by searching it for the vocabulary item corresponding to the user's utterance.
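The reduction of the pronunciation dictionary to a target dictionary can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the function names, the toy dictionary, and its phoneme spellings are hypothetical. The initial consonant is obtained from the standard Unicode arithmetic for precomposed Hangul syllables.

```python
# Initial consonants (choseong) in Unicode Hangul syllable order.
CHOSEONG = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"


def first_consonant(word):
    """Initial consonant of the first syllable, or None for non-Hangul input."""
    if not word:
        return None
    offset = ord(word[0]) - 0xAC00
    if 0 <= offset < 11172:
        return CHOSEONG[offset // 588]  # 588 = 21 vowels * 28 finals
    return None


def build_target_dictionary(pronunciation_dict, consonant):
    """Keep only the entries whose headword starts with the given consonant."""
    return {
        word: phonemes
        for word, phonemes in pronunciation_dict.items()
        if first_consonant(word) == consonant
    }


# Hypothetical toy pronunciation dictionary: headword -> phoneme sequence.
pronunciation_dict = {
    "감자": ["k", "a", "m", "j", "a"],
    "고기": ["k", "o", "g", "i"],
    "나무": ["n", "a", "m", "u"],
    "사과": ["s", "a", "g", "w", "a"],
}
target = build_target_dictionary(pronunciation_dict, "ㄱ")
```

The result keeps the same headword-to-phoneme structure as the full dictionary, so a recognizer that searches the full dictionary could search the reduced one unchanged.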

In particular, to improve recognition performance, the recognition DB (70) is organized to hold the vocabulary subject to speech recognition together with the information needed to recognize each vocabulary item.

For example, the recognition DB (70) includes information on which phoneme sequence makes up each vocabulary item, information on which class each word belongs to, and information indicating which classes can come before or after each word when performing sentence recognition, so that only the relevant vocabulary items can be extracted accurately and quickly from the pronunciation dictionary (71) to generate the speech recognition target dictionary.

In the dictionary generation process described above, one or more consonants may be received from the user in order, and the vocabulary items that match these consonants are grouped to generate the speech recognition target dictionary. The dictionary can be generated at the time multimodal synchronization takes place or in advance. In the former case, if only the first consonant of a word was synchronized, the generation time is the moment that consonant is entered; if several consonants were synchronized, it is the moment the last consonant is entered.

As noted above, narrowing the speech recognition target vocabulary is the core element of the present invention, and narrowing it further can raise the success rate. This is explained below for the case where the user enters the first consonant of each syllable of a word.

That is, when the pronunciation dictionary in the recognition DB contains vocabulary items beginning with "ㄱ" through "ㅎ" and the user wants to utter "감기" (a cold), the consonants entered as characters are "ㄱ ㄱ".

Accordingly, the present invention first selects from the pronunciation dictionary the two-syllable vocabulary items in which the first consonant of each syllable is "ㄱ" as material for generating the speech recognition target dictionary. For example, the primary recognition target vocabulary would include "가구" (furniture), "가고" (going), "감기" (cold), "고기" (meat), "공기" (air), and "굽고" (grilling).
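The per-syllable selection in this example can be sketched in Python. The function names and the extra words added to the list ("감자", "고구마", "나비") are illustrative assumptions; the six matching words are the ones given in the text.

```python
# Initial consonants (choseong) in Unicode Hangul syllable order.
CHOSEONG = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"


def initials(word):
    """Initial consonant of every syllable, e.g. '감기' -> 'ㄱㄱ'."""
    result = []
    for ch in word:
        offset = ord(ch) - 0xAC00
        if not 0 <= offset < 11172:
            return None  # not a precomposed Hangul syllable
        result.append(CHOSEONG[offset // 588])
    return "".join(result)


def select_by_initials(words, typed):
    """Words whose per-syllable initials equal the typed consonant string."""
    return [w for w in words if initials(w) == typed]


words = ["가구", "가고", "감기", "고기", "공기", "굽고", "감자", "고구마", "나비"]
primary = select_by_initials(words, "ㄱㄱ")
```

Because the comparison is on the whole consonant string, the match also fixes the syllable count: "고구마" has three syllables and is excluded even though its first two initials are "ㄱ".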

With the primary recognition target vocabulary selected, unnecessary items can be removed from it using information associated with the word.

That is, if the word corresponding to the speech at the time of utterance is a predicate, "가구", "감기", "고기", and "공기" are excluded from the primary vocabulary, and only the predicates "가고" and "굽고" are selected as the secondary recognition target vocabulary to generate the speech recognition target dictionary.
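The secondary narrowing by word class might look like the following Python sketch. The class labels ("noun", "predicate"), the table layout, and the function name are hypothetical; the text only states that class information for each word is available.

```python
# Hypothetical class labels for the primary candidates.
WORD_CLASS = {
    "가구": "noun", "감기": "noun", "고기": "noun", "공기": "noun",
    "가고": "predicate", "굽고": "predicate",
}


def filter_by_class(candidates, word_class, expected_class=None):
    """Keep candidates of the expected class; keep all if the class is unknown."""
    if expected_class is None:
        return list(candidates)
    return [w for w in candidates if word_class.get(w) == expected_class]


primary = ["가구", "가고", "감기", "고기", "공기", "굽고"]
secondary = filter_by_class(primary, WORD_CLASS, "predicate")
```

Passing no expected class leaves the consonant-based candidates untouched, which is the safer choice when the role of the word in the sentence is not known.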

This approach can also be used, when recognition of the user's utterance fails, to search the recognition results again using the second consonant.

However, if information about the word corresponding to the speech at the time of utterance, for example its relationship within the sentence, is unclear, it is preferable to generate the speech recognition target dictionary using only the consonant information described above.

The flow of the speech recognition process of the present invention described above is now examined with reference to FIG. 3.

When a consonant is received from the user and the user's utterance is received at the same time (300), the consonant and the utterance are synchronized with each other (301).

Then, only the vocabulary items beginning with the multimodally synchronized consonant are extracted from the pronunciation dictionary in the recognition DB (302), and the extracted items are grouped to generate a single speech recognition target dictionary [narrowing of the recognition target vocabulary] (303).

With the speech recognition target dictionary generated, it is searched for the vocabulary item matching the multimodally synchronized utterance, and the word or sentence corresponding to the utterance is recognized [the actual speech recognition process] (304).
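Steps 300 to 304 can be strung together in a short Python sketch. The HMM-based pattern matching is not reproduced here: the caller supplies a scoring function, and the position-wise phoneme overlap used below is only a toy stand-in. All names, the toy dictionary, and the phoneme spellings are assumptions made for illustration.

```python
from dataclasses import dataclass

# Initial consonants (choseong) in Unicode Hangul syllable order.
CHOSEONG = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"


@dataclass
class SyncedInput:
    consonant: str     # character typed by the user
    utterance: object  # opaque speech data handed to the scorer


def first_consonant(word):
    offset = ord(word[0]) - 0xAC00 if word else -1
    return CHOSEONG[offset // 588] if 0 <= offset < 11172 else None


def recognize(consonant, utterance, pronunciation_dict, score):
    synced = SyncedInput(consonant, utterance)  # steps 300-301: bundle the inputs
    extracted = [w for w in pronunciation_dict
                 if first_consonant(w) == synced.consonant]  # step 302
    target = {w: pronunciation_dict[w] for w in extracted}   # step 303
    if not target:
        return None
    # Step 304: search only the reduced dictionary.
    return max(target, key=lambda w: score(synced.utterance, target[w]))


def overlap(utterance, phonemes):
    """Toy scorer: number of positions where the phoneme labels agree."""
    return sum(1 for a, b in zip(utterance, phonemes) if a == b)


pronunciation_dict = {
    "감자": ["k", "a", "m", "j", "a"],
    "남자": ["n", "a", "m", "j", "a"],
    "고기": ["k", "o", "g", "i"],
}
heard = ["?", "a", "m", "j", "a"]  # onset lost, rest of the word intact
result = recognize("ㄱ", heard, pronunciation_dict, overlap)
```

With the onset missing, "감자" and "남자" score the same under the toy scorer; the typed consonant is what separates them.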

In addition, in step 300 above, a character and speech are received from the user; if no user speech is received within a certain time after a key signal is entered through the key button, the user input is processed in unimodal fashion, that is, by the conventional key button input method.

Of course, if in step 300 only user speech is received without any key signal being entered through the key button, the conventional speech recognition process described with reference to FIG. 2 is performed.
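The two fallback cases can be expressed as a small decision function. This Python sketch assumes timestamps in seconds and a 2-second window; the text says only "a certain time", so the default value, the mode names, and the function name are all hypothetical.

```python
def choose_mode(key_time, voice_time, timeout=2.0):
    """Pick the processing path from the arrival times of the two inputs.

    key_time / voice_time are seconds, or None if that input never arrived.
    """
    if key_time is None and voice_time is None:
        return "idle"
    if key_time is None:
        return "speech_only"    # conventional recognition over the full dictionary
    if voice_time is None or voice_time - key_time > timeout:
        return "unimodal_key"   # treat the key press as ordinary text entry
    return "multimodal"         # consonant-restricted recognition
```

Speech that arrives slightly before the key press still yields "multimodal" here, which is one possible reading of "simultaneously or sequentially"; a stricter ordering rule would need an extra check.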

전술한 본 발명의 일실시예에서는 하나의 단어를 음성 인식하는 것을 그 예로서 설명하였으나, 하나의 단어에 대한 그 음성 인식 성능을 더욱 높이기 위해 하나의 단어를 이루는 각 음절의 첫번째 자음들만을 키버튼을 통해 사용자로부터 입력받아, 이러한 각 자음들로 각각 시작되는 음절로 이루어진 단어에 대응되는 음성을 인식할 수도 있다. 즉, 이러한 경우에 있어 각 자음들로 각각 시작되는 음절로 이루어진 어휘들을 발음 사전으로부터 추출해 음성 인식 대상 사전으로서 생성한다. 다만, 영어에 있어서는 하나의 단어를 이루는 각 음절을 입력받는 것은 결국 그 단어 자체를 입력받는 경우가 되기에 큰 의미가 없으며, 이러한 경우에는 단어의 첫번째 알파벳만을 입력받는 것이 바람직하다.In the above-described embodiment of the present invention, the speech recognition of one word has been described as an example, but in order to further improve the speech recognition performance of one word, only the first consonants of each syllable forming one word are key buttons. A voice corresponding to a word composed of syllables starting from each of the consonants may be recognized by the user. That is, in this case, the vocabulary composed of syllables starting with each consonant is extracted from the phonetic dictionary and generated as a speech recognition target dictionary. However, in English, receiving each syllable constituting a single word does not mean that the word itself is eventually input. In this case, it is preferable to receive only the first alphabet of the word.

Another embodiment of the present invention is a method of recognizing a sentence, in which only the first-syllable consonant of each word of the sentence is received from the user through the key buttons and the sentence is then recognized. In this case, words beginning with the respective consonants are combined, and the vocabulary corresponding to each word is extracted from the pronunciation dictionary and generated as the speech recognition target dictionary.
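One way to picture the sentence embodiment is to collect, for each typed consonant, the words that begin with it and then combine the per-position candidates. The sketch below is illustrative only: it assumes the pronunciation dictionary is a list of Hangul words, and the function names and sample vocabulary are hypothetical. The patent does not specify how the combinations are formed, so the plain Cartesian product used here is an assumption.

```python
from itertools import product
from typing import List, Optional

# The 19 Hangul initial consonants in Unicode composition order.
CHOSEONG = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"


def first_consonant(word: str) -> Optional[str]:
    """Return the initial consonant of the first syllable of a word,
    or None if the word is empty or does not start with Hangul."""
    if not word:
        return None
    offset = ord(word[0]) - 0xAC00
    if not 0 <= offset < 11172:
        return None
    return CHOSEONG[offset // 588]


def sentence_candidates(pronunciation_dictionary: List[str],
                        typed_consonants: str) -> List[str]:
    """Build candidate sentences: one word per typed consonant, each word
    beginning with the consonant typed for its position."""
    per_position = [
        [w for w in pronunciation_dictionary if first_consonant(w) == c]
        for c in typed_consonants
    ]
    return [" ".join(combo) for combo in product(*per_position)]


# Hypothetical sample vocabulary.
words = ["오늘", "날씨", "내일", "어제", "좋다"]
print(sentence_candidates(words, "ㅇㄴ"))
# ['오늘 날씨', '오늘 내일', '어제 날씨', '어제 내일']
```

Because the number of combinations grows multiplicatively with sentence length, a practical system would presumably prune them, for example with the context information mentioned in the claims.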

According to the present invention as described above, the user can input text and voice at the same time and therefore need not select a menu repeatedly to switch between text and voice input, which has the advantage that information can be entered into the terminal conveniently.

Meanwhile, when the present invention is applied to a portable terminal, the small size of the terminal means that it cannot provide a key button for every consonant and vowel, so the interface is generally implemented so that several consonants and vowels can be entered with one key button. Examples include the Cheonjiin (천지인) and Naratgeul (나랏글) keypad layouts, in which "ㄱ", "ㅋ", and "ㄱㄱ" are entered with a single key button.

Accordingly, the present invention also takes the keypad structure of portable terminals into account: to spare the user from pressing the same key button several times to enter the first consonant of a word, vocabulary beginning with any of the consonants assigned to a single key button on the terminal is set as the speech recognition target vocabulary.

For example, when the user utters "까망", "ㅋ" would ordinarily have to be entered; in the present invention, however, entering "ㄱ" still allows vocabulary beginning with "ㅋ" to be searched. When "ㄱ" and "ㅋ" are assigned to the same button, this has the advantage of eliminating the inconvenience of pressing the "ㄱ" button twice in succession to enter "ㅋ".
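The key-group expansion described above can be sketched as follows. The sketch assumes that "ㄱㄱ" in the description denotes the tense consonant "ㄲ" and that the pronunciation dictionary is a list of Hangul words. The `KEY_GROUPS` table contains only the one group named in the description, and all function names and sample words are hypothetical.

```python
from typing import List, Optional, Tuple

# The 19 Hangul initial consonants in Unicode composition order.
CHOSEONG = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"

# Consonants that share one key button. Only the group named in the
# description is listed; a real keypad layout would define the rest.
KEY_GROUPS: List[Tuple[str, ...]] = [("ㄱ", "ㅋ", "ㄲ")]


def first_consonant(word: str) -> Optional[str]:
    """Return the initial consonant of the first syllable of a word,
    or None if the word is empty or does not start with Hangul."""
    if not word:
        return None
    offset = ord(word[0]) - 0xAC00
    if not 0 <= offset < 11172:
        return None
    return CHOSEONG[offset // 588]


def consonants_on_same_key(consonant: str) -> Tuple[str, ...]:
    """Expand a typed consonant to every consonant on its key button;
    a consonant without a listed group stands for itself only."""
    for group in KEY_GROUPS:
        if consonant in group:
            return group
    return (consonant,)


def build_target_dictionary(pronunciation_dictionary: List[str],
                            typed_consonant: str) -> List[str]:
    """Keep the words that begin with any consonant on the pressed key."""
    allowed = set(consonants_on_same_key(typed_consonant))
    return [w for w in pronunciation_dictionary
            if first_consonant(w) in allowed]


# Hypothetical sample vocabulary.
words = ["가방", "까망", "커피", "나무"]
print(build_target_dictionary(words, "ㄱ"))  # ['가방', '까망', '커피']
```

A single press of the "ㄱ" key thus keeps "까망" and "커피" in the target dictionary, at the cost of a somewhat larger candidate set than an exact consonant would give.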

The method of the present invention as described above may be implemented as a program and stored in computer-readable form on a recording medium (CD-ROM, RAM, ROM, floppy disk, hard disk, magneto-optical disk, etc.). Because a person of ordinary skill in the art to which the present invention pertains can readily carry out this process, it is not described in further detail.

The present invention described above is not limited by the foregoing embodiments and the accompanying drawings, since a person of ordinary skill in the art to which the present invention pertains can make various substitutions, modifications, and changes without departing from the technical spirit of the present invention.

As described above, the present invention performs speech recognition on the basis of a multimodally synchronized consonant and user utterance, and thereby has the effect of ensuring a very high speech recognition success rate not only on high-performance computing devices but also on low-specification portable terminals.

In addition, by adopting the multimodal approach, the present invention has the effect of accurately finding the starting point of the word to be recognized and of performing accurate speech recognition even for unvoiced initial consonants, which are difficult to recognize in Korean.

In addition, the present invention spares the user from performing numerous key operations to input information, and has the effect of allowing information to be entered accurately and without inconvenience even in environments where the eyes and hands are not free, such as while moving or driving.

Claims

A speech recognition method using multimodal input, comprising:

receiving a consonant from a user and, at the same time, receiving the user's spoken voice, and synchronizing the consonant with the user's spoken voice;

extracting vocabulary beginning with the synchronized consonant from a pronunciation dictionary;

grouping the extracted vocabulary and generating it as a speech recognition target dictionary; and

with the speech recognition target dictionary generated, searching the generated speech recognition target dictionary for the vocabulary matching the synchronized user's spoken voice and recognizing a word or sentence corresponding to the user's spoken voice.

The method of claim 1,

wherein the consonant is the first consonant of the first syllable of the word corresponding to the user's spoken voice.

The method of claim 1,

wherein, when consonants for a single word are received from the user, the first consonant of each syllable constituting the word is received from the user, and

vocabulary composed of syllables beginning with the respective input consonants is extracted from the pronunciation dictionary and generated as the speech recognition target dictionary.

The method of claim 1,

wherein, when consonants for a single sentence are received from the user, the first-syllable consonant of each word constituting the sentence is received from the user, and

words beginning with the respective input consonants are combined, and the vocabulary corresponding to each word is extracted from the pronunciation dictionary and generated as the speech recognition target dictionary.

The method of claim 1,

wherein the consonant is received from the user through a key button, at least one consonant being assigned to each key button, and

when the user selects a specific key button, vocabulary beginning with each of the consonants assigned to that key button is extracted from the pronunciation dictionary.

The method of claim 1,

wherein, when the first consonant of each syllable of a word is received from the user, generating the speech recognition target dictionary comprises:

selecting, from the pronunciation dictionary, vocabulary corresponding to the first consonant of each syllable input by the user as speech recognition target vocabulary;

removing, on the basis of context information of the word input by the user, vocabulary that does not match the context information from the selected speech recognition target vocabulary; and

generating the speech recognition target vocabulary from which that vocabulary has been removed as the speech recognition target dictionary.

A speech recognition apparatus using multimodal input, comprising:

a text processing unit that recognizes a text signal input by a user, processes the consonant corresponding to the signal as text, and passes it to a multimodal synchronization unit;

a voice processing unit that recognizes and processes a voice signal input by the user and passes it to the multimodal synchronization unit;

the multimodal synchronization unit, which synchronizes the consonant received from the text processing unit with the user's spoken voice received from the voice processing unit;

a recognition DB in which a pronunciation dictionary is stored; and

a speech recognition unit that extracts vocabulary beginning with the synchronized consonant from the pronunciation dictionary, groups the extracted vocabulary and generates it as a speech recognition target dictionary, and then searches the generated speech recognition target dictionary for the matching vocabulary to recognize a word or sentence corresponding to the synchronized user's spoken voice.

The apparatus of claim 7,

wherein the speech recognition unit

performs the speech recognition process using, as parameters, the first consonant of the word and the user's spoken voice synchronized by the multimodal synchronization unit, and narrows the speech recognition targets to only the vocabulary in the pronunciation dictionary that begins with the first consonant.

The apparatus of claim 7 or 8,

wherein the text signal is input by the user through at least one of a key button, a pen tool, and a virtual keyboard.

A computer-readable recording medium having recorded thereon a program for implementing, in a speech recognition apparatus having a processor:

a function of receiving a consonant from a user and, at the same time, receiving the user's spoken voice, and synchronizing the consonant with the user's spoken voice;

a function of extracting vocabulary beginning with the synchronized consonant from a pronunciation dictionary;

a function of grouping the extracted vocabulary and generating it as a speech recognition target dictionary; and

a function of, with the speech recognition target dictionary generated, searching the generated speech recognition target dictionary for the vocabulary matching the synchronized user's spoken voice and recognizing a word or sentence corresponding to the user's spoken voice.