KR101092489B1 - Speech recognition system and method

Info

Publication number: KR101092489B1
Application number: KR1020090126406A
Authority: KR
Inventors: 고우현; 지상훈; 남경태; 이상무; 손웅희
Original assignee: 한국생산기술연구원
Priority date: 2009-12-17
Filing date: 2009-12-17
Publication date: 2011-12-13
Also published as: KR20110069605A

Abstract

The present invention relates to a speech recognition system and method. The system includes an image-based speech recognition apparatus that photographs a user to generate vision information and determines whether to adopt recognized word information, generated by recognizing speech uttered by the user, according to the recognized word information and the vision information. According to the present invention, the user's spoken word is recognized using both the vision information obtained by photographing the user and the speech uttered by the user, which lowers the probability of misrecognition and improves the performance of the speech recognition system.

Speech recognition, vision information, lip shape, feature element, phoneme group

Description

Speech Recognition System and Method {SPEECH RECOGNITION SYSTEM AND METHOD}

The present invention relates to a speech recognition system and method.

Rapidly advancing robot technology improves the quality of human life. Industrial robots built on sophisticated control technology brought material abundance, and the field has gradually evolved into service robotics, bringing convenience to everyday life through cleaning robots, guide robots, and the like. For industrial robots, the key issues are synchronization techniques and control techniques that enable high speed and precise motion in order to raise productivity. For service robots, the key issue is human-robot interaction (HRI) technology for providing people with appropriate services.

A robot has various sensors through which it can receive instructions from a person. It can receive direct commands through external input devices such as buttons, keyboards, and touch screens. However, when commands are received from a person using only such hardware input, it is difficult to extend the command set. Techniques have therefore been developed for delivering commands using speech information obtained through conversation with a person. Using speech information allows a wide variety of commands to be recognized and executed, but because a speech signal is strongly affected by environmental factors, degradation of the signal by acoustic noise makes misrecognition likely.

The problem to be solved by the present invention is to provide a speech recognition system and method that can lower the misrecognition rate for speech uttered by a user.

A speech recognition system according to one aspect of the present invention for solving this problem includes an image-based speech recognition apparatus that photographs a user to generate vision information and determines whether to adopt recognized word information, generated by recognizing speech uttered by the user, according to the recognized word information and the vision information.

The system may further include a speech-based speech recognition apparatus that recognizes the speech uttered by the user to generate the recognized word information and provides the recognized word information to the image-based speech recognition apparatus.

The image-based speech recognition apparatus may extract, from the vision information, the vowels uttered by the user and classify them by phoneme group.

The phoneme groups may be divided according to the similarity of the feature elements of each vowel.

The image-based speech recognition apparatus may extract first phoneme order information from the recognized word information, extract second phoneme order information from the vision information, and determine whether to adopt the recognized word information according to the first and second phoneme order information.

The image-based speech recognition apparatus may analyze the ASCII code of the recognized word information to extract a vowel for each syllable, determine the phoneme group corresponding to each extracted vowel, and generate the first phoneme order information according to the determined phoneme groups.

The image-based speech recognition apparatus may calculate values of feature elements of the lip shape based on the vision information and generate the second phoneme order information according to the phoneme groups corresponding to the calculated feature element values.

The feature elements may include at least one of lip width, lip height, the ratio of lip width to lip height, and lip area.

The image-based speech recognition apparatus may partition a feature element space, defined by the feature elements, into regions for the respective phoneme groups and use a maximum likelihood value to extract the phoneme group corresponding to the calculated feature element values.

A robot system according to another aspect of the present invention includes any one of the speech recognition systems described above.

A dialogue system according to another aspect of the present invention includes any one of the speech recognition systems described above.

A speech recognition method according to another aspect of the present invention includes photographing a user to generate vision information, and determining whether to adopt recognized word information, generated by recognizing a speech signal uttered by the user, according to the recognized word information and the vision information.

The method may further include recognizing the speech uttered by the user to generate the recognized word information.

The determining step may include extracting, from the vision information, the vowels uttered by the user and classifying them by phoneme group.

The determining step may include extracting first phoneme order information from the recognized word information, extracting second phoneme order information from the vision information, and determining whether to adopt the recognized word information according to the first and second phoneme order information.

Extracting the first phoneme order information may include analyzing the ASCII code of the recognized word information to extract a vowel for each syllable, determining the phoneme group corresponding to each extracted vowel, and generating the first phoneme order information according to the determined phoneme groups.

Extracting the second phoneme order information may include calculating values of feature elements of the lip shape based on the vision information, and generating the second phoneme order information according to the phoneme groups corresponding to the calculated feature element values.

Extracting the second phoneme order information may further include partitioning a feature element space, defined by the feature elements, into regions for the respective phoneme groups and using a maximum likelihood value to extract the phoneme group corresponding to the calculated feature element values.

A computer-readable medium according to an embodiment of the present invention records a program for causing a computer to execute any one of the methods described above.

As described above, according to the present invention, the user's spoken word is recognized using both the speech uttered by the user and the vision information obtained by photographing the user, which lowers the probability of misrecognition and improves the performance of the speech recognition system.

Embodiments of the present invention will now be described in detail with reference to the accompanying drawings so that those of ordinary skill in the art to which the present invention pertains can readily practice them.

First, a speech recognition system according to an embodiment of the present invention will be described in detail with reference to FIG. 1.

FIG. 1 is a block diagram illustrating a speech recognition system according to an embodiment of the present invention.

The speech recognition system 100 according to the embodiment of the present invention includes a speech-based speech recognition apparatus 110 and an image-based speech recognition apparatus 130.

The speech-based speech recognition apparatus 110 receives speech uttered by the user U through an input module (not shown) such as a microphone. It generates recognized word information based on the word recognized from the input speech and provides this information to the image-based speech recognition apparatus 130.

The speech-based speech recognition apparatus 110 estimates the similarity between the input speech and reference words in a pre-built, stored word database (not shown). Information such as the pronunciation and frequency of the input speech (hereinafter, 'speech information') is the main factor in determining similarity. A hidden Markov model (HMM) may be used to model the speech signal, although the modeling is not limited thereto. The hidden Markov model is one of the statistical language modeling techniques used in implementing speech recognition. A word, a syllable, or the like can be represented by a hidden Markov model trained in advance on a large amount of speech information. The speech information may also be converted into speech feature vectors before the speech signal is modeled, and the signal may then be modeled on the basis of those vectors. Modeling the speech signal with speech feature vectors in this way can further improve recognition performance.

The speech-based speech recognition apparatus 110 may use a pruning algorithm to compare the input speech signal with the words in the word database and estimate their similarity, although it is not limited thereto. Using such a pruning algorithm allows speech recognition with little computation even when the word database is large.

The speech-based speech recognition apparatus 110 extracts a recognized word based on the estimated similarity and generates recognized word information based on the recognized word. Here, the apparatus treats a candidate word whose similarity falls below a predetermined level as a recognition failure, and treats a candidate word whose similarity is at or above the predetermined level as a successful recognition, generating recognized word information for it.
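
The accept-or-fail rule described above can be sketched as follows. This is only a minimal illustration: the function name `pick_recognized_word`, the dictionary of per-word similarity scores, and the threshold value are assumptions made for the example, and the text does not specify how the similarity is scaled.

```python
from typing import Dict, Optional


def pick_recognized_word(similarities: Dict[str, float],
                         threshold: float) -> Optional[str]:
    """Return the most similar database word, or None on recognition failure.

    `similarities` maps each candidate word to its estimated similarity
    (hypothetical scale); `threshold` is the predetermined level.
    """
    if not similarities:
        return None
    best_word = max(similarities, key=similarities.get)
    # At or above the predetermined level counts as a successful recognition.
    if similarities[best_word] >= threshold:
        return best_word
    return None
```

A `None` result corresponds to the recognition-failure case, in which no recognized word information is generated.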

The image-based speech recognition apparatus 130 determines whether to adopt the recognized word information based on the recognized word information provided by the speech-based speech recognition apparatus 110 and on vision information obtained by photographing the face of the user U while the user is speaking. If the information is adopted, the image-based speech recognition apparatus 130 determines the recognized word to be a valid word and provides valid word information to an external device (not shown) such as a robot system or a dialogue system. If it is not adopted, the image-based speech recognition apparatus 130 may provide a recognition failure message to the user U.

Although FIG. 1 shows the apparatuses 110 and 130 as separate, independent apparatuses, they are not limited thereto and may be implemented as a single unit depending on the embodiment. Likewise, although the apparatuses 110 and 130 have been described as included in the speech recognition system 100, each may be implemented independently as a separate apparatus.

Next, the image-based speech recognition apparatus 130 according to an embodiment of the present invention will be described in more detail with reference to FIGS. 2 to 5.

FIG. 2 is a block diagram showing the image-based speech recognition apparatus of FIG. 1 in more detail. FIG. 3 illustrates the operation of extracting first phoneme order information from recognized word information according to an embodiment of the present invention. FIG. 4 illustrates the operation of extracting the lip portion from vision information according to an embodiment of the present invention. FIG. 5 illustrates the operation of extracting second phoneme order information from vision information according to an embodiment of the present invention.

Referring to FIG. 2, the image-based speech recognition apparatus 130 includes an image capturing unit 131, a recognized-word adoption determiner 133, and a valid word provider 135.

The image capturing unit 131 includes a capturing module (not shown) such as a camera. When the user U utters speech, it photographs the face of the user U through the capturing module and generates vision information consisting of video frames.

The recognized-word adoption determiner 133 analyzes the ASCII code of the recognized word information provided by the speech-based speech recognition apparatus 110 to extract the vowel of each syllable. It then determines a phoneme group for each syllable based on the extracted vowels and generates the first phoneme order information.

Hangul forms syllables from combinations of 19 consonants and 21 vowels. To sound each syllable, the user U positions the tongue and shapes the mouth appropriately. Consonant pronunciation depends mainly on tongue position, and vowel pronunciation on mouth shape. In particular, the simple vowels 'ㅏ', 'ㅓ', 'ㅗ', 'ㅜ', 'ㅡ', 'ㅣ', 'ㅐ', and 'ㅔ' and the consonant 'ㅁ' have representative lip shapes that are easy to tell apart. All vowels and labial sounds can therefore be assigned to phoneme groups centered on these phonemes.

Each phoneme group may consist of a plurality of phonemes, and phonemes belonging to the same group have similar main feature elements. The main feature elements include the width, height, relative ratio of width to height, area, pixel values, and position of the inner or outer lip contour. As shown in [Table 1], the eight key vowels may be classified into four phoneme groups based on the similarity of their main feature elements. The classification is not limited thereto, and the vowels may be classified into eight phoneme groups or fewer. [Table 1] shows the key vowels belonging to each of the four phoneme groups and the values of the main feature elements of each group.

As one example of how the recognized-word adoption determiner 133 generates the first phoneme order information, when the recognized word is '대한민국' as shown in FIG. 3, the determiner analyzes the ASCII code of the recognized word information to extract the vowels 'ㅐ', 'ㅏ', 'ㅣ', and 'ㅜ' of the respective syllables, and determines a phoneme group for each syllable from the extracted vowels. For instance, it analyzes the ASCII code '0xebb4' of the syllable '대' to extract the vowel 'ㅐ' and determines that the phoneme group containing 'ㅐ' is the second phoneme group. Determining the phoneme group for each vowel in the same way yields first phoneme order information consisting of the second, first, fourth, and third phoneme groups, in that order.
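
The '대한민국' example can be reproduced with a short sketch. Note the substitutions: the text analyzes what it calls the ASCII code of each syllable, whereas this sketch uses the arithmetic layout of precomposed Hangul syllables in Unicode, which is a swapped-in technique and not necessarily the encoding the inventors used. The group numbers for 'ㅐ', 'ㅏ', 'ㅣ', and 'ㅜ' follow the worked example; the entries for the other four vowels are placeholders assumed for illustration, since [Table 1] is not reproduced here. All function and constant names are hypothetical.

```python
from typing import List, Optional

HANGUL_BASE = 0xAC00                  # first precomposed syllable
INITIAL_COUNT, MEDIAL_COUNT, FINAL_COUNT = 19, 21, 28
# The 21 medial vowels in Unicode order (compatibility jamo U+314F..U+3163).
MEDIALS = [chr(0x314F + i) for i in range(MEDIAL_COUNT)]

# Vowel -> phoneme group. Groups for the four vowels of the worked example
# come from the text; the remaining four are ASSUMED placeholders.
VOWEL_GROUP = {
    "ㅏ": 1, "ㅓ": 1,
    "ㅐ": 2, "ㅔ": 2,
    "ㅜ": 3, "ㅗ": 3,
    "ㅣ": 4, "ㅡ": 4,
}


def syllable_vowel(syllable: str) -> Optional[str]:
    """Return the medial vowel of one precomposed Hangul syllable, else None."""
    offset = ord(syllable) - HANGUL_BASE
    if not 0 <= offset < INITIAL_COUNT * MEDIAL_COUNT * FINAL_COUNT:
        return None
    return MEDIALS[(offset // FINAL_COUNT) % MEDIAL_COUNT]


def first_phoneme_order(word: str) -> List[Optional[int]]:
    """Map each syllable to its phoneme group (None if the vowel is unmapped)."""
    return [VOWEL_GROUP.get(syllable_vowel(ch)) for ch in word]


print(first_phoneme_order("대한민국"))  # [2, 1, 4, 3]
```

Vowels outside the eight key vowels map to `None` in this sketch; a fuller table would assign each of them to the group of the key vowel it most resembles.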

Referring to FIGS. 2 and 4, the recognized-word adoption determiner 133 extracts the main feature elements of the lip shape from the vision information generated by the image capturing unit 131. That is, it finds the position of the face of the user U in each video frame of the vision information, extracts the mouth area, and selects it as a region of interest (ROI). The mouth is generally located in the lower third of a person's face. As shown in FIG. 4, the ROI is extracted from each video frame of the vision information. Parts (a), (b), (c), (d), (e), (f), (g), and (h) of FIG. 4 show video frames obtained when the user U pronounced the vowels 'ㅏ', 'ㅓ', 'ㅐ', 'ㅔ', 'ㅜ', 'ㅗ', 'ㅡ', and 'ㅣ', respectively.
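
As a minimal sketch of the ROI selection, assuming a face detector has already returned a bounding box as `(x, y, width, height)` in pixel coordinates with the origin at the top left (the box format and the function name `mouth_roi` are assumptions for illustration), the lower third of the face can be cut out like this:

```python
from typing import Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height), origin at top left


def mouth_roi(face_box: Box) -> Box:
    """Return the lower third of a face bounding box as the mouth ROI."""
    x, y, w, h = face_box
    top = y + (2 * h) // 3          # skip the upper two thirds of the face
    return (x, top, w, y + h - top)


print(mouth_roi((10, 20, 90, 120)))  # (10, 100, 90, 40)
```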

The recognized-word adoption determiner 133 extracts a lip segment (LS) from the region of interest (ROI) using the lip color (RGB value) as a threshold. Pixels with a color similar to the lip color are usually scattered across the ROI. To extract the lip segment LS, a plurality of pixels is therefore selected in advance. Each pixel in the ROI is checked for a selected pixel among its neighbors, and if one exists, the pixel is merged with it to form a clustering group. A clustering group grows by repeatedly merging with adjacent pixels, and clustering groups may of course merge with one another into a single group. A plurality of clustered segments is thus produced, and the one at the center of the ROI is selected as the lip segment LS.
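
One way to realize this step is a color threshold followed by connected-component grouping. The sketch below is an illustration under stated assumptions, not the patented procedure itself: it treats a pixel as lip-colored when every RGB channel lies within a tolerance `tol` of a reference lip color, groups such pixels with 8-neighbor flood fill, and picks the group whose centroid is nearest the ROI center. The reference color, the tolerance, and all function names are hypothetical.

```python
from collections import deque
from typing import List, Tuple

Pixel = Tuple[int, int, int]      # (R, G, B)
Point = Tuple[int, int]           # (row, column)


def lip_mask(roi: List[List[Pixel]], lip_rgb: Pixel, tol: int) -> List[List[bool]]:
    """Mark pixels whose every channel is within `tol` of the lip color."""
    return [[all(abs(c - l) <= tol for c, l in zip(px, lip_rgb)) for px in row]
            for row in roi]


def clusters(mask: List[List[bool]]) -> List[List[Point]]:
    """Group marked pixels into 8-connected clusters by flood fill."""
    if not mask or not mask[0]:
        return []
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    found = []
    for sy in range(h):
        for sx in range(w):
            if not mask[sy][sx] or seen[sy][sx]:
                continue
            seen[sy][sx] = True
            queue, group = deque([(sy, sx)]), []
            while queue:
                y, x = queue.popleft()
                group.append((y, x))
                for ny in (y - 1, y, y + 1):
                    for nx in (x - 1, x, x + 1):
                        if (0 <= ny < h and 0 <= nx < w
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            found.append(group)
    return found


def lip_segment(roi: List[List[Pixel]], lip_rgb: Pixel, tol: int) -> List[Point]:
    """Return the cluster whose centroid is closest to the ROI center."""
    groups = clusters(lip_mask(roi, lip_rgb, tol))
    if not groups:
        return []
    cy, cx = (len(roi) - 1) / 2, (len(roi[0]) - 1) / 2

    def distance_to_center(group: List[Point]) -> float:
        my = sum(p[0] for p in group) / len(group)
        mx = sum(p[1] for p in group) / len(group)
        return (my - cy) ** 2 + (mx - cx) ** 2

    return min(groups, key=distance_to_center)
```

Choosing the cluster by centroid distance is one interpretation of "the clustering group at the center of the ROI"; selecting the cluster that contains the center pixel would be another.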

The recognized-word adoption determiner 133 extracts the main feature elements of the lip shape from the pixels of the lip segment LS. Representative values of the main feature elements of each phoneme may be calculated in advance and stored in a phoneme database for reference. Because the main features may be expressed in different units, they may be normalized to a common scale.
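
To make the feature elements concrete, the following sketch computes lip width, height, width-to-height ratio, and area from a list of lip-segment pixel coordinates, and then rescales them. Using the bounding box of the segment for width and height, and min-max scaling for normalization, are assumptions chosen for simplicity; the text does not state which normalization is used. The function names and the `ranges` argument are hypothetical.

```python
from typing import Dict, List, Tuple

Point = Tuple[int, int]  # (row, column)


def lip_features(pixels: List[Point]) -> Dict[str, float]:
    """Compute width, height, width/height ratio and area of a lip segment."""
    rows = [y for y, _ in pixels]
    cols = [x for _, x in pixels]
    width = max(cols) - min(cols) + 1
    height = max(rows) - min(rows) + 1
    return {
        "width": width,
        "height": height,
        "ratio": width / height,
        "area": len(pixels),      # pixel count of the segment
    }


def min_max_normalize(features: Dict[str, float],
                      ranges: Dict[str, Tuple[float, float]]) -> Dict[str, float]:
    """Rescale each feature to [0, 1] using an assumed (low, high) range."""
    return {name: (value - ranges[name][0]) / (ranges[name][1] - ranges[name][0])
            for name, value in features.items()}
```

For example, `lip_features([(2, 1), (2, 2), (2, 3)])` gives a width of 3, a height of 1, a ratio of 3.0, and an area of 3.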

The recognized-word adoption determiner 133 extracts a phoneme group for each syllable using the extracted main feature elements of the lip shape and the main feature element space, and generates the second phoneme order information on that basis.

In the main feature element space, as shown in FIG. 5, the main feature elements serve as axes A1, A2, and A3, which are mutually orthogonal. The space is divided into first to fourth phoneme group regions PG1, PG2, PG3, and PG4 based on main feature elements calculated in advance for each phoneme group, and the boundaries between the regions are indicated by inter-group boundary lines BL.

That is, the recognized-word adoption determiner 133 calculates a maximum likelihood value for each phoneme group from the extracted main feature elements and, based on the calculated values, selects the group showing the greatest similarity as the phoneme group of the syllable.
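
A maximum likelihood choice among phoneme groups can be sketched as below. The text does not specify the likelihood model, so this example assumes each group is described by an independent Gaussian per feature (a mean and a variance for each axis). The group parameters in the usage lines are made-up illustrative numbers, not values from [Table 1], and the function names are hypothetical.

```python
import math
from typing import Dict, Sequence, Tuple

GroupModel = Tuple[Sequence[float], Sequence[float]]  # (means, variances)


def log_likelihood(x: Sequence[float],
                   means: Sequence[float],
                   variances: Sequence[float]) -> float:
    """Log-likelihood of x under independent Gaussians, one per feature."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, means, variances))


def classify(x: Sequence[float], groups: Dict[int, GroupModel]) -> int:
    """Return the phoneme group whose model gives x the highest likelihood."""
    return max(groups, key=lambda g: log_likelihood(x, *groups[g]))


# Illustrative two-feature models (normalized width, normalized height).
example_groups = {
    1: ([0.8, 0.8], [0.01, 0.01]),
    2: ([0.2, 0.2], [0.01, 0.01]),
}
print(classify([0.75, 0.70], example_groups))  # 1
```

Applying `classify` to the feature vector of each syllable in turn yields the sequence of groups that forms the second phoneme order information.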

The recognized-word adoption determiner 133 determines whether to adopt the recognized word based on the first and second phoneme order information. That is, it checks whether the per-syllable phoneme groups given by the first phoneme order information match those given by the second phoneme order information. It adopts the recognized word if the two sequences match and rejects it if they do not.
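
The adoption decision reduces to an element-by-element comparison of the two group sequences. A minimal sketch, with the function name `adopt_recognized_word` and the list-of-integers representation assumed for illustration:

```python
from typing import Sequence


def adopt_recognized_word(first_order: Sequence[int],
                          second_order: Sequence[int]) -> bool:
    """Adopt the word only if both phoneme group sequences are identical.

    `first_order` comes from the recognized word's text, `second_order`
    from the lip shapes; a different length also counts as a mismatch.
    """
    return list(first_order) == list(second_order)


print(adopt_recognized_word([2, 1, 4, 3], [2, 1, 4, 3]))  # True
print(adopt_recognized_word([2, 1, 4, 3], [2, 1, 3, 3]))  # False
```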

When the recognized-word adoption determiner 133 adopts the recognized word, the valid word provider 135 determines the recognized word to be a valid word. The valid word provider 135 provides valid word information based on the valid word to an external device such as a robot system or a dialogue system. When the recognized word is not adopted, the valid word provider 135 may provide a recognition failure message to the user U.

Next, the results of testing the performance of the speech recognition system 100 according to the present invention will be described.

The performance gain of the speech recognition system 100 according to the embodiment of the present invention (hereinafter, 'the present system'), which recognizes a word uttered by the user U (hereinafter, 'target word') using speech and vision information together, was tested against a system that recognizes the target word from speech alone (hereinafter, 'the comparison system'). The commercial product 'Voiceware' was used as the comparison system.

When a word is recognized from speech alone, the shorter the target word, the more likely it is to be misrecognized as a similar-sounding word. For the sake of test accuracy, isolated words of two or three syllables were therefore used as target words, and the target words were chosen mainly from words containing similar syllables that are prone to misrecognition.

[Table 2] shows the misrecognition rates measured with the comparison system and with the present system, using about 100 target words in which every syllable contains a key phoneme.

As [Table 2] shows, the comparison system had misrecognition rates of 8.2% and 22.9% for words in the distinct-word DB and the similar-word DB, respectively, whereas the present system misrecognized only words in the similar-word DB, at a rate of 12.5%. The present system thus improves on the misrecognition rate of the comparison system.
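
For the similar-word DB, the size of the improvement can be expressed as a relative reduction in the misrecognition rate. The short calculation below uses only the two percentages quoted above; the relative-reduction figure is derived arithmetic and is not a number stated in the source, and the function name is hypothetical.

```python
def relative_reduction(baseline_rate: float, new_rate: float) -> float:
    """Fraction by which the misrecognition rate drops relative to baseline."""
    return (baseline_rate - new_rate) / baseline_rate


# Similar-word DB: 22.9% (comparison system) vs 12.5% (present system).
reduction = relative_reduction(22.9, 12.5)
print(f"{reduction:.1%}")  # 45.4%
```

In absolute terms the drop is 10.4 percentage points, which corresponds to roughly 45% fewer misrecognitions on this word set.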

The speech recognition system 100 according to the embodiment of the present invention has so far been described as recognizing user utterances in Korean, but it can be applied in the same way to English, Japanese, Chinese, and other languages. The speech recognition system 100 according to the embodiment of the present invention may also be included in various systems such as a robot system, a dialogue system, and an education providing system.

Next, a speech recognition method according to an embodiment of the present invention will be described with reference to FIG. 6.

FIG. 6 is a flowchart illustrating a speech recognition method according to an embodiment of the present invention.

First, the speech-based speech recognition apparatus 110 receives speech uttered by the user U through the input module and generates recognized word information based on the word recognized from the input speech (S210). That is, the speech-based speech recognition apparatus 110 models speech signals in advance using a hidden Markov model or the like, estimates the similarity between the input speech signal and the words in the word database using a pruning algorithm or the like, and generates recognized word information based on the recognized word extracted according to the estimated similarity.

The image-based speech recognition apparatus 130 then photographs the face of the user U, who is uttering the speech, through the capturing module and generates vision information consisting of video frames (S220).

Thereafter, the image-based speech recognition apparatus 130 extracts first phoneme order information from the recognized word information (S230). Specifically, the image-based speech recognition apparatus 130 analyzes the ASCII code of the recognized word information to extract the vowel of each syllable. It then determines the phoneme group of each syllable from the extracted vowels and generates the first phoneme order information.
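The following is a minimal sketch of step S230 for Korean text. It assumes the recognized word is available as a string of precomposed Hangul syllables and uses Unicode code-point arithmetic rather than the character code analysis named in the description. The three phoneme groups ("open", "spread", "round") and the assignment of vowels to them are hypothetical placeholders, not the grouping defined by the invention.

```python
# Medial vowels (jungseong) in Unicode order for precomposed Hangul syllables.
VOWELS = ["ㅏ", "ㅐ", "ㅑ", "ㅒ", "ㅓ", "ㅔ", "ㅕ", "ㅖ", "ㅗ", "ㅘ", "ㅙ",
          "ㅚ", "ㅛ", "ㅜ", "ㅝ", "ㅞ", "ㅟ", "ㅠ", "ㅡ", "ㅢ", "ㅣ"]

# Hypothetical vowel-to-group table, for illustration only.
PHONEME_GROUPS = {
    "open": "ㅏㅑㅓㅕ",
    "spread": "ㅐㅒㅔㅖㅡㅢㅣ",
    "round": "ㅗㅛㅜㅠㅘㅙㅚㅝㅞㅟ",
}
VOWEL_TO_GROUP = {v: g for g, vs in PHONEME_GROUPS.items() for v in vs}

HANGUL_BASE, HANGUL_LAST = 0xAC00, 0xD7A3

def syllable_vowel(ch):
    """Return the medial vowel of one precomposed Hangul syllable."""
    code = ord(ch)
    if not HANGUL_BASE <= code <= HANGUL_LAST:
        raise ValueError(f"not a Hangul syllable: {ch!r}")
    # Each initial consonant spans 21 vowels x 28 finals = 588 code points.
    return VOWELS[((code - HANGUL_BASE) % 588) // 28]

def first_phoneme_order(word):
    """Map each syllable of `word` to the phoneme group of its vowel."""
    return [VOWEL_TO_GROUP[syllable_vowel(ch)] for ch in word]

print(first_phoneme_order("사과"))  # ['open', 'round']
```

Because only the vowel of each syllable is used, words that differ only in their consonants map to the same phoneme order; the sequence serves to check the recognized word against the lip shapes, not to identify the word on its own.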

The image-based speech recognition apparatus 130 also extracts second phoneme order information from the vision information (S240). Specifically, the image-based speech recognition apparatus 130 locates the face of the user U in each video frame of the vision information, extracts the mouth region, and selects it as the region of interest (ROI). It extracts the lip region LS from the ROI using the lip color (RGB value) as a threshold, and extracts the main feature elements of the lip shape from the pixels of the lip region LS. Using the extracted main feature elements and the main feature element space, it extracts the phoneme group of each syllable and generates the second phoneme order information on that basis. That is, the image-based speech recognition apparatus 130 calculates a maximum likelihood value against each phoneme group from the extracted main feature elements and selects the group showing the greatest similarity as the phoneme group of the syllable.
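The sketch below illustrates step S240 for a single mouth ROI under simplifying assumptions. Lip pixels are taken to be those within a fixed RGB distance of a reference lip color, which is one possible reading of "lip color as a threshold". The feature elements are lip width, height, width-to-height ratio, and area, and each phoneme group is modelled as a diagonal Gaussian in that feature space. The tolerance, group names, and Gaussian parameters are hypothetical, and face and mouth detection are omitted.

```python
import math

def lip_mask(roi, lip_rgb, tol=40.0):
    """Mark ROI pixels whose RGB distance to the reference lip color is <= tol."""
    return [[math.dist(px, lip_rgb) <= tol for px in row] for row in roi]

def lip_features(mask):
    """Return (width, height, width/height ratio, area) of the lip region."""
    pts = [(x, y) for y, row in enumerate(mask) for x, on in enumerate(row) if on]
    if not pts:
        return None  # no lip pixels found in this frame
    xs = [x for x, _ in pts]
    ys = [y for _, y in pts]
    width = max(xs) - min(xs) + 1
    height = max(ys) - min(ys) + 1
    return (float(width), float(height), width / height, float(len(pts)))

def log_likelihood(x, mean, var):
    """Log-density of feature vector x under a diagonal Gaussian."""
    return -0.5 * sum(
        math.log(2 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )

def classify(features, groups):
    """Pick the phoneme group whose model gives the features the highest likelihood."""
    return max(groups, key=lambda g: log_likelihood(features, *groups[g]))

# Hypothetical per-group (mean, variance) models in the feature element space.
GROUPS = {
    "spread": ((20.0, 4.0, 5.0, 80.0), (4.0, 1.0, 1.0, 100.0)),
    "round": ((8.0, 8.0, 1.0, 50.0), (4.0, 4.0, 0.25, 100.0)),
}

# Synthetic 30x20 ROI: black background with a 20x4 lip-colored rectangle.
LIP_RGB = (180, 60, 70)
roi = [[(0, 0, 0)] * 30 for _ in range(20)]
for y in range(8, 12):
    for x in range(5, 25):
        roi[y][x] = LIP_RGB

features = lip_features(lip_mask(roi, LIP_RGB))
print(features, classify(features, GROUPS))
```

Applying `classify` to the frame representing each syllable and collecting the results in order would yield a sequence comparable to the first phoneme order information; how frames are aligned to syllables is not covered by this sketch.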

The image-based speech recognition apparatus 130 determines whether to adopt the recognized word information based on the first and second phoneme order information (S250). Specifically, it checks whether the per-syllable phoneme groups given by the first phoneme order information match those given by the second phoneme order information. It adopts the recognized word if the sequences match and rejects it if they do not.

When the recognized word is adopted, the image-based speech recognition apparatus 130 determines the recognized word to be a valid word and provides valid word information based on the valid word to an external device such as a robot system or a dialogue system (S260).
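Steps S250 and S260 reduce to an element-wise comparison of the two phoneme group sequences. A minimal sketch follows, in which the function name and the group labels are hypothetical and `None` stands for a rejected word:

```python
from typing import Optional, Sequence

def select_valid_word(
    word: str,
    first_order: Sequence[str],
    second_order: Sequence[str],
) -> Optional[str]:
    """Return `word` as the valid word if both phoneme orders match, else None."""
    # Sequences of different length can never match, so they are rejected too.
    if list(first_order) == list(second_order):
        return word
    return None

# The audio result agrees with the lip shapes: the word is adopted.
print(select_valid_word("사과", ["open", "round"], ["open", "round"]))
# The lip shapes disagree on the second syllable: the word is rejected.
print(select_valid_word("사과", ["open", "round"], ["open", "open"]))
```

A caller would forward only non-`None` results to the external device.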

Although step S220 has been described as being performed after step S210, the invention is not limited thereto; depending on the embodiment, steps S210 and S220 may be performed simultaneously, or step S220 may be performed before step S210. Likewise, although step S240 has been described as being performed after step S230, steps S230 and S240 may be performed simultaneously, or step S240 may be performed before step S230.
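One way to run two independent steps at the same time, for example S210 with S220 or S230 with S240, is to submit them to a thread pool and wait for both results. The sketch below uses stand-in callables; the helper name and the returned placeholder values are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def run_concurrently(step_a, step_b):
    """Run two independent, argument-free steps in parallel and return both results."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        future_a = pool.submit(step_a)
        future_b = pool.submit(step_b)
        # result() blocks until the step finishes and re-raises any exception.
        return future_a.result(), future_b.result()

# Stand-ins for audio recognition (S210) and image capture (S220).
word, frames = run_concurrently(lambda: "사과", lambda: ["frame0", "frame1"])
print(word, frames)
```

Since neither step in a pair consumes the other's output, the order in which they finish does not affect the later comparison.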

Embodiments of the present invention include a computer-readable medium containing program instructions for performing various computer-implemented operations. The medium records a program for executing the speech recognition method described above. The medium may include program instructions, data files, data structures, and the like, alone or in combination. Examples of such media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical recording media such as CDs and DVDs; magneto-optical media such as floptical disks; and hardware devices configured to store and execute program instructions, such as ROM, RAM, and flash memory. Alternatively, the medium may be a transmission medium, such as an optical or metal line or a waveguide, that includes a carrier wave transmitting a signal specifying program instructions, data structures, and the like. Examples of program instructions include not only machine code such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like.

Although preferred embodiments of the present invention have been described in detail above, the scope of the present invention is not limited thereto; various modifications and improvements made by those skilled in the art using the basic concept of the present invention as defined in the following claims also fall within the scope of the present invention.

FIG. 2 is a block diagram showing the image-based speech recognition apparatus of FIG. 1 in more detail.

FIG. 3 is a diagram illustrating an operation of extracting first phoneme order information from recognized word information according to an embodiment of the present invention.

FIG. 4 is a diagram illustrating an operation of extracting the lip region from vision information according to an embodiment of the present invention.

FIG. 5 is a diagram illustrating an operation of extracting second phoneme order information from vision information according to an embodiment of the present invention.

<Explanation of reference numerals for the main parts of the drawings>

100: speech recognition system, 110: voice-based speech recognition apparatus,

130: image-based speech recognition apparatus, 131: image capturing unit,

133: recognized word adoption determination unit, 135: valid word providing unit

Claims

A speech recognition system comprising an image-based speech recognition apparatus that generates vision information by photographing a user and determines whether to adopt recognized word information, generated by recognizing speech uttered by the user, according to the recognized word information and the vision information,

wherein the image-based speech recognition apparatus extracts first phoneme order information from the recognized word information, extracts second phoneme order information from the vision information, and determines whether to adopt the recognized word information according to the first and second phoneme order information, and

wherein the image-based speech recognition apparatus analyzes an ASCII code of the recognized word information to extract a vowel for each syllable, determines a phoneme group corresponding to the extracted vowel, and generates the first phoneme order information according to the determined phoneme group.

The speech recognition system of claim 1,

further comprising a voice-based speech recognition apparatus that generates the recognized word information by recognizing the speech uttered by the user and provides the recognized word information to the image-based speech recognition apparatus.

The speech recognition system of claim 1,

wherein the image-based speech recognition apparatus extracts the vowels uttered by the user from the vision information and classifies them by phoneme group.

The speech recognition system of claim 3,

wherein the phoneme groups are divided according to the similarity of the feature elements of each vowel.

delete

The speech recognition system of claim 1,

wherein the image-based speech recognition apparatus calculates values of lip-shape feature elements based on the vision information and generates the second phoneme order information according to the phoneme groups corresponding to the calculated feature element values.

The speech recognition system of claim 7,

wherein the feature elements comprise at least one of lip width, lip height, the ratio of lip width to lip height, and lip area.

The speech recognition system of claim 7,

wherein the image-based speech recognition apparatus divides a feature element space, defined by the feature elements, among the phoneme groups and extracts the phoneme group corresponding to the calculated feature element values using a maximum likelihood value.

A robot system comprising the speech recognition system of any one of claims 1 to 4 and 7 to 9.

A dialogue system comprising the speech recognition system of any one of claims 1 to 4 and 7 to 9.

A speech recognition method comprising:

generating vision information by photographing a user; and

determining whether to adopt recognized word information, generated by recognizing a speech signal uttered by the user, according to the recognized word information and the vision information,

wherein the determining comprises:

extracting first phoneme order information from the recognized word information;

extracting second phoneme order information from the vision information; and

determining whether to adopt the recognized word information according to the first and second phoneme order information,

and wherein the extracting of the first phoneme order information comprises:

extracting a vowel for each syllable by analyzing the ASCII code of the recognized word information;

determining a phoneme group corresponding to the extracted vowel; and

generating the first phoneme order information according to the determined phoneme group.

delete

The method of claim 12,

wherein the determining comprises extracting the vowels uttered by the user from the vision information and classifying them by phoneme group.

The method of claim 14,

delete

The method of claim 12,

wherein the extracting of the second phoneme order information comprises:

calculating values of lip-shape feature elements based on the vision information; and

generating the second phoneme order information according to the phoneme groups corresponding to the calculated feature element values.

The method of claim 18,

wherein the feature elements comprise at least one of lip width, lip height, the ratio of lip width to lip height, and lip area.

The method of claim 18,

wherein the extracting of the second phoneme order information further comprises dividing a feature element space, defined by the feature elements, among the phoneme groups and extracting the phoneme group corresponding to the calculated feature element values using a maximum likelihood value.

A computer readable medium having recorded thereon a program for causing a computer to execute the method of any one of claims 12, 14, 15, and 18-20.