KR20060064493A

KR20060064493A - Voice interface system and method

Info

Publication number: KR20060064493A
Application number: KR1020050069038A
Authority: KR
Inventors: 김상훈; 이영직
Original assignee: 한국전자통신연구원
Priority date: 2004-12-08
Filing date: 2005-07-28
Publication date: 2006-06-13
Also published as: KR100622019B1

Abstract

The present invention relates to a voice interface system and method, and more particularly to a voice interface system and method that can be used in applications such as intelligent robots, enables natural voice communication, and improves speech recognition performance.

The present invention provides a voice interface server comprising: a speech recognition module that performs speech recognition using voice data and determines whether the performed speech recognition is in error; and an H/O error post-processing module that obtains a speech recognition result through a human operator when the speech recognition module determines that there is a speech recognition error.

Description

Voice interface system and method {VOICE INTERFACE SYSTEM AND METHOD}

FIG. 1 is a diagram illustrating a voice interface system according to an embodiment of the present invention.

FIG. 2 is a diagram illustrating the signal processing flow of a voice interface system according to an embodiment of the present invention.

FIG. 3 is a diagram illustrating the information processing flow when a speech recognition result is correctly recognized and when it is misrecognized.

FIG. 4 is a diagram illustrating a voice interface method that may be performed in a system such as the voice interface system shown in FIG. 1.

FIG. 5 is a diagram illustrating an example of the H/O error post-processing step in the voice interface method shown in FIG. 4.

FIG. 6 is a diagram illustrating an example of the dialogue model step in the voice interface method shown in FIG. 4.

Speech recognition is a very convenient function that lets a user control home appliances or terminals by voice, or access desired information by voice, and attempts to apply it to intelligent robots, telematics, home networks, and the like have recently been increasing. For intelligent robots in particular, interfaces such as a keyboard or mouse are very impractical, so interfaces such as speech recognition, image recognition (gesture and character recognition), and sensors (ultrasonic, infrared) are known to be effective. Among these, speech recognition is known as the most natural interface for the user.

However, voice interfaces in the prior art mostly recognized and executed fewer than 100 simple voice commands, and the recognition/synthesis engines were embedded in the robot in stand-alone form, so resource constraints such as CPU and memory made an interactive voice interface difficult. The commands also consisted mainly of robot drive commands or menu selection commands, which limited the application services that could be built on the robot. In addition, the prior art had inadequate measures for handling speech recognition errors and user errors, so using a voice interface system was, if anything, inconvenient.

Accordingly, the technical problem addressed by the present invention is to solve the problems described above by proposing a voice interface method and apparatus that enables human-robot interaction so that intelligent robots can be used in everyday life, taking into account not only speech recognition performance but also the handling of recognition errors, the handling of human (user) errors, real-time operation, and usability.

As a technical means for achieving the above object, a first aspect of the present invention provides a voice interface server comprising: a speech recognition module that performs speech recognition using voice data and determines whether the performed speech recognition is in error; and an H/O error post-processing module that obtains a speech recognition result through a human operator when the speech recognition module determines that there is a speech recognition error.

A second aspect of the present invention provides a voice interface system comprising: a voice interface client that converts speech uttered by a user into voice data and transmits the converted voice data to a voice interface server over a communication link; and a voice interface server that performs speech recognition using the voice data received from the voice interface client and, when the speech recognition is judged likely to contain a large error, obtains a speech recognition result through a human operator.

A third aspect of the present invention provides a voice interface server comprising: a speech recognition module that performs speech recognition using voice data; a dialogue model module that, when the speech recognition result from the speech recognition module is a misrecognition or contains a semantic error, forms a system response that is a question for correcting the error; and a speech synthesis module that converts the question into voice data.

A fourth aspect of the present invention provides a voice interface system comprising: a voice interface client that converts speech uttered by a user into voice data and transmits the converted voice data to a voice interface server over a communication link; and a voice interface server that performs speech recognition using the voice data received from the voice interface client and, when the speech recognition result is a misrecognition or contains a semantic error, forms a system response that is a question for correcting the error.

A fifth aspect of the present invention provides a speech recognition method comprising: (a) a speech recognition step of performing speech recognition using voice data and determining whether the performed speech recognition is in error; and (b) an H/O error post-processing step of obtaining a speech recognition result through a human operator when step (a) determines that there is a speech recognition error.

A sixth aspect of the present invention provides a speech recognition method comprising: (a) a speech recognition step of performing speech recognition using voice data; (b) a step of forming a system response that is a question for correcting an error when the speech recognition result obtained in step (a) is a misrecognition or contains a semantic error; and (c) a step of converting the system response into voice data.

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings. However, the embodiments may be modified in various forms, and the scope of the present invention should not be construed as limited to the embodiments described below. The embodiments are provided to explain the present invention more completely to those of ordinary skill in the art.

Referring to FIG. 1, the voice interface system includes a voice interface server (10) and voice interface clients (20a, 20b, 20c).

The voice interface clients (20a, 20b, 20c) convert speech uttered by a user into voice data and transmit the converted voice data to the voice interface server (10) over a communication link. The voice interface clients (20a, 20b, 20c) may be robots such as intelligent robots, and communicate with the voice interface server (10) through wireless communication such as a wireless LAN, or through wired communication. The voice interface clients (20a, 20b, 20c) may have an endpoint detection function that identifies the start point and end point of a speech segment. In that case, the voice interface clients (20a, 20b, 20c) distinguish silent segments from speech segments and transmit the voice data corresponding to the speech segments to the voice interface server (10).

The voice interface server (10) performs speech recognition using the voice data received from the voice interface clients (20a, 20b, 20c). The voice interface server (10) includes a speech recognition module (11), and may include an H/O (human operator) error post-processing module (12), a dialogue model module (13), and a speech synthesis module (14). The voice interface server (10) may additionally include a server management module (15). Each module constituting the voice interface server (10) may be implemented as a separate server or separate hardware. Alternatively, each module may be implemented as a separate program running on a single server or piece of hardware.

The speech recognition module (11) performs speech recognition using the voice data received from the voice interface clients (20a, 20b, 20c). When the voice interface server (10) includes the H/O error post-processing module (12), the speech recognition module (11) may determine whether the result of the performed speech recognition contains an error and, if so, notify the H/O error post-processing module (12) of the result.

The H/O (human operator) error post-processing module (12) obtains a speech recognition result through a human operator when the speech recognition module (11) determines that there is a speech recognition error. More specifically, when the speech recognition module (11) determines that there is a recognition error, the error is remedied by, for example, having a person listen to the speech directly and enter the correct recognition result. The H/O error post-processing module (12) may display the cumulative number of speech recognition errors per user, so that errors of users with many accumulated rejections can be corrected first, thereby minimizing user dissatisfaction. The H/O error post-processing module (12) may also display frequently misrecognized words, so that the human operator can easily select the correct recognition result and correct errors efficiently. The H/O error post-processing module (12) may also display at least one word judged to be closest to the word in which the error occurred, so that the human operator can quickly find the correct recognition result among the displayed words. The H/O error post-processing module (12) may also display the dialogue history, so that the human operator can select the correct recognition result more accurately and efficiently. The H/O error post-processing module (12) may also have an automatic word indexing function: when only a few phonemes have been typed, matching words are listed so that the correct word can be selected easily without typing the remaining phonemes. Finally, the H/O error post-processing module (12) may have a variable speech-rate function, which lets the operator listen to the speech at high speed and then record the correct recognition result, improving the speed of H/O error post-processing.
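The automatic word indexing function described above can be sketched as a prefix lookup over a sorted vocabulary. This is a minimal illustration only: the function name `complete_prefix`, the `limit` parameter, and the English sample vocabulary are assumptions, and Latin letters stand in for the phonemes the operator would actually type.

```python
from bisect import bisect_left


def complete_prefix(sorted_vocab, typed, limit=5):
    """Return up to `limit` entries of `sorted_vocab` that start with `typed`."""
    # In sorted order, all entries sharing a prefix are contiguous and begin
    # at the first position whose entry is >= the prefix itself.
    start = bisect_left(sorted_vocab, typed)
    matches = []
    for word in sorted_vocab[start:]:
        if not word.startswith(typed):
            break  # no later entry can match
        matches.append(word)
        if len(matches) == limit:
            break
    return matches


# Hypothetical recognition vocabulary, sorted once up front.
vocab = sorted(["schedule", "school", "score", "today", "tomorrow", "weather"])
print(complete_prefix(vocab, "sc"))
```

The operator would then pick one of the listed candidates instead of typing the rest of the word.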

The dialogue model module (13) forms a system response for correcting an error when the speech recognition result obtained from the speech recognition module (11) or the H/O error post-processing module (12) contains a semantic error. As an example of a semantic error, suppose that "[date] + [weather]" is a combination with no semantic error. In this case, if the recognition result is simply "weather", the result contains a semantic error, and the user needs to be asked which date's weather is being requested. Likewise, if the recognition result is "father + weather", the result again contains a semantic error, and the user needs to be asked which date's weather is being requested. By forming a system response that corrects a semantic error in this way, the dialogue model module (13) facilitates interactive exchange between the user and the voice interface.
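One way to picture the "[date] + [weather]" check is as slot filling: each recognized keyword fills a slot, and a missing required slot triggers a clarifying question. The sketch below is an assumption-laden illustration, not the patent's implementation; the slot lexicon `SLOT_OF`, the frames in `REQUIRED`, the question texts, and the function name `form_system_response` are all hypothetical.

```python
# Hypothetical slot lexicon: which recognized keyword fills which slot.
SLOT_OF = {
    "today": "date", "tomorrow": "date",
    "weather": "topic", "schedule": "topic",
    "father": "person", "mother": "person",
}
# Hypothetical frames: slots each topic needs in addition to the topic itself.
REQUIRED = {"weather": ["date"], "schedule": ["date", "person"]}
# Hypothetical clarifying questions, one per slot.
QUESTION = {
    "date": "For which date?",
    "person": "Whose schedule?",
    "topic": "What would you like to know?",
}


def form_system_response(keywords):
    """Return a clarifying question, or None when no required slot is missing."""
    slots = {}
    for word in keywords:
        slot = SLOT_OF.get(word)
        if slot is not None:
            slots.setdefault(slot, word)  # keep the first keyword per slot
    topic = slots.get("topic")
    if topic is None:
        return QUESTION["topic"]
    for needed in REQUIRED[topic]:
        if needed not in slots:
            return QUESTION[needed]
    return None


print(form_system_response(["father", "weather"]))
```

With these assumed frames, both "weather" alone and "father + weather" lack a date and produce the same question, matching the two examples in the paragraph above.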

The speech synthesis module (14) converts the system response output by the dialogue model module (13) into voice data and transmits it to the voice interface clients (20a, 20b, 20c).

The server management module (15) is a module that may be used when the speech recognition module (11), the H/O error post-processing module (12), the dialogue model module (13), and the speech synthesis module (14) are each implemented as independent servers, and it can enable real-time processing through load distribution.

When a robot is used in each household, there may be several voice interface clients (20a, 20b, 20c) within a household. Each household requests information from the voice interface server (10) over a communication link such as a wireless LAN, and the voice interface server (10) provides the result of processing the information according to the voice data received from the voice interface clients (20a, 20b, 20c). This structure allows users to purchase the voice interface clients (20a, 20b, 20c) at low cost and allows services to be provided in real time by having the voice interface server (10) handle the various information processing tasks. Information is preferably exchanged between the voice interface server (10) and the voice interface clients (20a, 20b, 20c) using packets.

FIG. 2 is a diagram illustrating the signal processing flow of a voice interface system according to an embodiment of the present invention, and FIG. 3 is a diagram illustrating the information processing flow when a speech recognition result is correctly recognized and when it is misrecognized.

Referring to FIGS. 2 and 3, the information processing flow when the speech recognition result is correctly recognized includes: a step (S11) in which the user (30) utters a voice command such as "What is today's schedule?"; a step (S12) in which the voice interface client (20) detects the speech segment in the voice data uttered by the user (30) and transmits the detected voice data; a step (S13) in which the speech recognition module (11) uses the transmitted voice data to perform speech recognition that correctly recognizes "today" and "schedule"; a step (S14) in which the dialogue model module (13) forms a system response such as "Whose schedule?" according to the recognition result; a step (S15) in which the speech synthesis module (14) converts the system response into voice data; and a step (S16) in which the voice interface client (20) speaks to the user according to the system response converted into voice data.

The information processing flow when the speech recognition result is misrecognized includes: a step (S21) in which the user (30) utters a voice command such as "What is today's schedule?"; a step (S22) in which the voice interface client (20) detects the speech segment in the voice data uttered by the user (30) and transmits the detected voice data; a step (S23) in which the speech recognition module (11) uses the transmitted voice data to perform speech recognition whose result is judged to be a misrecognition; a step (S24) in which the error is corrected by a human operator in the H/O error post-processing module (12) to form the recognition result "today" and "schedule"; a step (S25) in which the dialogue model module (13) forms a system response such as "Whose schedule?" according to the recognition result; a step (S26) in which the speech synthesis module (14) converts the system response into voice data; and a step (S27) in which the voice interface client (20) speaks to the user according to the system response converted into voice data.
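The two flows of FIG. 3 differ only in whether the human operator is consulted between recognition and the dialogue model. A minimal sketch of that routing follows; the function `handle_utterance` and its four callable parameters are hypothetical stand-ins for modules (11) to (14), not interfaces defined by the patent.

```python
def handle_utterance(voice_data, recognize, ask_operator, make_response, synthesize):
    """Route one utterance through the server-side flow and return response audio.

    All four callables are assumed stand-ins:
      recognize(voice_data)              -> (keywords, is_error)
      ask_operator(voice_data, keywords) -> corrected keywords
      make_response(keywords)            -> response text
      synthesize(text)                   -> voice data
    """
    keywords, is_error = recognize(voice_data)         # S13 or S23
    if is_error:
        keywords = ask_operator(voice_data, keywords)  # S24: human correction
    response_text = make_response(keywords)            # S14 or S25
    return synthesize(response_text)                   # S15 or S26
```

Passing the modules in as callables mirrors the description's point that each module may run as a separate server or as a separate program on one machine.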

Referring to FIG. 4, the voice interface method includes a speech enhancement step (S31), a speech endpoint detection step (S32), a speech/non-speech verification step (S33), a speech feature extraction step (S34), a real-time noise compensation step (S35), a keyword search step (S36), an online speaker adaptation step (S37), an utterance verification step (S38), an H/O error post-processing step (S39), a dialogue model step (S40), and a speech synthesis step (S41). Of these, the speech enhancement step (S31) and the speech endpoint detection step (S32) may be performed in the voice interface client, and the remaining steps may be performed in the voice interface server. If the speech endpoint detection step (S32) consists of two stages, the first stage may be performed in the voice interface client and the second stage in the voice interface server. For convenience, the speech enhancement step (S31), the speech endpoint detection step (S32), the speech/non-speech verification step (S33), the speech feature extraction step (S34), the real-time noise compensation step (S35), the keyword search step (S36), the online speaker adaptation step (S37), and the utterance verification step (S38) may be collectively referred to as the speech recognition step (S42). The speech/non-speech verification step (S33), the speech feature extraction step (S34), the real-time noise compensation step (S35), the keyword search step (S36), the online speaker adaptation step (S37), and the utterance verification step (S38) may be performed in the speech recognition module, the H/O error post-processing step (S39) in the H/O error post-processing module, the dialogue model step (S40) in the dialogue model module, and the speech synthesis step (S41) in the speech synthesis module.

The speech enhancement step (S31) mainly removes stationary background noise to raise the intelligibility of the speech relative to the background noise, and performs functions such as array signal processing and Wiener filtering.

The speech endpoint detection step (S32) distinguishes silent segments from speech segments; as one example, speech endpoint detection may be performed in two stages. Two-stage endpoint detection may include a first stage that performs primary endpoint detection using the energy information of the speech, and a second stage that detects the endpoints more precisely from the first-stage result with a statistical model, using the GSAP (Global speech absent probability). Of the two stages, the first may be performed in the voice interface client and the second in the voice interface server.
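The first, energy-based stage can be sketched as a frame energy threshold. The sketch below assumes non-overlapping frames, a fixed hypothetical threshold, and samples given as a plain list of floats; the function names are illustrative, and the statistical second stage (GSAP) is not modeled.

```python
def frame_energies(samples, frame_len):
    """Mean squared amplitude of each full, non-overlapping frame."""
    return [
        sum(x * x for x in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def detect_endpoints(samples, frame_len=4, threshold=0.01):
    """Return (start, end) sample indices of the speech span, or None if silent.

    `end` is exclusive. A frame counts as speech when its energy exceeds
    `threshold`; the span runs from the first to the last such frame.
    """
    energies = frame_energies(samples, frame_len)
    active = [i for i, energy in enumerate(energies) if energy > threshold]
    if not active:
        return None
    return active[0] * frame_len, (active[-1] + 1) * frame_len


# Two silent frames, two loud frames, one silent frame.
signal = [0.0] * 8 + [0.5, -0.5, 0.4, -0.4] * 2 + [0.0] * 4
print(detect_endpoints(signal))
```

A client that runs only this stage could send `signal[start:end]` to the server, which would then refine the boundaries in the second stage.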

In the speech/non-speech verification step (S33), the endpoint-detected audio is verified as speech or noise by, for example, a GMM (Gaussian Mixture Model) based speech/non-speech verification method. If it is judged to be mere noise, the subsequent steps are not performed; if it is judged to be speech, the subsequent steps are performed.
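A GMM-based speech/non-speech decision can be illustrated by comparing the likelihood of a segment under a speech model and under a noise model. The sketch below is a simplified assumption: features are scalars, each GMM is a list of hypothetical (weight, mean, variance) triples, and the parameter values are made up for illustration rather than trained.

```python
import math


def gmm_loglik(x, components):
    """Log-likelihood of scalar x under a 1-D GMM of (weight, mean, var) triples."""
    total = 0.0
    for weight, mean, var in components:
        density = math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)
        total += weight * density
    # Note: inputs very far from every mean would underflow to log(0);
    # a production version would work in the log domain throughout.
    return math.log(total)


def is_speech(features, speech_gmm, noise_gmm):
    """True when the summed log-likelihood favors the speech model."""
    speech = sum(gmm_loglik(x, speech_gmm) for x in features)
    noise = sum(gmm_loglik(x, noise_gmm) for x in features)
    return speech > noise


# Hypothetical models over a single log-energy-like feature.
SPEECH_GMM = [(0.5, 4.0, 1.0), (0.5, 6.0, 1.0)]
NOISE_GMM = [(1.0, 0.0, 1.0)]
print(is_speech([4.5, 5.0, 5.5], SPEECH_GMM, NOISE_GMM))
```

Only segments for which `is_speech` holds would continue to feature extraction and the later steps.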

In the speech feature extraction step (S34), feature parameters (e.g., filter bank, mel-cepstrum) are extracted from the speech.

In the noise compensation step (S35), non-stationary background noise in the speech segment is removed in real time by the IMM (Interactive Multiple Model) method. The final noise-removed feature parameters are used to compute probability values from an HMM (Hidden Markov Model) acoustic model, and the recognition result is output by comparing the probability values across the candidate words in the recognition vocabulary.

In the keyword search step (S36), a large recognition vocabulary (e.g., 1,000 words or more) increases recognition time, so a fast search method such as tree search is used so that the recognition result can be output in real time. When uttering a voice command, the user may speak not only isolated words but also short sentences; the keywords contained in the short sentence are extracted and only those keywords are recognized, which makes speaking considerably more convenient for the user than isolated-word utterance.

In the online speaker adaptation step (S37), online speaker adaptation reflects the speaker's voice characteristics in real time in the previously modeled speaker-independent voice characteristics, preventing the degradation in recognition performance caused by speaker-independent acoustic modeling.

In the utterance verification step (S38), the speech recognition result is verified as a correct recognition or a misrecognition. In general, when a misrecognition occurs, using the misrecognized result directly in the system response greatly reduces user satisfaction, and the speech recognition function may cause inconvenience rather than convenience. To resolve user dissatisfaction caused by such recognition errors, a rejection function is very important for user convenience: the recognition result is verified once more, passed on to the system response only when it can be trusted as a correct recognition, and otherwise the user is asked to speak again. The utterance verification step (S38) may include a first stage that verifies using score values extracted from various LLR (Log Likelihood Ratio) values (e.g., anti-model LLR score, N-best LLR score, combination of LLR scores, word duration), and a second stage that raises the reliability of utterance verification using the intermediate results output by each stage of the recognition modules and metadata (e.g., SNR, gender, age, number of syllables, phonological structure, pitch, speaking rate, dialect). Based on the final utterance verification result, the voice interface server decides whether to proceed to the next step, the dialogue model or the human operator, or to ask the user to speak again.
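The anti-model LLR score mentioned above compares how well the recognized word's model explains the audio against how well a competing anti-model does, LLR = log p(X | word model) - log p(X | anti-model), usually normalized by the number of frames. The sketch below shows only that single score with a fixed cutoff; the threshold value, the function names, and the accept/reject labels are assumptions, and the score combination and metadata-based second stage are not modeled.

```python
def llr_score(target_loglik, anti_loglik, num_frames):
    """Frame-normalized log likelihood ratio of the word model vs. the anti-model."""
    return (target_loglik - anti_loglik) / num_frames


def verify(target_loglik, anti_loglik, num_frames, threshold=0.5):
    """Return "accept" when the normalized LLR reaches `threshold`, else "reject".

    `threshold` is a hypothetical operating point; in practice it would be
    tuned to trade false acceptances against false rejections.
    """
    score = llr_score(target_loglik, anti_loglik, num_frames)
    return "accept" if score >= threshold else "reject"


# Log-likelihoods over 100 frames (made-up numbers for illustration).
print(verify(-120.0, -200.0, 100))
print(verify(-150.0, -160.0, 100))
```

An accepted result would go to the dialogue model, while a rejected one would go to the human operator or trigger a request to speak again.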

The H/O error post-processing step (S39) is performed when the utterance verification step (S38) judges the speech recognition to be a misrecognition; in this step a human operator corrects the recognition error into a correct recognition result.

The dialogue model step (S40) receives the speech recognition result directly from the speech recognition module, or receives the corrected speech recognition result from the H/O error post-processing step (S39), verifies through semantic error post-processing whether the result contains a semantic error (e.g., "today, father, schedule" contains no semantic error, whereas "weather, father, schedule" does), and generates a system response that prompts the user to say the missing semantic word (keyword) again.

The speech synthesis step (S41) forms voice data according to the system response. Here, the modality may be analyzed from the system response sentence so that the response is converted into conversational-style synthesized speech matching the speaker's intent and played to the user.

FIG. 5 shows an example of the H/O error post-processing step in the voice interface method of FIG. 4. In the H/O error post-processing step, it is very important to handle the multiple misrecognition results arriving from many voice interface clients so that the many users can be answered quickly. The present invention therefore proposes ways for the human operator to correct misrecognition results into correct recognition results efficiently.

The H/O error post-processing step may include a rejection count display step (S51). In this step, the number of rejections made in the utterance verification step is accumulated in the DB (41), so that errors of users with a high accumulated rejection count can be corrected first, thereby minimizing user dissatisfaction.
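The prioritization in step S51 can be sketched as follows. This is a minimal illustration, assuming an in-memory counter in place of the DB (41); the names `rejection_counts`, `record_rejection`, and `operator_queue`, and the shape of the pending items, are hypothetical.

```python
from collections import Counter

# Stands in for the accumulated rejection counts kept in DB (41).
rejection_counts = Counter()


def record_rejection(user_id):
    """Accumulate one rejection made by utterance verification for this user."""
    rejection_counts[user_id] += 1


def operator_queue(pending):
    """Order pending misrecognitions so the most-rejected users come first.

    `pending` is a list of dicts with a "user" key. Users with equal counts
    keep their arrival order because Python's sort is stable.
    """
    return sorted(pending, key=lambda item: rejection_counts[item["user"]], reverse=True)
```

An operator display could then show the count next to each queued utterance, so that the user who has been asked to repeat most often is served first.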

The H/O error post-processing step may include a frequently misrecognized word display step (S52). In this step, frequently misrecognized words are registered in the DB (42) and displayed, so that the operator can easily select the correct recognition result and correct errors efficiently.

The H/O error post-processing step may include a best recognition result display step (S53). In this step, a plurality of words closest to the recognition result of the misrecognized word are displayed, so that the correct recognition result can be found quickly among the displayed words.

The H/O error post-processing step may include a conversation history display step (S54). In this step, the log of the dialogue between the user and the voice interface system is displayed, so that the operator can select the correct recognition result more accurately.

The H/O error post-processing step may include an automatic word indexing step (S55). In this step, to help the operator find the correct word quickly, the matching words are listed as soon as a few phonemes have been typed, so that the correct word can be selected immediately without typing the remaining phonemes.
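Automatic word indexing of this kind is essentially prefix completion over the recognition vocabulary. The sketch below is one possible realization, not the method of the specification: the class `WordIndex`, the romanized example words, and the result limit are assumptions, and each character of a string stands in for one typed phoneme.

```python
import bisect


class WordIndex:
    """Sorted vocabulary that lists words starting with the typed phonemes."""

    def __init__(self, words):
        self._words = sorted(set(words))

    def complete(self, prefix, limit=5):
        """Return up to `limit` words beginning with `prefix`, in sorted order."""
        # All words sharing a prefix are contiguous in a sorted list,
        # so a binary search finds where the matches begin.
        start = bisect.bisect_left(self._words, prefix)
        matches = []
        for word in self._words[start:]:
            if not word.startswith(prefix) or len(matches) == limit:
                break
            matches.append(word)
        return matches
```

For example, with a vocabulary containing `naeil` and `nalssi`, typing `na` lists both words and typing `nal` narrows the list to `nalssi`.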

The H/O error post-processing step may include a speech rate variation step (S56). The time the operator spends listening to an utterance is proportional to its length, so the time needed to return a correct recognition result to the user through H/O error post-processing grows accordingly. The speech rate variation step (S56) therefore shortens the utterance substantially, to the extent that it remains intelligible (that is, it increases the speech rate), in order to speed up H/O error post-processing.
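The effect of step S56 on operator listening time is simple arithmetic: playing an utterance at rate r divides the listening time by r. A minimal sketch follows, in which the function name and the cap `max_intelligible_rate` are hypothetical; the specification gives no numeric limit.

```python
def playback_seconds(utterance_seconds, rate, max_intelligible_rate=2.0):
    """Listening time when an utterance is played back at the given speed-up.

    The rate is capped at an assumed intelligibility limit, because the
    utterance must remain understandable to the operator.
    """
    if rate <= 0:
        raise ValueError("rate must be positive")
    effective_rate = min(rate, max_intelligible_rate)
    return utterance_seconds / effective_rate
```

Under these assumptions, a 6-second utterance played at 1.5 times the original rate takes 4 seconds to audit, and a request for a higher rate than the cap is limited to the cap.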

The dialogue model step includes a semantic error post-processing step (S61), a search dialogue domain restriction step (S62), and a response dialogue sentence generation step (S63).

In the semantic error post-processing step (S61), the system verifies whether the recognized result contains a semantic error and, if so, asks the user to speak the missing semantic word (keyword) again. When such semantic ambiguity exists, a semantic relation rule table such as Table 1, stored in the DB (51), is used so that when an input does not match any predefined rule, the dialogue proceeds in the most similar form.
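A rule-table check of this kind can be sketched as below. Table 1 is not reproduced in this section, so the word classes in `WORD_CLASS`, the two rules in `RULES`, and the overlap-based notion of "most similar" are all assumptions made for illustration; the inputs follow the example given earlier for the dialogue model step.

```python
# Hypothetical word classes and rules; the actual Table 1 may differ.
WORD_CLASS = {
    "today": "time",
    "tomorrow": "time",
    "father": "person",
    "schedule": "schedule",
    "weather": "weather",
}
RULES = [("time", "person", "schedule"), ("time", "weather")]


def check_semantics(words):
    """Return (is_valid, missing_classes) for a recognized keyword sequence.

    If no rule matches exactly, the rule sharing the most classes with the
    input is taken as the most similar form, and the classes it still needs
    are reported so the system can ask the user for them.
    """
    classes = tuple(WORD_CLASS.get(word, "unknown") for word in words)
    if classes in RULES:
        return True, []
    best_rule = max(RULES, key=lambda rule: len(set(rule) & set(classes)))
    missing = [cls for cls in best_rule if cls not in classes]
    return False, missing
```

With these assumed rules, the keywords "weather, father, schedule" are closest to the time/person/schedule rule and lack a time word, so the system response would ask the user for the time.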

The vocabulary the user utters will be constrained by the conversational sentence generated in the dialogue model step. For example, if the generated sentence asks about a time, the answer will be a word within a limited range, such as a date or a time of day. Accordingly, the search dialogue domain restriction step (S62) narrows the range of keywords to be searched in the keyword search step and the like described above, and thereby improves the recognition rate.
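The restriction in step S62 amounts to limiting the recognizer's active vocabulary to the answer type the system has just asked for. The following sketch assumes a toy vocabulary and per-word recognition scores; the names `VOCAB_BY_ANSWER_TYPE`, `active_vocabulary`, and `recognize` are hypothetical.

```python
# Hypothetical vocabulary, grouped by the answer type a system question expects.
VOCAB_BY_ANSWER_TYPE = {
    "time": {"today", "tomorrow", "monday"},
    "person": {"father", "mother"},
}
FULL_VOCABULARY = set().union(*VOCAB_BY_ANSWER_TYPE.values())


def active_vocabulary(expected_answer_type=None):
    """Vocabulary to search: restricted if a question type is pending."""
    return VOCAB_BY_ANSWER_TYPE.get(expected_answer_type, FULL_VOCABULARY)


def recognize(scores, expected_answer_type=None):
    """Pick the best-scoring word among those in the active vocabulary."""
    allowed = active_vocabulary(expected_answer_type)
    candidates = {word: score for word, score in scores.items() if word in allowed}
    return max(candidates, key=candidates.get) if candidates else None
```

In this toy setting, an acoustically confusable word of the wrong type can win an unrestricted search, but is excluded once the system has asked a time question, which is how a narrower search range can raise the recognition rate.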

In the response dialogue sentence generation step (S63), the system response is generated.

Table 2 shows, over time, the operating states of input and output among the user, the voice interface client (robot), and the voice interface server.

In the initial stage, the robot (client) and the server are in a standby state, and the background noise picked up by the robot is sent to the server so that adaptation to the environment is performed in real time. When the user calls "robot" from a distance, the robot estimates the speaker's position using its microphone array, removes noise, detects the speech segment, and sends it to the server. The server performs speaker recognition to identify who is speaking, loads that speaker's personal information, and adapts the acoustic characteristics to the speaker. The robot turns toward the estimated speaker position and moves to within 50 cm of the user. It then receives synthesized speech from the server and outputs "How may I help you, Hong Gil-dong?" to the user. Meanwhile, the robot performs face tracking through image recognition so that it keeps looking at the speaker's lips, and multimodal information is extracted from the video information together with the audio information.

The user requests information from the robot (e.g., "How is the weather today?"), and the robot, as before, performs noise removal and speech endpoint detection and then transmits the speech to the server. From the transmitted speech, the server extracts the keywords contained in the short sentence (e.g., "today's weather") and performs speech recognition. A barge-in function is included so that speech recognition remains possible while synthesized speech is being output. Speech recognition is performed with online speaker adaptation, and utterance verification checks once more whether the recognition result is reliable. Depending on the verification result, the recognition result is either input directly to the dialogue model or passed to H/O error post-processing, where it is corrected into a correct recognition result; the final recognition result is then input to the dialogue model. The dialogue model generates a system response to the user's query (e.g., "The weather in Daejeon is clear today") and outputs it through the conversational speech synthesizer. The interactive voice interface between the user and the robot continues through the same process, and when a final termination signal (e.g., "OK") is given, the robot and the server return to the initial standby state.

The voice interface apparatus and method according to the present invention have the advantage that speech recognition errors can be minimized by performing H/O error post-processing.

In addition, the voice interface apparatus and method according to the present invention form an appropriate system response using the dialogue model. When a semantic error or a speech recognition error occurs, an appropriate question is presented to the user and the answer is obtained, which has the advantage that speech recognition errors or user errors can be handled appropriately.

In addition, the voice interface apparatus and method according to the present invention narrow the range of keywords searched during speech recognition according to the system response formed in the dialogue model, which has the advantage that the accuracy and speed of speech recognition can be improved.

In addition, in performing H/O error post-processing, the voice interface apparatus and method according to the present invention display at least one of the cumulative number of speech recognition errors per user, frequently misrecognized words, at least one word judged to be closest to the word in which the error occurred, and the conversation history, or provide an automatic word indexing function or a variable speech rate function. This has the advantage that the efficiency of H/O error post-processing can be improved.

In addition, the voice interface apparatus and method according to the present invention have a client/server structure, which has the advantage of allowing the client, in particular a robot client, to be purchased at low cost.

Claims

A voice recognition module which performs voice recognition using voice data and determines whether the performed voice recognition contains an error; and

an H/O error post-processing module for obtaining a speech recognition result through a human operator when the speech recognition module determines that there is an error in speech recognition.

The voice interface server of claim 1, wherein

the H/O error post-processing module is configured to display at least one of a cumulative number of voice recognition errors for each user, frequently misrecognized words, at least one word determined to be closest to the word in which the error occurred, and a conversation history.

The voice interface server of claim 1, wherein

the H/O error post-processing module has an automatic word indexing function.

The voice interface server of claim 1, wherein

the H/O error post-processing module has a variable speech rate function.

The voice interface server of claim 1, further comprising:

a dialogue model module for forming a system response which is a question for correcting an error when there is a semantic error in a speech recognition result obtained by the speech recognition module or the H/O error post-processing module; and

a speech synthesis module for converting the system response into speech data.

The voice interface server of claim 5, wherein

the voice recognition module searches only words in a range corresponding to the system response formed in the dialogue model module.

A voice interface client for converting a voice spoken by a user into voice data and transferring the converted voice data to a voice interface server through communication; and

a voice interface server that performs voice recognition using the voice data transmitted from the voice interface client, and obtains a voice recognition result through a human operator when the error in the voice recognition is determined to be large.

The voice interface system of claim 7, wherein

the voice interface server is a voice interface server according to any one of claims 1 to 6.

The voice interface system of claim 7, wherein

the voice interface client has a function for detecting endpoints in the voice data converted from the voice spoken by the user.

The voice interface system of claim 7, wherein

the voice interface client is a robot.

A speech recognition module for performing speech recognition using speech data;

a dialogue model module for forming a system response that is a question for correcting an error when the speech recognition result obtained in the speech recognition module is misrecognized or contains a semantic error; and

a speech synthesis module for converting the question into speech data.

The voice interface server of claim 11, wherein

the voice recognition module searches only words in a range corresponding to the question formed in the dialogue model module.

A voice interface system including a voice interface server configured to perform voice recognition using voice data transmitted from the voice interface client, and to form a system response that is a question for correcting an error if a voice recognition result is misrecognized or contains a semantic error.

The voice interface system of claim 13, wherein

the voice interface server is a voice interface server according to claim 11 or 12.

(a) performing voice recognition using voice data and determining whether the performed voice recognition contains an error; and

(b) an H/O error post-processing step of obtaining a speech recognition result through a human operator when it is determined in step (a) that there is an error in speech recognition.

The method of claim 15, wherein

step (a) comprises:

(a1) extracting feature parameters from the voice data;

(a2) searching for and obtaining a keyword from the extracted feature parameters; and

(a3) determining whether there is a speech recognition error by determining whether the keyword obtained in step (a2) is a correct recognition or a misrecognition.

The method of claim 16, wherein

step (a3) comprises:

determining whether there is a speech recognition error by using a score value extracted from at least one type of LLR value; and

determining whether there is a speech recognition error by using metadata.

The method of claim 16, wherein

step (a) further comprises:

(a4) reflecting the speaker's voice characteristics in real time in the speaker-independent speech characteristics.

The method of claim 16, wherein

step (a) further comprises:

(a5) a voice endpoint detection step, performed before step (a1), of distinguishing between the silent sections and the voice sections of the voice data.

The method of claim 19, wherein

step (a5) comprises:

detecting a voice endpoint using energy information of the voice; and

detecting a voice endpoint using GSAP.

The method of claim 19, wherein

step (a) further comprises:

(a6) verifying, before step (a1), whether the endpoint-detected voice data is voice or noise.

The method of claim 19, wherein

step (a) further comprises:

(a7) removing static background noise from the voice data before step (a5).

The method of claim 16, wherein

step (a) further comprises:

(a8) removing non-static background noise from the feature parameters extracted in step (a1).

The method of claim 15, wherein

step (b) comprises:

displaying at least one of a cumulative number of speech recognition errors for each user, frequently misrecognized words, at least one word determined to be closest to the word in which the error occurred, and a conversation history.

The method of claim 15, wherein

step (b) comprises:

when at least one phoneme is typed, listing words that include the typed phoneme.

The method of claim 15, wherein

step (b) comprises:

varying the speech rate.

The method of claim 15, further comprising:

(c) forming a question for correcting an error if there is a semantic error in the speech recognition result obtained in step (a) or step (b); and

(d) converting the question into speech data.

The method of claim 27, wherein

step (c) comprises:

(c1) determining whether there is a semantic error in the speech recognition result obtained in step (a) or step (b);

(c2) forming the question; and

(c3) controlling subsequent speech recognition to search only keywords in a range corresponding to the question.

(a) a voice recognition step of performing voice recognition using voice data;

(b) forming a system response which is a question for correcting an error when the speech recognition result obtained in step (a) is misrecognized or contains a semantic error; and

(c) converting the system response into voice data.

The method of claim 29, wherein

step (b) comprises:

(b1) determining whether the speech recognition result obtained in step (a) is misrecognized or contains a semantic error;

(b2) forming the system response; and

(b3) controlling subsequent speech recognition to search only words in a range corresponding to the system response.