KR100758789B1 - Multi-modal system

Info

Publication number: KR100758789B1
Application number: KR1020060048140A
Authority: KR
Inventors: 박성찬; 김학균; 박성수; 안세열; 정영준
Original assignee: 주식회사 케이티
Priority date: 2006-05-29
Filing date: 2006-05-29
Publication date: 2007-09-14

Abstract

A multi-modal system is provided that processes a plurality of modalities simultaneously and in parallel by using SCXML (State Chart eXtensible Markup Language), and that offers users and application developers a convenient and efficient user interface by handling sequential or concurrent multi-modal I/O (input/output) commands with SCXML. A markup language interpreter (240) interprets an SCXML document. A markup language execution engine (250) concurrently processes a plurality of different modality information items by activating, in parallel, states corresponding to the number of modalities used by a client, based on the interpreted SCXML document. The different modalities are a voice modality and a non-voice modality. The markup language execution engine receives voice and non-voice modality information from the client in parallel and, if the received modality information is partial information, fuses it with the input information of the other modality.

Description

Multi-modal System

FIG. 1 is a diagram describing, as a Harel state chart, an outline of an operation scenario for multi-modality parallel processing according to an embodiment of the present invention;

FIG. 2 is a diagram describing, as a Harel state chart, a specific operation scenario for multi-modality parallel processing according to an embodiment of the present invention;

FIG. 3 is an SCXML document for the state chart scenario of FIG. 2;

FIG. 4 is a configuration diagram of an embodiment of a multimodal system according to the present invention;

FIG. 5 is a diagram showing an embodiment of a session structure according to the present invention;

FIG. 6 is a detailed configuration diagram of an embodiment of a client according to the present invention;

FIG. 7 is a detailed configuration diagram of an embodiment of a multimodal server according to the present invention;

FIG. 8 is a detailed configuration diagram of an embodiment of a multimodal server according to the present invention;

FIG. 9 is a flowchart of an embodiment of a process for parallel processing of multi-modality input information according to the present invention.

* Explanation of reference numerals for the main parts of the drawings *

100: client
110: voice modality processing unit
111: microphone
112: speaker
113: VXML browser
120: non-voice modality processing unit
121: key input unit
122: display unit
123: HTML browser
130: communication unit
200: multimodal server

The present invention relates to a multimodal system and, more particularly, to a multimodal system that processes voice or non-voice modality input information simultaneously and in parallel.

Multimodal means multiple modalities. A modality is an individual sensory channel such as sight, hearing, touch, taste, or smell, and interaction that combines these modalities is called multimodal interaction.

A multimodal system allows a user to access information such as voice, data, video, audio, or other information, as well as e-mail, weather updates, banking transactions, news, or other information, in one or more modalities through a user agent program such as a graphical browser or a voice browser, and to receive information in a different modality. In particular, after submitting an information fetch request in one or more modalities, for example by speaking the request into a microphone, the user can receive the fetched information in the same modality (i.e., voice) or in a different modality (i.e., graphics), for example through a graphical browser that presents the returned information visually on a display screen. Within a communication device, the user agent program operates in a manner similar to a web browser or other suitable software program residing on a device connected to a network or to another terminal device.

Because a multimodal system provides interfaces for multiple modalities, the modalities must be synchronized with one another so that the user can switch between them. For example, it must be possible to switch from a non-voice modality to a voice modality and vice versa, and it must also be possible to use both at the same time.

Various multimodal systems that allow two or more user input modalities have been proposed. One proposal that uses a markup language is the X+V type multimodal system, in which, when a user program and an HTML (Hyper Text Markup Language) browser run on the client, the user can input voice and then enter text information using the markup language of the page. X+V, proposed by IBM and others, is a structure in which an HTML browser (a graphical browser) and a VXML (Voice Extensible Markup Language) browser (a voice browser) are embedded in the multimodal server.

However, although this proposal makes input/output synchronization and the integration and processing of sequential or simultaneous input modalities straightforward, multimodal input and output are processed under the same program. As a result, the many markup-language documents that ran in existing single-modality modes, such as HTML and VXML, must be modified to suit the multimodal system, and it is difficult to meet the requirements of individual modalities for distributed processing.

As another proposal, there is a markup-based multimodal system in which, to implement concurrent modalities, the modality for a user program includes VXML or HTML, and a specific command is obtained based on a concurrent-modality tag in order to integrate multimodal input information or synchronize input and output signals (see Korean Patent Laid-Open Publication No. 10-2004-0089677). However, this proposal requires separate hardware or software/firmware components, such as a concurrent multimodal synchronization coordinator and a multimodal fusion engine, to synchronize a plurality of modalities.

Accordingly, there is a need for an improved multimodal system that can efficiently process two or more modalities that are input and output sequentially or simultaneously.

The present invention has been proposed to meet the above need, and its object is to provide a multimodal system capable of processing a plurality of modalities simultaneously and in parallel by using a state chart markup language, and a corresponding multi-modality input processing method.

Other objects and advantages of the present invention can be understood from the following description and will become more apparent from the embodiments of the present invention. It will also be readily appreciated that the objects and advantages of the present invention can be realized by the means set forth in the claims and combinations thereof.

To achieve the above object, the present invention provides a multimodal system that is connected over a network to a client using different modality information and processes the different modality information in parallel, the system comprising: a markup language interpreter for interpreting a predetermined state chart markup language document; and a markup language execution unit for simultaneously processing the different modality information by activating, in parallel, states corresponding to the number of modalities used by the client according to the interpreted state chart markup language.

The present invention also provides a multimodal input information processing method for processing different modality information simultaneously and in parallel, the method comprising: a user session creation step of creating a user session for a connected client; a parallel state step of creating, based on a state chart markup language, modality component sessions corresponding to the number of modalities used by the client as sub-sessions of the user session and activating them in parallel; and a receiving step in which each of the modality component sessions activated in the parallel state step receives different modality input information in parallel.
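
The three steps of the method can be sketched in Python as follows. This is only an illustration under stated assumptions, not the patented implementation: `asyncio` tasks stand in for the parallel modality component sessions, and the names `receive`, `run_user_session`, `fake_click`, and `fake_speech` are hypothetical.

```python
import asyncio


async def receive(modality, source):
    """Receiving step: one modality component session waits for its own input."""
    value = await source()
    return modality, value


async def run_user_session(sources):
    """User session: start one sub-session per modality and run them in parallel.

    `sources` maps a modality name to a zero-argument coroutine function that
    yields that modality's input (an assumed stand-in for the real client link).
    """
    tasks = [asyncio.create_task(receive(m, src)) for m, src in sources.items()]
    results = await asyncio.gather(*tasks)  # all sessions are active at once
    return dict(results)


async def fake_click():
    # Hypothetical non-voice input: a point clicked on a map.
    await asyncio.sleep(0.02)
    return {"target": (120, 45)}


async def fake_speech():
    # Hypothetical voice input: a recognized command.
    await asyncio.sleep(0.01)
    return {"action": "zoom_in"}


print(asyncio.run(run_user_session({"html": fake_click, "vxml": fake_speech})))
```

The number of tasks follows the number of modalities the client uses, which mirrors the "states corresponding to the number of modalities" wording above.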

The following merely illustrates the principles of the invention. Those skilled in the art will therefore be able to devise various arrangements that, although not explicitly described or shown herein, embody the principles of the invention and fall within its concept and scope. Furthermore, all conditional terms and embodiments recited herein are, in principle, expressly intended only to aid understanding of the concept of the invention, and are to be understood as not being limited to the specifically recited embodiments and conditions. Moreover, all detailed descriptions reciting the principles, aspects, and embodiments of the invention, as well as specific embodiments, are to be understood as encompassing their structural and functional equivalents. Such equivalents are to be understood as including both currently known equivalents and equivalents developed in the future, that is, any elements invented to perform the same function regardless of structure.

Accordingly, the functions of the various elements shown in the drawings, including functional blocks labeled as processors or similar concepts, may be provided through the use of dedicated hardware as well as hardware capable of executing software in association with appropriate software. When provided by a processor, the functions may be provided by a single dedicated processor, a single shared processor, or a plurality of individual processors, some of which may be shared. Moreover, the use of terms such as processor, control, or similar concepts should not be construed as referring exclusively to hardware capable of executing software, and should be understood to implicitly include, without limitation, digital signal processor (DSP) hardware and ROM, RAM, and non-volatile memory for storing software. Other well-known, conventional hardware may also be included.

The above objects, features, and advantages will become more apparent from the following detailed description taken in conjunction with the accompanying drawings. In describing the present invention, detailed descriptions of related known technology are omitted where they could unnecessarily obscure the subject matter of the invention. Preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings.

FIG. 1 is a diagram describing, as a Harel state chart, an outline of an operation scenario for multi-modality parallel processing according to an embodiment of the present invention.

The present invention adopts a method of describing, in XML form, an operation scenario for parallel processing of multiple modalities, such as a voice modality and a non-voice modality, according to a scenario expressed as a Harel state chart. Scenario description in state chart form is intuitive and visual, and it has an event-driven, deterministic process structure.

As shown in FIG. 1, the state chart consists of a super state (1), states (2, 9), parallel states (6, 7), a transition (8), and actions (4, 5).

A state chart drawn up for a given operation scenario is described as a State Chart XML (SCXML) document. That is, State Chart XML is a document that maps a state chart scenario into XML style.
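
As a rough illustration of what such a mapping looks like, the Python sketch below parses a small SCXML document and lists the child states of each `<parallel>` element. The document is hypothetical and is not the SCXML document of FIG. 3; the state ids (`idle`, `system_on`, `visual`, `aural`) and the helper `parallel_regions` are assumptions chosen for illustration. Only the standard-library `xml.etree.ElementTree` module is used.

```python
import xml.etree.ElementTree as ET

SCXML_NS = "http://www.w3.org/2005/07/scxml"

# Hypothetical SCXML document: a super state containing one parallel block
# with one region per modality.
DOC = f"""
<scxml xmlns="{SCXML_NS}" version="1.0" initial="idle">
  <state id="idle">
    <transition event="turn_on" target="system_on"/>
  </state>
  <state id="system_on">
    <transition event="turn_off" target="idle"/>
    <parallel id="modalities">
      <state id="visual"/>
      <state id="aural"/>
    </parallel>
  </state>
</scxml>
"""


def parallel_regions(scxml_text):
    """Return {parallel id: [ids of its direct child states]}."""
    root = ET.fromstring(scxml_text.strip())
    ns = {"sc": SCXML_NS}
    return {
        p.get("id"): [s.get("id") for s in p.findall("sc:state", ns)]
        for p in root.iter(f"{{{SCXML_NS}}}parallel")
    }


print(parallel_regions(DOC))
```

Each child of a `<parallel>` element is active at the same time as its siblings, which is the property the system relies on to handle several modalities at once.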

The multimodal system according to the present invention creates an operation scenario for parallel processing of multiple modalities using a state chart and maps it to a machine-readable XML document. It thereby processes multiple voice or non-voice modalities that are input and output sequentially or simultaneously in parallel, in the same manner as the state chart scenario.

FIG. 2 is a diagram describing, as a Harel state chart, a specific operation scenario for multi-modality parallel processing according to an embodiment of the present invention, and FIG. 3 is an SCXML document for the state chart scenario of FIG. 2.

First, the states and transitions of this embodiment are listed below by reference numeral.

101: initial state of the idle state
110: idle state
120: system_on super state
130: super parallel state part
140: lower-level parallel non-voice modality (visual) state
170: lower-level parallel voice modality (aural) state
141: initial state of the non-voice modality (visual)
150: HTML input waiting state
160: HTML input receiving state
180: VXML input waiting state
190: VXML input receiving state
195: integration state
199: final state
111: turn_on event transition
121: turn_off event transition
151: transition from state 150 to state 160 when the html_on event occurs and html_value is partial information
152: transition from state 150 to state 195 when the html_on event occurs and html_value is complete information
161: timeout transition from state 160 to state 150 when timer >= waiting_time
162: timer incremented by 1 while timer < waiting_time
163: transition from state 160 to state 195
171: initial state of the voice modality (aural)
181: transition from state 180 to state 190 when the vxml_on event occurs and vxml_value is partial information
182: transition from state 180 to state 195 when the vxml_on event occurs and vxml_value is complete information
191: timeout transition from state 190 to state 180 when timer >= waiting_time
192: timer incremented by 1 while timer < waiting_time
193: transition from state 190 to state 195

As shown in FIG. 2, according to the scenario of this embodiment, when the two modality components, voice (aural) and non-voice (visual), are activated by a login event (111) generated by the user at the client in the initial state (101), they are mapped to two parallel states of the XML state chart. In state (120), state (130), which indicates a parallel state, is started, and its substates (140) and (170) are activated simultaneously.

State (140) contains its substates (150) and (160), and state (170) likewise contains its substates (180) and (190).

Accordingly, when event (111) occurs, states (150) and (180) are entered automatically to wait for user input. States (160) and (190) are temporarily active states in which the system remains only for a limited time. State (195) synchronizes or fuses the values after the parallel state ends and passes them to the final state (199).
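
The behaviour of one parallel region can be sketched as a small state machine. The Python class below is a minimal sketch under stated assumptions rather than the SCXML execution engine itself: the class name `ModalityRegion`, the state labels, the default `waiting_time` of 3 ticks, and the choice to discard the partial value on timeout are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ModalityRegion:
    """One parallel region of FIG. 2, e.g. visual (150/160) or aural (180/190)."""

    name: str
    waiting_time: int = 3    # assumed timeout, counted in ticks
    state: str = "waiting"   # "waiting", "receiving", or "integrated" (195)
    timer: int = 0
    value: object = None

    def on_input(self, value, complete):
        """html_on / vxml_on event: transitions 151/152 or 181/182."""
        if self.state != "waiting":
            return
        self.value = value
        self.timer = 0
        self.state = "integrated" if complete else "receiving"

    def tick(self):
        """Transitions 162/192 (count up) and 161/191 (timeout)."""
        if self.state != "receiving":
            return
        if self.timer < self.waiting_time:
            self.timer += 1
        else:
            # Assumption: a timed-out partial input is dropped.
            self.state, self.value, self.timer = "waiting", None, 0

    def fuse(self):
        """Transition 163/193: the other modality supplied the missing part."""
        if self.state == "receiving":
            self.state = "integrated"


visual = ModalityRegion("visual")
visual.on_input({"target": (120, 45)}, complete=False)
visual.fuse()
print(visual.state)
```

Running two such regions side by side, one per modality, corresponds to substates (140) and (170) being active at the same time.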

According to this multi-modality parallel processing scenario, when a client connects to the multimodal server and a multimodal application starts and requests information from the user in both voice mode and non-voice mode, the user's multi-modality inputs are fused and processed by the SCXML execution engine, which executes the markup language based on the SCXML document of FIG. 3, without any separate device such as a multimodal synchronization coordinator or a fusion engine.

For example, when the user clicks a point on a map while saying "zoom in" or "zoom out", the state of the state chart transitions according to the partial input information received, filling the fields of the different modalities and fusing the input information. In contrast, a voice command that needs no fusion with another modality, such as "go to start" or "reset", is already complete information; in this case the value of the other modality is synchronized.
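
The map example can be sketched as a slot-filling merge. The function below is a hypothetical illustration, not the fusion logic of the patent: the slot names `action` and `target`, the action names, and the rule for when a command counts as complete are assumptions.

```python
def fuse_inputs(visual, aural):
    """Merge the slots of two modality inputs into one command, or return None.

    Each argument is a dict of slots, or None if that modality gave no input.
    A command is treated as complete when it has an "action" and, for actions
    that act on a map position, also a "target".
    """
    needs_target = {"zoom_in", "zoom_out"}  # hypothetical action names
    merged = {}
    for part in (visual, aural):
        if part:
            merged.update(part)
    action = merged.get("action")
    if action is None:
        return None  # still partial: nothing says what to do
    if action in needs_target and "target" not in merged:
        return None  # still partial: nothing says where
    return merged


# A click plus the spoken word "zoom in" fuse into one command.
print(fuse_inputs({"target": (120, 45)}, {"action": "zoom_in"}))
# "reset" is complete on its own and needs no fusion.
print(fuse_inputs(None, {"action": "reset"}))
```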

FIG. 4 is a configuration diagram of an embodiment of a multimodal system according to the present invention.

As shown in FIG. 4, in the multimodal system a plurality of clients (100) are connected to a multimodal server (200) through a wired or wireless network (300).

The client (100) is a communication terminal device capable of running application software that supports multimodal interaction. For example, a mobile or fixed terminal device such as a PDA (Portable Digital Assistant), a mobile phone, or a PC (Personal Computer) may be used.

The client (100) receives voice or non-voice modality signals from the user through an HTML browser and a VXML browser, converts the input content (HTML data, VXML data) corresponding to the input signal and the associated event information into EMMA data, and transmits it to the multimodal server (200). It then outputs the output content in a voice or non-voice modality according to the information (URL, event information) returned from the multimodal server (200). Here, the EMMA data includes, in addition to the input content and event information, meta-information according to the EMMA specification.

EMMA (Extensible Multimodal Annotation Markup Language) is a specification proposed by the W3C in 2004 for representing the input results of multiple modalities in an integrated way; it is a standard language that connects input components with the interaction manager. It is a markup language that expresses processing results with semantic tags attached when the user uses several input means such as voice and key input. For example, it allows various meta-information to be expressed, such as input time, recognition result, semantic tags of the result, and confidence. In the present invention, the result obtained through an EMMA interpreter is converted into a context usable within the State Chart XML execution engine and used for markup control.
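
The conversion from an EMMA result to an engine context might look like the sketch below. It is an assumption-laden illustration: the sample document is hypothetical, the annotation names (`emma:mode`, `emma:confidence`, `emma:tokens`, `emma:start`, `emma:end`) follow the author's understanding of the W3C EMMA 1.0 vocabulary, and the shape of the returned context dict and the name `emma_to_context` are invented here, since the source does not specify them.

```python
import xml.etree.ElementTree as ET

EMMA_NS = "http://www.w3.org/2003/04/emma"

# Hypothetical EMMA result for the spoken command "zoom in".
SAMPLE = f"""
<emma:emma version="1.0" xmlns:emma="{EMMA_NS}">
  <emma:interpretation id="int1"
      emma:medium="acoustic" emma:mode="voice"
      emma:confidence="0.82" emma:tokens="zoom in"
      emma:start="1149200000000" emma:end="1149200001200">
    <action>zoom_in</action>
  </emma:interpretation>
</emma:emma>
"""


def emma_to_context(emma_text):
    """Flatten the first interpretation into a plain dict (assumed context shape)."""
    root = ET.fromstring(emma_text.strip())
    interp = root.find(f"{{{EMMA_NS}}}interpretation")

    def attr(name):
        # EMMA annotations are namespace-qualified attributes.
        return interp.get(f"{{{EMMA_NS}}}{name}")

    return {
        "mode": attr("mode"),
        "confidence": float(attr("confidence")),
        "tokens": attr("tokens"),
        "start": int(attr("start")),  # timestamps in milliseconds
        "end": int(attr("end")),
        "slots": {child.tag: child.text for child in interp},
    }


print(emma_to_context(SAMPLE))
```

A context of this kind is what a transition condition such as "vxml_value is partial information" could be evaluated against.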

The multimodal server (200) is connected to the client (100) through the network (300), handles sequential or simultaneous multimodal input/output interactions with the client (100), and transmits to the client the URL and event information corresponding to the EMMA data (input content, event information, additional information) received from the client.

The multimodal input/output interaction according to the present invention is described below.

The multimodal server (200) receives a predetermined operation scenario for multi-modality parallel processing described in State Chart XML and activates N user sessions, equal to the number of clients (100). As sub-sessions of each user session, it activates a state chart session and modality component sessions (HTML session, VXML session) corresponding to the modalities supported by the client, and it processes them in parallel, thereby handling the multimodal input/output interaction.

Here, through this multimodal input/output interaction, the multimodal server works with external components such as an external web server or a service bundle to process the input content received from the client, and transmits the corresponding output content and event information to the client.

FIG. 5 is a diagram showing an embodiment of a session structure according to the present invention.

As shown in FIG. 5, a session according to the present invention consists of a user session, a state chart session (SC session), and modality component sessions (HTML session, VXML session).

The user session is maintained throughout the entire interaction with the client, while the SC session and the modality component sessions are created and maintained while the user session is running.

Child sessions lie within the time span of their parent session. That is, a modality component session may be shorter than the SC session, and each SC session and modality component session has its own duration.

Consider, as an example, a scenario in which the voice modality and the non-voice modality cannot be used at the same time. If a voice component input is added while the user is interacting through a graphical user interface (hereinafter, GUI) in the HTML browser, the HTML session pauses for as long as the voice input/output continues. Then, after the HTML session is added, the VXML session is processed, and the interaction is subsequently completed by an HTML session.

In this case, there are one VXML session and two separate HTML sessions. The first HTML session starts and ends before the VXML session begins.

The duration of the VXML session in turn overlaps with the second HTML session and ends before it does. The SC session encompasses all three modality component sessions (two HTML sessions and one VXML session).

In a scenario in which the voice modality and the non-voice modality are used at the same time, on the other hand, the HTML session and the VXML session are processed sequentially or simultaneously in parallel.

Once the user logs in, a user session is created, loaded, and placed in memory, and an SC session is created under the user session. The user session is then initialized by the multimodal server.

When user applications such as the HTML browser and the VXML browser start, the multimodal server authenticates the user and creates the corresponding user session, reads the SCXML document stored on the multimodal server to create and initialize an SC session, and, according to the SCXML scenario, creates and initializes the modality component sessions that drive the HTML and VXML modality components.

In addition, at initialization the multimodal server sends an execution message to the client, as a URL or event information, so that each component session is started.

Each modality component session (HTML session, VXML session) ends when the client signals the end point in the markup language (e.g., an <exit> tag) or when a Halt event occurs according to the SCXML scenario.

In this way, the user session generates the life cycle events needed to operate the SC session, and the SC session generates those needed to operate the modality component sessions, communicating asynchronously with the client. When a parent session ends, all of its child sessions end automatically.
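
The parent/child rule can be sketched as a small session tree. This is a minimal illustration under assumptions, not the server's actual session manager: the class `Session`, the helper `open_user_session`, and the modality labels are hypothetical, and the asynchronous life cycle events are reduced to direct method calls.

```python
class Session:
    """A node in the session tree: user -> SC -> modality component."""

    def __init__(self, kind, parent=None):
        self.kind = kind
        self.active = True
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def end(self):
        """End this session after ending every session beneath it."""
        if not self.active:
            return
        for child in self.children:
            child.end()
        self.active = False


def open_user_session(modalities):
    """Build the tree for one logged-in client; returns (user, sc, components)."""
    user = Session("user")
    sc = Session("sc", parent=user)
    components = {m: Session(m, parent=sc) for m in modalities}
    return user, sc, components


user, sc, parts = open_user_session(["html", "vxml"])
parts["vxml"].end()                     # e.g. the client sent an <exit> tag
print(sc.active, parts["html"].active)  # the SC and HTML sessions keep running
user.end()                              # ending the user session closes everything below
print(sc.active, parts["html"].active)
```

Ending a component session leaves its siblings and parents untouched, while ending the user session cascades down the whole tree.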

FIG. 6 is a detailed configuration diagram of an embodiment of a client according to the present invention.

As shown in FIG. 6, the client (100) includes a voice modality processing unit (110) for processing voice modality information, a non-voice modality processing unit (120) for processing non-voice modality information such as a graphic modality, and a communication unit (130) for transmitting and receiving various data over the network.

The voice modality processing unit (110) includes a microphone (111) for receiving voice signals, a speaker (112) for outputting voice signals, and a VXML browser (113), which is a voice browser.

When a signal arriving through the microphone 111 persists for at least a predetermined threshold duration, the VXML browser 113 treats it as voice input. It extracts the start and end points of the voice input, runs speech recognition on the resulting voice signal data to generate text data, converts that text into the EMMA data format, and transmits the EMMA data to the multimodal server 200 through the communication unit 130. It also synthesizes a voice output signal according to the output content for the voice modality returned from the multimodal server and outputs it through the speaker 112. General functions of a VXML browser, such as interworking with an external web server, are well known to those skilled in the art, so a more detailed description is omitted.

The detailed configuration of the VXML browser is described below with reference to FIG. 7.

The non-voice modality processing unit 120 includes a key input unit 121 for receiving non-voice modality signals from the user, a display unit 122 for outputting graphic modality signals, and an HTML browser 123 serving as a graphic browser.

The HTML browser 123 converts the input content corresponding to a user input signal received from the key input unit 121 into the EMMA data format and transmits it through the communication unit 130. It also displays, as visual information on the display unit 122, the output signal corresponding to the non-voice modality output content returned from the multimodal server.

The HTML browser 123 is well known to those skilled in the art, so a detailed description is omitted. The key input unit 121 may be a keypad, a keyboard, a touch screen, or the like.

FIG. 7 is a detailed configuration diagram of one embodiment of a VXML browser according to the present invention.

As shown in FIG. 7, the VXML browser includes a voice feature extraction unit 131, a VXML interpreter 132, a recognizer 133, and a synthesizer 134.

When a signal arriving through the microphone 111 persists for at least a predetermined threshold duration, the voice feature extraction unit 131 treats it as voice input, extracts the start and end points of the voice input to obtain the voice signal data, and outputs that data to the VXML interpreter 132. The processing up to and including the voice feature extraction unit 131 is referred to as voice pre-processing.
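The duration-threshold rule for accepting voice input can be sketched as follows. The patent does not say how signal presence is measured, so this sketch assumes per-frame energy values and a hypothetical `energy_floor`; `min_frames` plays the role of the predetermined duration threshold.

```python
from typing import List, Optional, Tuple


def find_endpoints(
    frame_energy: List[float],
    energy_floor: float,
    min_frames: int,
) -> Optional[Tuple[int, int]]:
    """Return (start, end) frame indices of the first run of frames louder
    than energy_floor that lasts at least min_frames, or None if no run
    qualifies. The end index is exclusive."""
    start: Optional[int] = None
    for i, energy in enumerate(frame_energy):
        if energy > energy_floor:
            if start is None:
                start = i  # candidate start point
        else:
            if start is not None and i - start >= min_frames:
                return (start, i)  # run was long enough: voice input
            start = None  # run was too short: treat as noise
    # Handle a run that continues to the end of the buffer.
    if start is not None and len(frame_energy) - start >= min_frames:
        return (start, len(frame_energy))
    return None


energies = [0.0, 0.9, 0.1, 0.8, 0.9, 0.7, 0.8, 0.1, 0.0]
endpoints = find_endpoints(energies, energy_floor=0.5, min_frames=3)
```

In this example the single loud frame at index 1 is rejected as too short, and the four-frame run starting at index 3 is accepted.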

The recognizer 133 recognizes an input voice signal and outputs text data, and the synthesizer 134 synthesizes input text data into a voice signal and outputs it.

A general automatic speech recognizer (ASR) may be used as the recognizer 133, and a general text-to-speech (TTS) engine may be used as the synthesizer 134.

The VXML interpreter 132 passes the voice signal received from the voice feature extraction unit 131 to the recognizer 133 and instructs it to perform speech recognition, converts the recognized text data into the EMMA data format, and transmits it to the multimodal server 200. It also passes to the synthesizer 134 the text to be synthesized, determined by the output content for the voice modality received from the multimodal server 200, and instructs it to perform speech synthesis.
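As a rough illustration of the conversion into EMMA, the sketch below wraps recognized text in an EMMA document using Python's standard XML library. The namespace and the `emma:medium`, `emma:mode`, `emma:confidence`, and `emma:tokens` annotations follow the W3C EMMA 1.0 vocabulary; the `command` payload element, the `int1` identifier, and the function name are assumptions made for this example, since the patent does not specify the payload layout.

```python
import xml.etree.ElementTree as ET

EMMA_NS = "http://www.w3.org/2003/04/emma"
ET.register_namespace("emma", EMMA_NS)


def to_emma(text: str, medium: str, mode: str, confidence: float) -> str:
    """Wrap one recognition result in an EMMA document and return it as a string."""
    root = ET.Element(f"{{{EMMA_NS}}}emma", {"version": "1.0"})
    interpretation = ET.SubElement(
        root,
        f"{{{EMMA_NS}}}interpretation",
        {
            "id": "int1",
            f"{{{EMMA_NS}}}medium": medium,
            f"{{{EMMA_NS}}}mode": mode,
            f"{{{EMMA_NS}}}confidence": f"{confidence:.2f}",
            f"{{{EMMA_NS}}}tokens": text,
        },
    )
    # Hypothetical application payload: the recognized command text.
    ET.SubElement(interpretation, "command").text = text
    return ET.tostring(root, encoding="unicode")


emma_document = to_emma("zoom in", medium="acoustic", mode="voice", confidence=0.9)
```

The HTML browser could produce the same envelope for a key or pointer input by passing a different medium and mode, which is what lets the server treat both modalities uniformly.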

The synthesizer 134 receives the text to be synthesized from the VXML interpreter 132, performs speech synthesis on that text, and outputs the resulting synthesized voice signal through the speaker 112.

In this embodiment the VXML browser 113 is located on the client 100 side. Alternatively, a distributed speech recognition structure may be used in which only voice pre-processing is performed on the client 100 side and the VXML browser 113 is placed on the server side so that voice post-processing is performed there.

That is, depending on the distributed processing policy between server and client, start/end point extraction and feature extraction may take place during voice pre-processing on the client side, the extracted voice information may be transmitted to the VXML browser of the multimodal server, and the server-side VXML browser (210) may be responsible for voice post-processing.

FIG. 8 is a detailed configuration diagram of one embodiment of a multimodal server according to the present invention.

As shown in FIG. 8, the multimodal server 200 includes a communication unit 210, an EMMA interpreter 220, a session manager 230, an SCXML interpreter 240, an SCXML execution engine 250, and a profile DB 260.

The communication unit 210 exchanges data with the client over the network.

The EMMA interpreter 220 interprets the EMMA data received from the client, extracts the input content, event information, and meta information contained in it, and outputs them to the session manager 230 and the SCXML execution engine 250.
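The extraction of input content, event information, and meta information from an EMMA document can be sketched as below. The sample document is hypothetical: the `emma:` elements and attributes follow the W3C EMMA 1.0 vocabulary, while the `event` and `point` payload elements and the returned dictionary layout are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET
from typing import Dict

EMMA_NS = "http://www.w3.org/2003/04/emma"

# Hypothetical EMMA document for a click on a map shown in the HTML browser.
SAMPLE = """\
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">
  <emma:interpretation id="int1" emma:medium="tactile" emma:mode="gui">
    <event>click</event>
    <point x="120" y="48"/>
  </emma:interpretation>
</emma:emma>
"""


def parse_emma(document: str) -> Dict[str, object]:
    """Split one EMMA interpretation into meta information, event, and content."""
    root = ET.fromstring(document)
    interpretation = root.find(f"{{{EMMA_NS}}}interpretation")
    if interpretation is None:
        raise ValueError("no emma:interpretation element")
    prefix = f"{{{EMMA_NS}}}"
    # Meta information: the emma:* annotations on the interpretation.
    meta = {
        key[len(prefix):]: value
        for key, value in interpretation.attrib.items()
        if key.startswith(prefix)
    }
    # Event information: the text of the (hypothetical) <event> element.
    event = interpretation.findtext("event")
    # Input content: every other child element with its attributes.
    content = {
        child.tag: dict(child.attrib)
        for child in interpretation
        if child.tag != "event"
    }
    return {"meta": meta, "event": event, "content": content}


parsed = parse_emma(SAMPLE)
```

Keeping the three kinds of information separate mirrors the text: the session manager mainly needs the meta and event parts, while the execution engine also consumes the content.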

The session manager 230 creates, maintains, activates, and terminates the user session, the SC session, and the modality component sessions, and causes output content to be transmitted through the activated session. That is, each time a new user connects, the session manager 230 assigns a session ID and creates and manages a session. More specifically, the session manager 230 handles the creation, maintenance, activation, and termination of these sessions as described above with reference to FIG. 5, based on the information received from the EMMA interpreter 220 and on the interaction defined by the SCXML scenario running in the SCXML execution engine 250.

The SCXML interpreter 240 receives and compiles the SCXML document, generates a Java object model corresponding to the session activated by the session manager 230, and outputs it to the SCXML execution engine. The SCXML interpreter of this embodiment operates in essentially the same way as a general XML markup interpreter.

The SCXML execution engine 250 uses the Java object model generated by the SCXML interpreter 240 to execute the multi-modality parallel processing defined by the SCXML document. That is, like a general XML execution engine, the SCXML execution engine works by transitioning between states in response to events. When the SCXML markup script has been converted into a Java object model by the SCXML interpreter 240 and supplied to the engine, the engine executes the scripted scenario and processes the modality component sessions activated through the session manager 230 either sequentially or simultaneously in parallel.
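Event-driven execution of parallel states can be illustrated with a very small interpreter. This is a sketch in Python rather than the Java object model the embodiment describes, and it covers only two behaviours of an SCXML `<parallel>` element: each event is offered to every region, and the parallel state is done once every region has reached a final state. The class names, state names, and event names (`html.input`, `vxml.input`) are hypothetical.

```python
from typing import Dict, Set, Tuple

# (current state, event) -> next state
Transitions = Dict[Tuple[str, str], str]


class Region:
    """One sequential state machine, e.g. one modality component session."""

    def __init__(self, initial: str, finals: Set[str], transitions: Transitions):
        self.state = initial
        self.finals = finals
        self.transitions = transitions

    def dispatch(self, event: str) -> None:
        # Events with no matching transition leave the state unchanged.
        self.state = self.transitions.get((self.state, event), self.state)

    @property
    def done(self) -> bool:
        return self.state in self.finals


class ParallelState:
    """All regions are active at once and see every event."""

    def __init__(self, regions: Dict[str, Region]):
        self.regions = regions

    def dispatch(self, event: str) -> None:
        for region in self.regions.values():
            region.dispatch(event)

    @property
    def done(self) -> bool:
        return all(region.done for region in self.regions.values())


parallel = ParallelState({
    "html": Region("waiting", {"received"}, {("waiting", "html.input"): "received"}),
    "vxml": Region("waiting", {"received"}, {("waiting", "vxml.input"): "received"}),
})

parallel.dispatch("html.input")
done_after_first = parallel.done
parallel.dispatch("vxml.input")
done_after_second = parallel.done
```

One region per modality is what allows a click and an utterance to arrive in either order without either one blocking the other.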

Using the events received from the client and the profile information stored in the profile DB 260, the SCXML execution engine 250 either transitions the state of the Java object model, transmits the corresponding output content directly through the communication unit, or generates a request signal, sends it to a web server or an external bundle (OSGi module), generates output content from the returned data, and has it transmitted to the client through the communication unit.

The profile DB 260 is a database that stores and manages profile information.

The profile information, also referred to as DCI (Delivery Context Interfaces), includes device environment information and user preference information. Device environment information is information that changes dynamically in a mobile environment, such as GPS reception level, battery information, and network information. User preference information is information that each client can choose differently, such as view preference, preference for a particular modality, N-best candidate word selection, intelligence level, and the time limit for simultaneous input.
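One possible shape for a profile record is sketched below. All field names, types, and default values are assumptions chosen to mirror the items listed in the text; the patent does not define a schema.

```python
from dataclasses import dataclass, field


@dataclass
class DeviceEnvironment:
    """Values that change dynamically in a mobile environment."""
    gps_signal: float      # hypothetical scale from 0.0 (none) to 1.0 (full)
    battery_percent: int
    network: str


@dataclass
class UserPreferences:
    """Values each client may choose differently."""
    view: str = "map"
    preferred_modality: str = "voice"
    n_best: int = 3                  # number of candidate words to offer
    intelligence_level: int = 1
    fusion_timeout_s: float = 2.0    # time limit for simultaneous input


@dataclass
class Profile:
    device: DeviceEnvironment
    prefs: UserPreferences = field(default_factory=UserPreferences)


profile = Profile(DeviceEnvironment(gps_signal=0.8, battery_percent=55, network="wifi"))
timeout_for_fusion = profile.prefs.fusion_timeout_s
```

Separating the two groups reflects their different update patterns: device values would be refreshed continuously, while preferences change only when the user edits them.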

In this embodiment, once the SCXML interpreter 240 of the multimodal server 200 has compiled the SCXML document, generated the Java object model, and started the SCXML execution engine, the engine follows the scenario in the SCXML document and instructs each modality component, such as the HTML and VXML browsers, to parse the URL given in the SCXML document. The client then brings the HTML or VXML browser into a ready state in which input and output are possible and notifies the user.

More specifically, input content and event information of the voice modality are converted into EMMA data by the VXML browser and transmitted to the multimodal server, and non-voice input content and event information, such as pen or key input, are converted into EMMA data by the HTML browser and transmitted directly to the multimodal server. The multimodal server then processes the multi-modality input content in parallel according to the multi-modality parallel processing scenario scripted in the SCXML document and drives the client's VXML browser or HTML browser to deliver the output content. This multi-modality parallel processing is described below with reference to FIG. 9.

First, when a client connects, a user session is created (910), the SCXML script is compiled to generate a Java object model corresponding to that user session and an SC session is created (920), and the modality component sessions supported by the client (an HTML session and a VXML session) are created (930).

Next, the HTML session and the VXML session are activated in parallel, which activates the client's HTML browser and VXML browser, and each session enters a state of waiting for user input (950, 960).

When input information of either modality is received in parallel in step 950 or step 960, it is determined whether the received input information is partial information (951, 961).

Here, partial information is information that has no complete meaning by itself and can be processed only after being combined with partial information from the other modality. Complete information is information that has a complete meaning by itself and can be processed without information from another modality.

For example, suppose the user clicks a point on a map displayed in the HTML browser, entering click information through the HTML browser, and at the same time says "zoom in" or "zoom out", entering voice information through the VXML browser. The click information and the voice information arriving from the HTML and VXML browsers are then each partial information.

If the determination (951, 961) finds that the received input information is complete information, the parallel state is ended and the received input information is processed individually (960).

If the determination (951, 961) finds that the received input information is partial information, the session enters a state of waiting for input from the other modality (952, 962), and it is determined whether input information of the other modality has been received within a predetermined timeout period (953, 963).

When partial information of the other modality is received, the parallel state is ended and the partial information received from the two different modalities is fused and processed (970, 980).
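The partial/complete decision and the fusion step can be sketched as a single pass over time-ordered inputs. The `ModalInput` type, the payload dictionaries, and the merge-by-dictionary-union fusion rule are assumptions for illustration. The text does not say what happens when the timeout expires, so this sketch simply discards the stale partial input, which is also an assumption.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ModalInput:
    modality: str                # e.g. "html" or "vxml"
    payload: Dict[str, object]
    partial: bool                # True if it needs the other modality to be meaningful
    time_s: float                # arrival time in seconds


def process(inputs: List[ModalInput], timeout_s: float) -> List[Dict[str, object]]:
    """Return the commands produced from a time-ordered list of inputs."""
    commands: List[Dict[str, object]] = []
    pending: Optional[ModalInput] = None
    for item in inputs:
        # Assumption: a partial input whose partner is late is dropped.
        if pending is not None and item.time_s - pending.time_s > timeout_s:
            pending = None
        if not item.partial:
            # Complete information is processed on its own.
            commands.append(dict(item.payload))
        elif pending is not None and pending.modality != item.modality:
            # Two partial inputs from different modalities are fused.
            commands.append({**pending.payload, **item.payload})
            pending = None
        else:
            # First partial input: wait for the other modality.
            pending = item
    return commands


commands = process(
    [
        ModalInput("html", {"x": 120, "y": 48}, partial=True, time_s=0.0),
        ModalInput("vxml", {"action": "zoom_in"}, partial=True, time_s=0.6),
        ModalInput("vxml", {"action": "go_home"}, partial=False, time_s=5.0),
        ModalInput("html", {"x": 10, "y": 10}, partial=True, time_s=10.0),
        ModalInput("vxml", {"action": "zoom_out"}, partial=True, time_s=14.0),
    ],
    timeout_s=2.0,
)
```

The first click and "zoom in" arrive 0.6 s apart and are fused into one command; the last click and "zoom out" arrive 4 s apart, so with a 2 s timeout they are not combined.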

Multimodal output can be handled in the same way: as many parallel states as there are modalities are initialized and processed with essentially the same algorithm as the flow above, so that each modality can be output independently through the speech synthesizer or the display device. Because this is self-evident, a separate description is omitted.

The embodiment above assumes input devices for two modalities, a voice modality and a non-voice modality through a keypad. This is only one example. Input devices of many modalities may be used with the multimodal system of the present invention, including voice input, keyboard input, mouse input, pen input, touch screen input, gesture input, and eye movement input.

The method of the present invention described above may be implemented as a program and stored in computer-readable form on a recording medium (CD-ROM, RAM, ROM, floppy disk, hard disk, magneto-optical disk, and the like). Because those of ordinary skill in the art to which the present invention pertains can readily carry this out, it is not described in further detail.

Those of ordinary skill in the art to which the present invention pertains can make various substitutions, modifications, and changes without departing from the technical spirit of the present invention, so the present invention is not limited by the embodiments described above or by the accompanying drawings.

As described above, the present invention uses the SCXML markup language to process voice and non-voice modality input/output information simultaneously and in parallel.

In addition, because sequential or simultaneous multimodal input/output commands can be handled easily with a markup language, the present invention can offer users and application developers a more convenient and efficient user interface.

Furthermore, by adopting SCXML, a language similar to the VXML and HTML scenario description languages used on the client, the present invention lets a service operator run a multimodal system under essentially the same structure, preserving independence and autonomy without changing the functions or operating scope of the existing client modalities.

Claims

A multimodal system that is connected over a network to a client using different modalities and processes information of the different modalities in parallel, the system comprising:

a markup language interpreter for interpreting a predetermined state chart markup language document; and

a markup language execution unit for simultaneously processing the information of the different modalities by activating in parallel, according to the interpreted state chart markup language document, as many states as the number of modalities used by the client.

The multimodal system of claim 1,

wherein the markup language is State Chart XML (SCXML), a state chart markup language that supports parallel state processing.

The multimodal system of claim 1,

wherein the different modalities are a voice modality and a non-voice modality.

The multimodal system of claim 3,

wherein the markup language execution unit

receives voice modality information and non-voice modality information from the client in parallel and, when received modality information is partial information, fuses it with the input information of the other modality and processes the result.

The multimodal system of claim 4,

wherein the modality information is received from the client in the form of EMMA data.

The multimodal system of claim 1, further comprising:

a session manager for creating and managing a session corresponding to a connected client according to input information from the connected client and under the control of the markup language execution unit.

The multimodal system of claim 6,

wherein the session comprises:

a user session corresponding to each connected client;

a state chart session created according to the markup language as a sub-session of the user session; and

a modality component session that is a sub-session of the state chart session and corresponds to a modality component used by the connected client.

The multimodal system of claim 7,

wherein the modality component session

comprises a voice modality session and a non-voice modality session.

A multimodal input information processing method for simultaneously processing information of different modalities in parallel, the method comprising:

a step of creating a user session for a connected client;

a parallel state step of creating, based on a state chart markup language and as sub-sessions of the user session, as many modality component sessions as the number of modalities used by the client, and activating them in parallel; and

a receiving step in which each of the modality component sessions activated in the parallel state step receives input information of a different modality in parallel.

The method of claim 9, further comprising:

a step of determining whether the input information received in the receiving step is partial information; and

a fusion processing step of, when the determination finds that the input information is partial information, fusing it with the input information of another modality and processing the result.

The method of claim 10, further comprising:

an individual processing step of, when the determination finds that the input information is complete information, processing it individually regardless of the input information of other modalities.