KR20080051479A

KR20080051479A - Apparatus and method for processing multimodal fusion

Info

Publication number: KR20080051479A
Application number: KR1020060122665A
Authority: KR
Inventors: 이동우; 손용기; 김지은; 선우존; 조일연
Original assignee: 한국전자통신연구원
Priority date: 2006-12-05
Filing date: 2006-12-05
Publication date: 2008-06-11
Also published as: KR100860407B1

Abstract

An apparatus and method for processing multimodal fusion are provided that allow fusion of multiple types of user input modalities to be inferred easily in a system that includes multiple modality recognition engines, even when the system is compact and low-specification. An input processing unit (110) receives single or multiple types of user input recognition information through speech and gesture recognizers (210, 220) and processes it into information usable for inference. A rule storage unit (150) parses and stores modality combination rule information in XML (eXtensible Markup Language) form. An inference engine unit (120) infers the action corresponding to the processed user input recognition information by referring to the modality combination rule information. The XML-form modality combination rule information includes a first element that defines one action, a second element that lists the modalities accepted as input to the action, a third element that specifies how the listed input modalities are meaningfully combined, a fourth element that lists the commands of a modality, a fifth element that refers to one of those commands, and a sixth element that combines several actions.

Description

Apparatus and method for processing multimodal fusion

FIG. 1 is a block diagram of a multimodal fusion unit according to a preferred embodiment of the present invention.

FIG. 2 is a flowchart of the modality fusion method of the multimodal fusion unit according to a preferred embodiment of the present invention.

* Description of reference numerals for the main parts of the drawings *

100: multimodal fusion unit 110: input processing unit

120: inference engine unit 130: verification unit

140: rule storage unit 150: feedback generation unit

210: speech recognizer 220: gesture recognizer

The present invention relates to a multimodal fusion processing method and apparatus, and more particularly to a multimodal fusion processing method and apparatus that allow fusion of multiple types of modalities to be performed easily in a system.

Advances in technology are currently improving the performance of recognition engines for various modalities, and interest in multimodal input/output is growing.

Systems such as general-purpose computers, wearable computers, wireless communication terminals, and portable multimedia players previously had difficulty implementing multimodal input/output, because complex, low-performance modality recognition engines could not recognize user input accurately. Recognition engines with improved performance have made such implementations easier.

Such a system must also include a multimodal fusion system to infer the intent of the user input recognized by the modality recognition engines. In a typical current multimodal fusion system, however, the process of inferring user intent is very complex. Intent inference therefore generally requires a high-specification server-class system, as well as a dialogue management system based on natural language processing and a vast knowledge information system.

Consequently, even though the performance of recognition engines for multiple modalities has improved, multimodal input/output remains difficult to implement on miniaturized or low-specification systems, because the multimodal fusion system they would need requires a high-specification system and a vast knowledge information system.

To solve this problem, techniques that let a multimodal input/output system recognize user input in multiple modalities without a high-specification server-class system are under development, but no such recognition technique has yet been clearly defined.

A first object of the present invention, devised to solve the above problems, is to provide a multimodal fusion processing method and apparatus that allow fusion inference over multiple types of user input modalities to be performed easily and simply in a system that includes multiple modality recognition engines.

A second object of the present invention is to provide a multimodal fusion processing method and apparatus that allow fusion of multiple modalities to be implemented in a miniaturized, low-specification system.

To achieve the above objects, a multimodal fusion apparatus of the present invention includes: an input processing unit that receives single or multiple types of user input recognition information through the corresponding recognizers and processes it into information usable for inference; a rule storage unit that parses and stores modality combination rule information; and an inference engine unit that infers an action corresponding to the processed user input recognition information by referring to the modality combination rule information.

The rule storage unit parses modality combination rule information in XML form.

The XML-form modality combination rule information includes at least one of: a first element (action) that defines one action; a second element (input) that lists the modalities that can be input to the action; a third element (integration) that specifies how the listed input modalities are meaningfully combined for the action; a fourth element (command) that lists the commands of a modality; a fifth element (modality) that refers to one of the commands of the modality; and a sixth element (set) for combining several actions.

The third element includes, as a child element, at least one of an element indicating that only one of the combination conditions over the listed input modalities needs to be satisfied and an element indicating that all of the conditions must be satisfied. The fourth element includes, as a child element, an element (item) representing the modality command list. The sixth element includes, as a child element, at least one of an element (sequence), which defines the order of previously inferred actions and has as its own child an element (actionname) representing each action, and an element (time) indicating the meaningful maximum input time for the actions.

The first element includes, as an attribute, at least one of the name of the action (name), a state value indicating whether the action is multi or single, a subname for distinguishing actions that share a name (subname), and a value indicating whether the action can be used as system input (send). The third element includes, as an attribute, a weight value (weight) used in the weighting calculation. The fourth element includes, as an attribute, at least one of a name attribute value (name) and an input modality mode (mode). The sixth element includes, as an attribute, at least one of an order value indicating the input order of the several actions and a time value.

The action is a system command of the system.

The inference engine unit refers to the modality combination rule information and previously generated actions to determine whether the processed user input recognition information can be fused and needs to be fused, and, if it can and does, infers the new action.

Preferably, the multimodal fusion apparatus further includes a result storage unit that stores the actions generated by the inference engine unit in chronological order and provides the stored actions as information for inference by the inference engine unit.

Preferably, the multimodal fusion apparatus further includes a verification unit that, when an action is generated by the inference engine unit, verifies whether the action is valid, passes the action to the system input if it is judged valid, and generates action error information if it is judged invalid.

Preferably, the multimodal fusion apparatus further includes a feedback generation unit that converts the generated action error information into information for notifying the user and passes it to the system output.

To achieve the above objects, a multimodal fusion method of the present invention includes: parsing and storing modality combination rule information at system initialization; when single or multiple types of user input recognition information are input from the corresponding recognizers, processing that information into information usable for action inference and storing it; and inferring an action corresponding to the processed user input recognition information by referring to the modality combination rule information.

The parsing step parses modality combination rule information in XML form.

The step of inferring the action refers to the modality combination rule information and previously generated actions to determine, in sequence, whether the processed user input recognition information can be fused and needs to be fused, and infers the action if it is judged to be both fusable and in need of fusion.

Preferably, the multimodal fusion method further includes: verifying whether the inferred action is valid; passing the action to the system input if the verification judges it valid, and generating action error information if the verification judges it invalid; and converting the generated action error information into information for notifying the user and passing it to the system output.

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings so that a person of ordinary skill in the art can readily practice the invention. In describing the operating principles of the preferred embodiments, however, detailed descriptions of related well-known functions or configurations are omitted where they could unnecessarily obscure the gist of the invention.

In addition, the same reference numerals are used throughout the drawings for parts with similar functions and operations.

The present invention, described below, is implemented in the multimodal interface device of a system that includes multiple types of modality recognition engines, and fuses single or multiple modality inputs to infer a system command.

FIG. 1 is a block diagram of a multimodal fusion unit according to a preferred embodiment of the present invention.

As shown in FIG. 1, the multimodal fusion unit (100) may include an input processing unit (110), an inference engine unit (120), a result storage unit (130), a verification unit (140), a rule storage unit (150), and a feedback generation unit (170).

In the multimodal fusion unit (100) configured in this way, the input processing unit (110) receives recognition information corresponding to user input such as voice and gestures through the corresponding recognizers (210, 220) and processes it so that the inference engine unit (120) can use it.

That is, the input processing unit (110) processes the user input recognition information provided through the speech recognizer (210), the gesture recognizer (220), and so on into modality input information that includes an input start event, an input end event, and a result value event. The input start event contains the start point of the user input, the input end event contains its end point, and the result value event contains the recognition result of the user input.
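The three-event structure described above can be pictured with a short Python sketch. It is only an illustration under stated assumptions: the class `ModalityEvent`, the function `to_modality_events`, the field names, and the millisecond timestamps are hypothetical and do not come from the patent text.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ModalityEvent:
    """One event in the modality input information (hypothetical layout)."""
    kind: str          # "start", "end", or "result"
    mode: str          # input modality, e.g. "voice" or "gesture"
    timestamp_ms: int  # assumed unit: milliseconds
    value: str = ""    # recognition result; only set for "result" events


def to_modality_events(mode: str, start_ms: int, end_ms: int,
                       result: str) -> List[ModalityEvent]:
    """Turn one recognizer output into start, end, and result events."""
    if end_ms < start_ms:
        raise ValueError("input cannot end before it starts")
    return [
        ModalityEvent("start", mode, start_ms),
        ModalityEvent("end", mode, end_ms),
        ModalityEvent("result", mode, end_ms, result),
    ]


# Example: a voice command recognized between t=1000 ms and t=1800 ms.
events = to_modality_events("voice", 1000, 1800, "재생")
```

Keeping the start and end points separate from the result lets a downstream component reason about timing before the recognition result is available, which is one plausible reason for the three-event split.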

The inference engine unit (120) determines what the single or multiple user inputs, such as voice and gestures, entered into the multimodal fusion unit (100) mean, and generates the system command that corresponds to that determination. That is, the inference engine unit (120) converts the single or multiple user inputs, identified from the modality input start events, input end events, and result value events supplied by the input processing unit (110), into a single or fused system command.

The inference engine unit (120) infers the system command corresponding to the current user input by referring to the modality combination rule information preset according to the present invention and to previously generated system commands.

In the present invention, a system command inferred by the inference engine unit (120) is called an action (ACTION). Actions can be defined in various ways depending on the system to which the modality fusion unit is applied. For example, when applied to a system that controls a media player, actions may be defined as STOP, PLAY, PAUSE, and so on; when applied to a system that transfers data between systems, actions may be defined as SELECT, MOVE, DELETE, and so on.

The result storage unit (130) stores the actions inferred by the inference engine unit (120) in chronological order and provides the stored actions to the inference process of the inference engine unit (120).

The verification unit (140) verifies the actions inferred by the inference engine unit (120). If an error occurs while verifying an action, the verification unit (140) passes error information for that action to the feedback generation unit (170); if verification completes without error, it passes the action to the system input.

The feedback generation unit (170) defines how the action error information received from the verification unit (140) is to be presented to the user and passes it to the system output.
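The routing between the verification unit, the system input, and the feedback generation unit can be sketched as below. This is a minimal illustration, not the patent's implementation: the text does not state what makes an action valid, so the sketch assumes an action is valid when its name is in a known set. The function name `verify_and_route` and the error message format are hypothetical.

```python
from typing import Callable, List, Set


def verify_and_route(action: str,
                     known_actions: Set[str],
                     to_system: Callable[[str], None],
                     to_feedback: Callable[[str], None]) -> bool:
    """Send a valid action to the system input, otherwise report an error.

    Assumption: validity means membership in known_actions.
    """
    if action in known_actions:
        to_system(action)
        return True
    to_feedback(f"unrecognized action: {action}")
    return False


# Example wiring with plain lists standing in for system input and output.
system_inputs: List[str] = []
feedback_messages: List[str] = []
media_actions = {"STOP", "PLAY", "PAUSE"}

verify_and_route("PLAY", media_actions,
                 system_inputs.append, feedback_messages.append)
verify_and_route("REWIND", media_actions,
                 system_inputs.append, feedback_messages.append)
```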

The rule storage unit (150) parses and stores the modality combination rules supplied by the system and provides the inference engine unit (120) with the modality combination rule information needed for action inference.

The system therefore presets and stores the modality combination rules to be provided to the multimodal fusion unit (100), and can allow users and manufacturers to add to or change the stored rules according to the characteristics of the system. In other words, the system allows the modality combination rules to be added to or changed easily in case a new input modality is added or the system is expanded or maintained.

In one embodiment, the present invention defines the modality combination rules as an XML document. Because this XML recognizes and supports every kind of modality input, such as voice and gestures, it is referred to as ActionXML. In other embodiments, the rules may be defined as a document in a markup language other than XML, or in a language that is not a markup language, but only XML is described here.

Table 1 shows the ActionXML elements defined according to an embodiment of the present invention.

[Table 1]

| Element | Description | Remarks |
|---|---|---|
| adxml | The ActionXML element. | |
| action | Defines one action; the top-level element. | Child elements: input, integration |
| input | Lists the modalities that can be input to the action. | Child element: command |
| integration | Specifies the meaningful ways of combining the input modalities listed in input for the action. | Child elements: or, and |
| command | Child of input; lists the commands of the modality. | Child element: item |
| item | An entry in the modality command list. | |
| or | Child of integration, used to express modality combination rules; indicates that only one of the conditions given as its child elements needs to be satisfied. | Child elements: modality, or, and |
| and | Child of integration, used to express modality combination rules; indicates that all of its child elements must be included. | Child elements: modality, or, and |
| modality | Refers to one of the command elements under input. | |
| set | Combines several actions according to the attributes of the action element; has sequence as a child element. | Child elements: sequence, time |
| sequence | Child of set; defines the order of the actions received as user input, according to the attributes of the action element. | Child element: actionname |
| time | Child of set; the meaningful maximum input time for the actions listed under set. | |
| actionname | Child of sequence; refers to an individual action. | |

As shown in Table 1, ActionXML defines elements such as adxml, action, input, integration, command, item, or, and, modality, set, sequence, time, and actionname, and modality combination rules are written as ActionXML documents using these elements.

Table 2 below shows the attributes of the ActionXML elements in one embodiment.

[Table 2]

| Element | Attribute | Description |
|---|---|---|
| adxml | version (optional) | Version of ActionXML. |
| adxml | encoding (optional) | Character encoding of the document. |
| action | name (required) | Name of the action. |
| action | type (optional) | 'multi' or 'single'; the default is 'single'. |
| action | subname (optional) | Additional name used to distinguish actions that share a name. |
| action | send (optional) | Whether the action can be used as system input; 'yes' or 'no', default 'yes'. |
| command | name (required) | Name attribute value; a modality element can refer to the command by this name. |
| command | mode (required) | Input modality mode; takes values such as 'voice' and 'gesture'. |
| or | weight (optional) | Weight value used in the weighting calculation. |
| and | weight (optional) | Weight value used in the weighting calculation. |
| modality | weight (optional) | Weight value used in the weighting calculation. |
| modality | c-name (required) | Refers to the name of a command element. |
| sequence | value (required) | Numeric value indicating the order. |
| time | value (required) | Threshold time value, in milliseconds. |
| actionname | a-name (required) | Value that refers to an action. |
| actionname | a-subname (optional) | Value that refers to the subname of an action. |
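To make the set, sequence, time, and actionname entries more concrete, the following Python sketch checks whether the most recent actions match a set rule. It is a sketch under stated assumptions: the element and attribute names follow the tables, but the sample `set` fragment, the one-actionname-per-sequence nesting, the 3000 ms threshold, and the matching logic in `set_matches` are all hypothetical, since the text does not spell out how the rule is evaluated.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional, Tuple

# Hypothetical set rule: select a source object, then a target, within 3 s.
SET_XML = """
<set>
  <sequence value="1"><actionname a-name="SELECTOBJECT" a-subname="FROM"/></sequence>
  <sequence value="2"><actionname a-name="SELECTOBJECT" a-subname="TO"/></sequence>
  <time value="3000"/>
</set>
"""

# (action name, subname, timestamp in ms), oldest first
Action = Tuple[str, Optional[str], int]


def set_matches(set_element: ET.Element, history: List[Action]) -> bool:
    """Return True if the latest actions match the rule's order and time limit."""
    steps = sorted(set_element.findall("sequence"),
                   key=lambda step: int(step.get("value")))
    expected = []
    for step in steps:
        ref = step.find("actionname")
        expected.append((ref.get("a-name"), ref.get("a-subname")))
    max_ms = int(set_element.find("time").get("value"))

    if not expected or len(history) < len(expected):
        return False
    recent = history[-len(expected):]
    observed = [(name, subname) for name, subname, _ in recent]
    elapsed = recent[-1][2] - recent[0][2]
    return observed == expected and elapsed <= max_ms


rule = ET.fromstring(SET_XML)
in_time = [("SELECTOBJECT", "FROM", 0), ("SELECTOBJECT", "TO", 2500)]
too_slow = [("SELECTOBJECT", "FROM", 0), ("SELECTOBJECT", "TO", 3500)]
```

Here `set_matches(rule, in_time)` succeeds because both actions arrive in order within the threshold, while `too_slow` fails the time check.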

Next, the operation flow of the multimodal fusion unit (100) configured in this way is described in detail. When the system is initialized, the input processing unit (110) of the multimodal fusion unit (100) initializes itself by deleting any stored user input recognition information previously received from the one or more recognizers (210, 220), and the rule storage unit (150) parses and stores the ActionXML file.
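As a rough sketch of the parsing step, the following Python code reads an ActionXML-style document with the standard `xml.etree.ElementTree` module. The sample document is hypothetical: its element and attribute names follow Tables 1 and 2, but its nesting and values are assumptions rather than the patent's own file. The function `parse_rules` and the dictionary layout it returns are likewise illustrative only.

```python
import xml.etree.ElementTree as ET
from typing import Any, Dict

# Hypothetical ActionXML document built from the names in Tables 1 and 2.
SAMPLE_ACTIONXML = """
<adxml version="1.0">
  <action name="PLAY" type="single">
    <input>
      <command name="voice1" mode="voice">
        <item>재생</item>
      </command>
      <command name="gesture1" mode="gesture">
        <item>Circle</item>
      </command>
    </input>
    <integration>
      <or>
        <modality c-name="voice1"/>
        <and>
          <modality c-name="voice1"/>
          <modality c-name="gesture1"/>
        </and>
      </or>
    </integration>
  </action>
</adxml>
"""


def parse_rules(xml_text: str) -> Dict[str, Dict[str, Any]]:
    """Parse the document into a dictionary keyed by action name."""
    root = ET.fromstring(xml_text)
    rules: Dict[str, Dict[str, Any]] = {}
    for action in root.iter("action"):
        commands = {}
        for command in action.findall("./input/command"):
            commands[command.get("name")] = {
                "mode": command.get("mode"),
                "items": [item.text for item in command.findall("item")],
            }
        rules[action.get("name")] = {
            # Defaults taken from Table 2: type 'single', send 'yes'.
            "type": action.get("type", "single"),
            "send": action.get("send", "yes"),
            "commands": commands,
            "integration": action.find("integration"),
        }
    return rules


rules = parse_rules(SAMPLE_ACTIONXML)
```

Parsing once at initialization and keeping the result in memory matches the stated goal of running on small, low-specification systems, since no rule file needs to be re-read during inference.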

Once initialized, when recognition information corresponding to user input such as voice or a gesture arrives from the corresponding recognizers (210, 220), the input processing unit (110) processes it into modality input information that includes an input start event, an input end event, and a result value event, and outputs it to the inference engine unit (120).

The inference engine unit (120) then refers to the modality combination rule information in the rule storage unit (150) and determines, in sequence, whether the incoming modality input information can be fused and whether it needs to be fused. That is, the inference engine unit determines whether the single or multiple user inputs are inputs from which an action can be inferred under the modality combination rules, and whether they are inputs for which action inference is needed.

If the inference engine unit (120) judges that the modality input information can be fused and needs to be fused, it infers a new action by referring to the modality combination rule information in the rule storage unit (150) and the existing action information in the result storage unit (130). If it judges that the modality input information cannot be fused or does not need to be fused, it performs no action inference and suspends operation until new modality input information arrives.
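The sequential check described above can be outlined as follows. This is a minimal sketch only: the text does not say how "can be fused" and "needs to be fused" are decided, so `can_fuse`, `needs_fusion`, and `infer_action` are hypothetical callables supplied by the caller, and `inference_step` is an illustrative name.

```python
from typing import Callable, List, Optional


def inference_step(
    info: str,
    history: List[str],
    can_fuse: Callable[[str], bool],
    needs_fusion: Callable[[str], bool],
    infer_action: Callable[[str, List[str]], str],
) -> Optional[str]:
    """Run the two checks in order; return a new action, or None to keep waiting."""
    if not can_fuse(info):
        return None
    if not needs_fusion(info):
        return None
    action = infer_action(info, history)
    history.append(action)  # stored in the order the actions were inferred
    return action


# Example with toy predicates: a lookup table stands in for the rule storage.
known = {"재생": "PLAY"}
history: List[str] = []
inferred = inference_step("재생", history,
                          lambda s: s in known,
                          lambda s: True,
                          lambda s, past: known[s])
skipped = inference_step("unknown", history,
                         lambda s: s in known,
                         lambda s: True,
                         lambda s, past: known[s])
```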

When a new action arrives from the inference engine unit (120), the verification unit (140) verifies it. If the action is judged invalid, or if an error occurs during verification, the verification unit outputs action error information to the feedback generation unit (170); if the action is judged valid, it sends the action to the system input.

The system can then identify the action passed to the system input and provide the user with the corresponding output.

When action error information arrives from the verification unit (140), the feedback generation unit (170) defines how that information is to be presented to the user and passes it to the system output, so that the user can see that a problem occurred with the user input.

Next, the modality rule information preset through an ActionXML document is examined through an example, together with how an action is inferred from it.

<?xml version="1.0" encoding="ksc5601"?>
      <item>재생</item>
    </command>
      <item>Circle</item>
    </command>
      <item>ToRight</item>
      <item>SpinRight</item>
    </command>
  </input>
      </and>
      </and>
    </or>
  </integration>
</action>

상기 ActionXML 문서는, 일 실시예로 "실행", "재생" 또는 "플레이"라는 음성과, "Circle", "ToRight" 또는 "SpinRight"로 각각 정의되는 제스처가 설정된 조합에 따라 사용자로부터 입력되는 경우, "PLAY"라는 액션이 추론되도록 하는 모달 리티 조합 규칙을 제시하고 있다.In one embodiment, the ActionXML document is input from the user according to a combination of a voice defined as “execute”, “play” or “play”, and a gesture defined as “Circle”, “ToRight” or “SpinRight”, respectively. In addition, we present a modality combination rule that allows an action called "PLAY" to be inferred.

자세히 살펴보면, 상기 ActionXML 문서는 input 엘러먼트를 이용하여 "PLAY" 액션이 추론되기 위한 입력 모달리티로 voice, gesture1 및 gesture2를 나열하고, 나열된 입력 모달리티 중 voice 명령어로는 "실행", "재생" 및 "플레이"를, gesture1 명령어로는 "Circle"를, gesture2 명령어로는 "ToRight" 및 "SpinRight"를 item 엘러먼트로 정의하며, 상기 정의된 입력 모달리티들의 조합이 integration 엘러먼트를 통해 정의되도록 하고 있다.In detail, the ActionXML document lists voice, gesture1, and gesture2 as input modalities for inferring a "PLAY" action using an input element, and the voice commands among the listed input modals are "execute", "play", and " Play "," Circle "as the gesture1 command," ToRight "and" SpinRight "as the gesture2 command are defined as item elements, and the combination of the input modalities defined above is defined through an integration element.

That is, according to the ActionXML document, "PLAY" can be inferred as a system command in any of the following cases, given the input modalities voice1, gesture1, and gesture2: voice1 or gesture1 is input on its own, voice1 and gesture1 are input in combination, or voice1 and gesture2 are input in combination.

Accordingly, on a system equipped with a microphone and a touch screen, the user can speak "실행", "재생", or "플레이" alone, or speak one of these commands while performing a gesture corresponding to "Circle", "ToRight", or "SpinRight", to tell the system that the user requests playback of a video, music, or the like.

Meanwhile, the ActionXML document, that is, the modality combination rule, uses the modality element to define the combination of voice1 and gesture1 as having the highest reliability and the combination of voice1 and gesture2 as having the lowest reliability.
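The "PLAY" rule described above can be illustrated with a minimal Python sketch. The XML below is a hypothetical reconstruction, not the patent's listing: the element names come from the description, but the attribute names (`name`, `type`, `mode`, `weight`), the placement of `modality` leaves inside `integration`, and the numeric weights are assumptions chosen only so that voice1 with gesture1 ranks highest and voice1 with gesture2 ranks lowest. The matching policy (a branch matches when its required modalities equal the active ones) is likewise an assumption.

```python
import xml.etree.ElementTree as ET

# Hypothetical reconstruction of a PLAY rule; attribute names and weights are assumed.
ACTION_XML = """
<adxml>
  <action name="PLAY" type="single">
    <input>
      <command name="voice1" mode="voice">
        <item>실행</item><item>재생</item><item>플레이</item>
      </command>
      <command name="gesture1" mode="gesture"><item>Circle</item></command>
      <command name="gesture2" mode="gesture">
        <item>ToRight</item><item>SpinRight</item>
      </command>
    </input>
    <integration>
      <or>
        <modality name="voice1" weight="0.6"/>
        <modality name="gesture1" weight="0.6"/>
        <and weight="1.0"><modality name="voice1"/><modality name="gesture1"/></and>
        <and weight="0.4"><modality name="voice1"/><modality name="gesture2"/></and>
      </or>
    </integration>
  </action>
</adxml>
"""

def alternatives(node):
    """Expand an or/and/modality tree into (required modality set, weight) pairs."""
    weight = float(node.get("weight", "1.0"))
    if node.tag == "modality":
        return [(frozenset([node.get("name")]), weight)]
    child_alts = [alternatives(child) for child in node]
    if node.tag == "or":
        # Any one child branch is enough; each keeps its own weight.
        return [alt for alts in child_alts for alt in alts]
    if node.tag == "and":
        # All children are required; the weight of the <and> node applies.
        combos = [(frozenset(), weight)]
        for alts in child_alts:
            combos = [(req | r, w) for req, w in combos for r, _ in alts]
        return combos
    raise ValueError(f"unexpected element: {node.tag}")

def infer_action(root, recognized):
    """Return (action name, weight) for a set of recognized words, or None."""
    best = None
    for action in root.iter("action"):
        # A modality is active when one of its command items was recognized.
        active = frozenset(
            cmd.get("name")
            for cmd in action.find("input")
            if any(item.text in recognized for item in cmd)
        )
        for required, weight in alternatives(action.find("integration")[0]):
            if required == active and (best is None or weight > best[1]):
                best = (action.get("name"), weight)
    return best

ROOT = ET.fromstring(ACTION_XML.strip())
print(infer_action(ROOT, {"재생", "Circle"}))
print(infer_action(ROOT, {"플레이", "ToRight"}))
```

Under these assumptions, a voice command with a "Circle" gesture yields "PLAY" with the highest weight, while gesture2 on its own yields no action, matching the combinations listed in the text.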

In another embodiment, the following ActionXML document presents modality combination rules for inferring "SELECTOBJECT", "SELECTOBJECT(FROM)", "SELECTOBJECT(TO)", "MOVE", and "MOVE(TOOBJEC)". It is structured in the same way as the ActionXML document for inferring "PLAY".

That is, for "SELECTOBJECT", the following ActionXML document causes the system command to be inferred when voice1 ("이것" (this) or "선택" (select)) or voice2 ("저것" (that) or "선택" (select)) is input independently, or when voice1 or voice2 is input in combination with gesture1 ("Pointing"). For "SELECTOBJECT(FROM)", the system command is inferred when voice1 ("이것을" (this one) or "저것을" (that one)) is input in combination with gesture1 ("Pointing"). For "SELECTOBJECT(TO)", the system command is inferred when voice1 ("여기로" (to here), "이곳으로" (to this place), "저기로" (to there), or "저곳으로" (to that place)) is input in combination with gesture1 ("Pointing").

In addition, for "MOVE", the following ActionXML document causes the system command to be inferred when only voice1 ("이동해" (move it) or "이동" (move)) is input. For "MOVE(TOOBJEC)", the system command is inferred only when voice1 ("이동해" or "이동") is input in combination with gesture1 ("Pointing").

<item>선택</item>

</command>

<item>선택</item>

</command>

<item>Pointing</item>

</command>

</input>

</and>

</and>

</or>

</integration>

</action>

</command>

<item>Pointing</item>

</command>

</input>

</and>

</or>

</integration>

</action>

<item>저기로</item>

<item>저곳으로</item>

</command>

<item>Pointing</item>

</command>

</input>

</and>

</or>

</integration>

</action>

</command>

</input>

</or>

</integration>

</action>

</command>

<item>Pointing</item>

</command>

</input>

</and>

</or>

</integration>

</action>
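The five rules described before this listing share vocabulary ("선택" belongs to both voice1 and voice2, and "이동" appears in both MOVE rules), so it may help to see how they can be told apart. The following Python sketch restates the rules as plain data, using only the commands and combinations named in the description. The matching policy is an assumption: a combination matches when each of its modalities received a word and no recognized word falls outside it.

```python
# Rules restated from the description; the data layout itself is hypothetical.
POINT = {"Pointing"}
RULES = {
    "SELECTOBJECT": (
        {"voice1": {"이것", "선택"}, "voice2": {"저것", "선택"}, "gesture1": POINT},
        [{"voice1"}, {"voice2"}, {"voice1", "gesture1"}, {"voice2", "gesture1"}],
    ),
    "SELECTOBJECT(FROM)": (
        {"voice1": {"이것을", "저것을"}, "gesture1": POINT},
        [{"voice1", "gesture1"}],
    ),
    "SELECTOBJECT(TO)": (
        {"voice1": {"여기로", "이곳으로", "저기로", "저곳으로"}, "gesture1": POINT},
        [{"voice1", "gesture1"}],
    ),
    "MOVE": (
        {"voice1": {"이동해", "이동"}},
        [{"voice1"}],
    ),
    "MOVE(TOOBJEC)": (
        {"voice1": {"이동해", "이동"}, "gesture1": POINT},
        [{"voice1", "gesture1"}],
    ),
}

def matching_actions(words):
    """Return the actions whose rule accepts exactly this set of recognized words."""
    matches = []
    for action, (vocab, combos) in RULES.items():
        for combo in combos:
            covered = set().union(*(vocab[m] for m in combo))
            every_modality_present = all(vocab[m] & words for m in combo)
            # Assumed policy: no recognized word may be left unexplained.
            if every_modality_present and words <= covered:
                matches.append(action)
                break
    return matches

print(matching_actions({"이동"}))
print(matching_actions({"이동해", "Pointing"}))
```

With this policy, a spoken "이동" alone selects only MOVE, while the same word accompanied by a pointing gesture selects only MOVE(TOOBJEC), which mirrors the "only when input in combination" wording of the description.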

Meanwhile, the following ActionXML document is a modality combination rule for inferring an action in yet another embodiment. It refers to the actions of the ActionXML documents described above so that a new action, "MOVE2OBJECT", is inferred.

That is, the following ActionXML document is an example of a modality combination rule for an action that can be inferred when one or more actions are input in succession.

In detail, the following ActionXML document sets the action element attribute to "multi" and then uses the set and sequence elements so that the "MOVE2OBJECT" action is inferred when user modalities corresponding to the "SELECTOBJECT(FROM)", "SELECTOBJECT(TO)", and "MOVE" actions are input in that order, or when user modalities corresponding to the "SELECTOBJECT(FROM)" and "MOVE(TOOBJEC)" actions are input in that order.

Accordingly, the inference engine unit 120 of the multimodal fusion device 100 infers the action corresponding to the current user input by referring to the existing action information in the result storage unit 130 together with the modality combination rule information in the rule storage unit 150.

Here, the inference engine unit 120 of the multimodal fusion device 100 refers only to actions generated within a preset time when inferring the action corresponding to the current user input.

</sequence>

</sequence>

</sequence>

</set>

</sequence>

</sequence>

</set>

</integration>

</action>

</adxml>
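The "multi" action described before this listing can be sketched in Python as a lookup over the history of previously inferred actions. The two sequences are taken from the description; the `Inferred` record, the `MAX_AGE` value of 5 seconds, and the policy of matching the most recent actions against each sequence are assumptions, since the text only states that actions generated within a preset time are referenced.

```python
from dataclasses import dataclass

@dataclass
class Inferred:
    name: str    # name of a previously inferred action
    time: float  # time at which it was inferred, in seconds

# Sequences from the description; the dictionary layout is hypothetical.
MULTI_RULES = {
    "MOVE2OBJECT": [
        ["SELECTOBJECT(FROM)", "SELECTOBJECT(TO)", "MOVE"],
        ["SELECTOBJECT(FROM)", "MOVE(TOOBJEC)"],
    ],
}
MAX_AGE = 5.0  # assumed preset time window, in seconds

def infer_multi(history, now):
    """Infer a multi action from a chronologically ordered action history."""
    # Only actions generated within the preset time are considered.
    recent = [a.name for a in history if now - a.time <= MAX_AGE]
    for action, sequences in MULTI_RULES.items():
        for seq in sequences:
            # The most recent actions must match the sequence in order.
            if len(recent) >= len(seq) and recent[-len(seq):] == seq:
                return action
    return None

history = [
    Inferred("SELECTOBJECT(FROM)", 0.0),
    Inferred("SELECTOBJECT(TO)", 1.5),
    Inferred("MOVE", 3.0),
]
print(infer_multi(history, now=3.0))
```

If the final "MOVE" arrived after the earlier selections had aged out of the window, the sketch would return no action, which is one plausible reading of why the result storage keeps actions in chronological order.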

As seen in the above embodiments, the multimodal fusion device 100 according to the present invention parses and stores the modality combination rules of ActionXML documents from the system in order to infer one or more actions, and then infers the action for a single or multiple types of user input by referring to the stored modality combination rule information and previously generated action information.

Next, the modality fusion method of the multimodal fusion device 100 for a single or multiple types of user input is described with reference to the accompanying drawings.

FIG. 2 is a flowchart illustrating the modality fusion method of the multimodal fusion device 100 according to a preferred embodiment of the present invention.

Referring to FIG. 2, when power is first applied to the system or a running system is rebooted, the multimodal fusion device 100 deletes previously stored user modality input information, then parses and stores the modality rule information contained in the ActionXML document (S101).

The multimodal fusion device 100 then checks whether user input recognition information, such as voice or gesture, has been input through one or more recognizers (S102).

If user input recognition information has been input from a recognizer, the multimodal fusion device 100 processes it into modality input information from which an action can be inferred and stores it (S103). It then refers to the parsed modality combination rule information and previously generated action information to determine, in order, whether the modality input information can be fused and whether it needs to be fused (S104, S105).

If the modality input information is determined to be fusible and to need fusion, the multimodal fusion device 100 infers the action corresponding to the user input by referring to the modality combination rule information and previously generated action information (S106).

However, if the modality input information is determined to be not fusible or not in need of fusion, the multimodal fusion device 100 stops action inference for that modality input information and checks whether new user input recognition information is input from one or more recognizers (S102).

Meanwhile, after inferring an action for given modality input information, the multimodal fusion device 100 verifies the inferred action once more to confirm whether it is valid (S107) and, if it is valid, delivers the inferred action to the system input (S108).

However, if verification determines that the inferred action is not valid, or if an error occurs during verification, the multimodal fusion device 100 defines a way of notifying the user that the user input corresponding to the action was incorrect and delivers it to the system output, so that the user, having seen the notification, performs the input again (S109).
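Steps S102 to S109 can be summarized as a small control loop. The Python sketch below is illustrative only: the five callables, the toy vocabularies, and the choice to discard stored inputs after each inference are assumptions, and rule parsing in S101 is reduced to starting with an empty input store.

```python
def run_fusion(events, can_fuse, needs_fusion, infer, is_valid):
    """Walk recognizer events through steps S102 to S109 of FIG. 2.

    Returns (system_inputs, system_outputs). The callables are hypothetical
    stand-ins for the rule-driven checks described in the text.
    """
    pending = []                       # S101: input store starts empty
    system_inputs, system_outputs = [], []
    for event in events:               # S102: a recognition result arrived
        pending.append(event)          # S103: store the processed input
        # S104, S105: checked in this order; otherwise wait for more input
        if not (can_fuse(pending) and needs_fusion(pending)):
            continue
        action = infer(pending)        # S106: infer the action
        if is_valid(action):           # S107: verify the inferred action
            system_inputs.append(action)                        # S108
        else:
            system_outputs.append(f"invalid action: {action}")  # S109
        pending = []                   # assumption: consumed inputs are dropped
    return system_inputs, system_outputs

# Toy stand-ins so the loop can be exercised; not taken from the patent.
VOICE = {"실행", "재생", "플레이"}
GESTURE = {"Circle", "ToRight", "SpinRight"}

def can_fuse(pending):
    return all(word in VOICE | GESTURE for word in pending)

def needs_fusion(pending):
    return any(word in VOICE for word in pending)

def infer(pending):
    return "PLAY" if len(pending) <= 2 else "UNKNOWN"

def is_valid(action):
    return action in {"PLAY"}

events = ["Circle", "재생", "ToRight", "SpinRight", "플레이"]
print(run_fusion(events, can_fuse, needs_fusion, infer, is_valid))
```

In this toy run the first two events produce a valid action that goes to the system input, and the last three produce an action that fails verification and is reported on the system output instead.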

It will be apparent to those of ordinary skill in the art to which the present invention pertains that the invention described above is not limited by the foregoing embodiments and the accompanying drawings, and that various substitutions, modifications, and changes may be made without departing from the technical spirit of the invention.

In particular, although the invention has been described with modality combination rules for the fusion of voice and gesture input as one embodiment, the modality combination rules are applicable to various other modality inputs in other embodiments.

The multimodal fusion processing method and device according to the present invention, as described above, allow the system command corresponding to a single or multiple types of user input to be inferred through ActionXML modality combination rule information defined to recognize and support all kinds of modality input, such as voice and gesture. As a result, fusion of multiple modalities can be performed easily and simply even on a compact, low-specification system.

In addition, with the multimodal fusion processing method and device according to the present invention, a user or manufacturer who wants to add a new modality or a new command only needs to modify the ActionXML document containing the modality combination rules. This simplifies system maintenance, makes it easy to add new input modalities, and makes the system easier to extend and develop.

Claims

A multimodal fusion device for a system having one or more modality recognizers, the device comprising:

an input processing unit that receives single or multiple types of user input recognition information through corresponding recognizers and processes the information into inferable information;

a rule storage unit that parses and stores modality combination rule information; and

an inference engine unit that infers an action corresponding to the processed user input recognition information with reference to the modality combination rule information.

The device of claim 1,

wherein the rule storage unit

parses the modality combination rule information in XML form.

The device of claim 2,

wherein the modality combination rule information in XML form

includes at least one of: a first element defining one action; a second element listing the modalities that can be input for the action; a third element presenting a method for meaningfully combining the listed input modalities for the action; a fourth element listing the commands of a corresponding modality; a fifth element indicating one of the commands of the modality; and a sixth element (set) combining several actions.

The device of claim 3, wherein:

the third element includes, as a child element, at least one of an element indicating that only one of the combination conditions over the listed input modalities needs to be satisfied and an element indicating that all of the conditions must be satisfied;

the fourth element includes, as a child element, an element representing the list of commands of the modality; and

the sixth element includes, as a child element, at least one of an element defining an order of previously inferred actions, an element representing each individual action, and an element (time) representing a meaningful maximum input time for each action.

The device of claim 3, wherein:

the first element includes, as an attribute, at least one of a name of the corresponding action, a state value indicating whether the action is multi or single, a sub-name for distinguishing actions of the same name, and a value indicating whether the action can be used as a system input;

the third element includes, as an attribute, a weight value for calculating a weighting;

the fourth element includes, as an attribute, at least one of a name attribute value (name) and an input modality (mode); and

the sixth element includes, as an attribute, at least one of an order value representing the input order of the plurality of actions and a time value.

The device of claim 1,

wherein the action

is a system command of the system.

The device of claim 1,

wherein the inference engine unit

determines, with reference to the modality combination rule information and previously generated actions, whether the processed user input recognition information can be fused and needs to be fused, and infers a new action when the information is determined to be fusible and in need of fusion.

The device of claim 1,

further comprising a result storage unit that stores the actions generated by the inference engine unit in chronological order and provides the stored actions as information for inference by the inference engine unit.

The device of claim 1,

further comprising a verification unit that, when an action is generated by the inference engine unit, verifies whether the action is valid, delivers the action to a system input if the action is determined to be valid, and generates action error information if the action is determined to be invalid.

The device of claim 9,

further comprising a feedback generation unit that converts the generated action error information into information for notifying a user and delivers the result to a system output.

A multimodal fusion method for a system having one or more modality recognizers, the method comprising:

parsing and storing modality combination rule information at system initialization;

when single or multiple types of user input recognition information are input from corresponding recognizers, processing the information into information from which an action can be inferred and storing it; and

inferring an action corresponding to the processed user input recognition information with reference to the modality combination rule information.

The method of claim 11,

wherein the parsing step

parses the modality combination rule information in XML form.

The method of claim 12,

wherein the modality combination rule information in XML form

includes at least one of: a first element defining one action; a second element listing the modalities that can be input for the action; a third element presenting a method for meaningfully combining the listed input modalities for the action; a fourth element listing the commands of a corresponding modality; a fifth element indicating one of the commands of the modality; and a sixth element (set) combining several actions.

The method of claim 11,

wherein the action

is a system command of the system.

The method of claim 11,

wherein inferring the action comprises

determining in order, with reference to the modality combination rule information and previously generated actions, whether the processed user input recognition information can be fused and needs to be fused, and inferring the action when the information is determined to be fusible and in need of fusion.

The method of claim 10,

further comprising:

verifying whether the inferred action is valid;

delivering the action to a system input if the inferred action is determined to be valid, and generating action error information if it is determined to be invalid; and

converting the generated action error information into information for notifying a user and delivering it to a system output.