KR20170098573A

KR20170098573A - Multi-modal learning device and multi-modal learning method

Info

Publication number: KR20170098573A
Application number: KR1020160020665A
Authority: KR
Inventors: 정상근
Original assignee: 에스케이텔레콤 주식회사
Priority date: 2016-02-22
Filing date: 2016-02-22
Publication date: 2017-08-30
Also published as: KR102360246B1

Abstract

The present invention relates to a multimodal learning apparatus and a multimodal learning method that, for learning based on multimodal recognition, propose a multimodal embedding in which the recognition result of each modality is expressed consistently in a single space, thereby creating an environment in which reasonable and smart multimodal recognition is possible without the cost of alignment/synchronization work between modalities. The multimodal learning apparatus comprises: a multimodal recognition unit; a task confirmation unit; a multimodal embedding processing unit; and a multimodal learning unit.

Description

MULTI-MODAL LEARNING DEVICE AND MULTI-MODAL LEARNING METHOD

The present invention relates to multimodal learning and, more specifically, to a scheme that, for learning based on multimodal recognition, proposes a multimodal embedding in which the recognition result of each modality is expressed consistently in a single space, thereby creating an environment in which reasonable and smart multimodal recognition is possible without the cost of alignment/synchronization work between modalities.

Recently, interest has been growing in robots and software that use artificial intelligence capable of helping or living alongside people and of recognizing, thinking, and judging as a person does, and research on such artificial intelligence is actively under way.

The most central part of this research is arguably the learning of new knowledge (information).

For an artificial intelligence (hereinafter, a computer) to think and judge like a person, the initially designed knowledge (information) alone is not sufficient. New knowledge (information) accumulates over time, and a method is needed that lets the computer acquire (learn) such knowledge (information) as a person does.

However, existing learning methods rely on learning results limited to the knowledge (information) taught by the user, and are therefore not as smart as expected.

In addition, when knowledge (information) is taught to a computer through multiple modalities (e.g., speech recognition, image recognition, natural language recognition), existing learning methods embed the recognition data obtained through the speech recognition function, the image recognition function, and the natural language recognition function independently of one another (e.g., word embedding), so that each is expressed in a different space. Alignment/synchronization between modalities is therefore mandatory, which in turn requires a vast amount of reference information linking the modalities and a high computational load.

Accordingly, the present invention proposes an efficient new learning (self-learning) scheme for learning based on multimodal recognition that remedies the above drawbacks of existing learning methods.

The present invention has been made in view of the above circumstances, and its object is to provide a multimodal learning apparatus and a multimodal learning method that, for learning based on multimodal recognition, propose a multimodal embedding in which the recognition result of each modality is expressed consistently in a single space, thereby creating an environment in which reasonable and smart multimodal recognition is possible.

To achieve the above object, a multimodal learning apparatus according to a first aspect of the present invention includes: a multimodal recognition unit that recognizes an object through two or more different recognition functions; a task confirmation unit that confirms whether the recognition of the object is a learning task; a multimodal embedding processing unit that, when a learning task is confirmed, combines the recognition data recognized through each of the two or more recognition functions and then embeds the combined data to determine a multimodal learning value for the object in a single vector space; and a multimodal learning unit that, based on the multimodal learning value of the object, learns, for each of the two or more recognition functions, a recognition vector as the recognition result for the object.

Preferably, the recognition vectors determined for the respective recognition functions may be vector values that are identical to one another in the vector space, or vector values whose differences fall within a predefined range in which they are regarded as identical.

Preferably, the apparatus may further include a data collection unit that collects associated recognition data related to at least one of the two or more recognition functions.

Preferably, the multimodal embedding processing unit may determine the multimodal learning value for the object based on a deep learning scheme that repeats the combining and embedding of the recognition data while sequentially replacing the recognition data of the at least one recognition function with the collected associated recognition data.

Preferably, the two or more recognition functions include an image recognition function, a speech recognition function, and a natural language recognition function, and the associated recognition data may include: for the image recognition function, associated images retrieved based on a word recognized through the natural language recognition function; for the speech recognition function, associated speech retrieved based on the word; and for the natural language recognition function, associated words retrieved based on the word.

Preferably, the learning result of the learning task is stored, and the apparatus may further include a recognition control unit that, when recognition data in which the object is recognized through a specific one of the two or more recognition functions is identified, extracts the recognition vector corresponding to that recognition data of the specific recognition function based on the learning result, and outputs, from the remaining recognition functions other than the specific recognition function, a recognition result corresponding to the extracted recognition vector.
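The behavior of the recognition control unit can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration and does not appear in the original text: the `LEARNED` store, the three-dimensional vectors, and the use of cosine similarity to pick the stored object closest to the extracted recognition vector.

```python
import math

# Hypothetical stored learning result: one shared-space recognition vector per
# learned object, plus the output each remaining recognition function emits.
LEARNED = {
    "apple": {"vector": [0.9, 0.1, 0.0], "speech": "apple (spoken)", "word": "apple"},
    "banana": {"vector": [0.0, 0.8, 0.6], "speech": "banana (spoken)", "word": "banana"},
}


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def recognize(query_vector):
    """Return the speech and word outputs of the closest learned object.

    query_vector stands for the recognition vector extracted from one
    recognition function (for example, image recognition).
    """
    best = max(LEARNED.values(), key=lambda entry: cosine(query_vector, entry["vector"]))
    return best["speech"], best["word"]
```

Because all recognition functions share one vector space in this sketch, the lookup needs no cross-reference table between modalities; a nearest-vector search is only one possible way to realize that.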

To achieve the above object, a multimodal learning method according to a second aspect of the present invention includes: a multimodal recognition step of recognizing an object through two or more different recognition functions; a task confirmation step of confirming whether the recognition of the object is a learning task; a multimodal embedding step of, when a learning task is confirmed, combining the recognition data recognized through each of the two or more recognition functions and then embedding the combined data to determine a multimodal learning value for the object in a single vector space; and a multimodal learning step of learning, based on the multimodal learning value of the object, a recognition vector as the recognition result for the object for each of the two or more recognition functions.

Preferably, the recognition vectors determined for the respective recognition functions may be vector values that are identical to one another in the vector space, or vector values whose differences fall within a predefined range in which they are regarded as identical.

Preferably, the method further includes a step of collecting associated recognition data related to at least one of the two or more recognition functions, and the multimodal embedding step may determine the multimodal learning value for the object based on a deep learning scheme that repeats the combining and embedding of the recognition data while sequentially replacing the recognition data of the at least one recognition function with the collected associated recognition data.

Preferably, the method may further include: a step of storing the learning result of the learning task; and a recognition step of, when recognition data in which the object is recognized through a specific one of the two or more recognition functions is identified, extracting the recognition vector corresponding to that recognition data of the specific recognition function based on the learning result, and outputting, from the remaining recognition functions other than the specific recognition function, a recognition result corresponding to the extracted recognition vector.

Thus, the multimodal learning apparatus and multimodal learning method according to the present invention propose, for learning based on multimodal recognition, a multimodal embedding in which the recognition result of each modality is expressed consistently in a single space, and thereby achieve the effect of creating an environment in which reasonable and smart multimodal recognition is possible without the cost of alignment/synchronization work between modalities.

FIG. 1 is an exemplary diagram showing a conventional, general multimodal learning process.
FIG. 2 is a block diagram showing the configuration of a multimodal learning apparatus according to a preferred embodiment of the present invention.
FIG. 3 and FIG. 4 are exemplary diagrams showing recognition data and associated recognition data in the multimodal learning process according to the present invention.
FIG. 5 is an exemplary diagram showing a multimodal learning process according to a preferred embodiment of the present invention.
FIG. 6 is a flowchart showing the flow of a multimodal learning method according to a preferred embodiment of the present invention.

Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings.

Before the present invention is described in detail with reference to the drawings, the technical field to which it applies is described first.

The present invention relates to learning based on multimodal recognition, for example multimodal learning based on speech, images, and natural language.

For an artificial intelligence (hereinafter, a computer) to recognize, think, and judge like a person, the initially designed knowledge (information) alone is not sufficient. New knowledge (information) accumulates over time, and a method is needed that lets the computer acquire (learn) such knowledge (information) as a person does.

In addition, when knowledge (information) is taught through multiple modalities (e.g., speech recognition, image recognition, natural language recognition), existing learning methods embed the recognition data obtained through the speech recognition function, the image recognition function, and the natural language recognition function independently of one another (e.g., word embedding), so that each is expressed in a different space. Alignment/synchronization between modalities is therefore mandatory, which in turn requires a vast amount of reference information linking the modalities and a high computational load.

Hereinafter, a conventional, general multimodal learning process is briefly described with reference to FIG. 1.

For convenience of description, the speech recognition function, the image recognition function, and the natural language recognition function are used below as the two or more recognition functions for multimodal recognition.

Suppose that a user points at an object, for example an apple, and says "This is an apple."

The computer's speech recognition function then recognizes the user's utterance "This is an apple," the image recognition function recognizes image_apple, the image the user is pointing at in the captured video, and the natural language recognition function processes the recognized utterance ("This is an apple") as natural language and then recognizes word_apple, the word to be learned.

In the following, the recognition data recognized through the speech recognition function is described as the speech "apple" synchronized with word_apple, the recognition data recognized through the image recognition function as image_apple, and the recognition data recognized through the natural language recognition function as word_apple.

Conventionally, as shown in FIG. 1, the recognition data "apple" recognized through the speech recognition function, the recognition data image_apple recognized through the image recognition function, and the recognition data word_apple recognized through the natural language recognition function are each embedded independently (e.g., word embedding).

The independent recognition symbols A, B, and C obtained by embedding for each of the speech, image, and natural language recognition functions are what has been learned as the recognition result for the current object, the apple.

Because of the independent embedding, recognition symbols A, B, and C are values expressed in different spaces, in other words values of different forms.

Therefore, to later recognize the current object, the apple, through the modalities, that is, the speech, image, and natural language recognition functions, a procedure is needed that links the independent recognition symbols A, B, and C, which are expressed in the different spaces of the respective functions, through reference relationships. This procedure is the alignment/synchronization between modalities.

Later, when the user points at the same object, that is, the same apple, and asks "What is this?", the computer extracts recognition symbol B, which was learned as the recognition result of image_apple, from the recognition data recognized through its image recognition function. From the result of the earlier alignment/synchronization between modalities (the reference information), it then finds recognition symbols A and/or C, which are expressed in spaces different from that of recognition symbol B and are linked to it by a reference relationship, and outputs the speech "apple" and/or the natural-language word "apple" corresponding to recognition symbols A and/or C as the recognition result for "What is this?". In this way it can identify the apple.

In this case, however, if the user teaches the apple as described above and then shows a differently shaped apple while asking "What is this?", the computer may fail to identify the apple because the input does not match what was learned.
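The conventional flow described above can be sketched in Python as follows. The three symbol tables, the `REFERENCE_INFO` table, and the exact-match lookup are hypothetical stand-ins chosen for illustration; the original text does not specify how the recognition symbols or the reference information are stored.

```python
# Each modality keeps its own symbol table (its own "space").
SPEECH_SYMBOLS = {"A": "apple (spoken)"}
IMAGE_SYMBOLS = {"B": "image_apple"}
WORD_SYMBOLS = {"C": "word_apple"}

# Reference information produced by alignment/synchronization between
# modalities: it links image symbol B to speech symbol A and word symbol C.
REFERENCE_INFO = {"B": {"speech": "A", "word": "C"}}


def answer_from_image(image_data):
    """Look up the image symbol, then follow the reference links."""
    symbol = next((s for s, data in IMAGE_SYMBOLS.items() if data == image_data), None)
    if symbol is None:
        # The input does not match anything that was learned.
        return None
    links = REFERENCE_INFO[symbol]
    return SPEECH_SYMBOLS[links["speech"]], WORD_SYMBOLS[links["word"]]
```

In this sketch every learned object adds an entry to `REFERENCE_INFO`, and a differently shaped apple (any input other than the stored `image_apple`) returns `None`, which mirrors the two drawbacks the text attributes to the conventional method.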

That is, with the existing multimodal learning method, multimodal recognition is not as smart as expected.

Moreover, as FIG. 1 shows, in the existing multimodal learning method the recognition data "apple" recognized through the speech recognition function, the recognition data image_apple recognized through the image recognition function, and the recognition data word_apple recognized through the natural language recognition function are each embedded independently (e.g., word embedding) and expressed in different spaces, so alignment/synchronization between modalities is mandatory.

As a result, when a vast amount of knowledge (information) is taught, the existing multimodal learning method goes through the above learning process for all of that knowledge and performs the alignment/synchronization between modalities a great many times.

Furthermore, multimodal recognition that follows from the existing multimodal learning method must go through the results of those many alignment/synchronization operations, that is, reference information of enormous size.

For these reasons, the existing multimodal learning method has the drawback of requiring vast reference information and a high computational load.

Accordingly, the present invention proposes a new learning (self-learning) scheme for learning based on multimodal recognition that remedies the above drawbacks of the existing learning method.

More specifically, the present invention starts from two observations: the existing lack of smartness is caused by limited learning results, and the need for vast reference information and a high computational load is caused by the fact that independent embedding for each of the speech, image, and natural language recognition functions yields values expressed in different spaces (values of different forms). By proposing a new learning (self-learning) scheme that remedies both, it aims to create an environment in which reasonable and smart multimodal recognition is possible without the cost of alignment/synchronization work between modalities.

Hereinafter, the configuration of a multimodal learning apparatus according to a preferred embodiment of the present invention is described in detail with reference to FIG. 2.

As shown in FIG. 2, the multimodal learning apparatus 100 according to the present invention includes: a multimodal recognition unit 110 that recognizes an object through two or more different recognition functions; a task confirmation unit 120 that confirms whether the recognition of the object is a learning task; a multimodal embedding processing unit 130 that, when a learning task is confirmed, combines the recognition data recognized through each of the two or more recognition functions and then embeds the combined data to determine a multimodal learning value for the object in a single vector space; and a multimodal learning unit 140 that, based on the multimodal learning value of the object, learns, for each of the two or more recognition functions, a recognition vector as the recognition result for the object.

The multimodal recognition unit 110 recognizes an object through two or more different recognition functions.

In the following, the speech recognition function, the image recognition function, and the natural language recognition function are used as the two or more recognition functions.

That is, the multimodal recognition unit 110 recognizes an object through two or more different recognition functions, namely the speech recognition function, the image recognition function, and the natural language recognition function.

For example, suppose the user points at an object, for example an apple, and says "This is called an apple." The multimodal recognition unit 110 recognizes the user's utterance "This is called an apple" through the speech recognition function, recognizes image_apple, the image the user is pointing at in the captured video, through the image recognition function, and, through the natural language recognition function, processes the recognized utterance ("This is called an apple") as natural language and then recognizes word_apple, the word to be learned or recognized.

The task confirmation unit 120 confirms whether the recognition of the object is a learning task.

Specifically, the task confirmation unit 120 analyzes the sentence processed by the natural language recognition function ("This is called an apple") and, if it determines that the sentence has the form of teaching some object, can confirm that the current recognition of the object is a learning task.
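As a rough illustration of this check, the Python sketch below classifies a sentence as a learning task when it matches a teaching pattern. The regular expression, the function name `confirm_task`, and the English phrasing of the patterns are assumptions for illustration only; the original text does not say how the sentence analysis is carried out.

```python
import re

# Hypothetical teaching patterns such as "This is an apple" or
# "This is called an apple"; a real system would need a far richer analysis.
TEACHING_PATTERNS = [
    re.compile(r"^this is (?:called )?(?:an? )?(?P<obj>\w+)\.?$", re.IGNORECASE),
]


def confirm_task(sentence):
    """Return ("learning", object_word) or ("recognition", None)."""
    text = sentence.strip()
    for pattern in TEACHING_PATTERNS:
        match = pattern.match(text)
        if match:
            return "learning", match.group("obj").lower()
    return "recognition", None
```

A question such as "What is this?" matches no teaching pattern in this sketch and is therefore treated as a recognition request rather than a learning task.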

Accordingly, as shown in FIG. 3, the following description assumes that the recognition data recognized through the natural language recognition function of the multimodal recognition unit 110 is word_apple, that the recognition data recognized through the speech recognition function is the speech "apple" synchronized with word_apple, and that the recognition data recognized through the image recognition function is image_apple. It further assumes that the task confirmation unit 120 has confirmed, as a result of its analysis, a learning task that teaches a new object, the apple.

When the task confirmation unit 120 confirms a learning task, the multimodal embedding processing unit 130 combines the recognition data recognized through each of the two or more recognition functions, that is, the speech, image, and natural language recognition functions, and then embeds the combined data to determine a multimodal learning value for the object, the apple, in a single vector space.

Specifically, the multimodal embedding processing unit 130 combines the recognition data recognized through each of the speech, image, and natural language recognition functions of the multimodal recognition unit 110, that is, the recognition data "apple" recognized through the speech recognition function, the recognition data image_apple recognized through the image recognition function, and the recognition data word_apple recognized through the natural language recognition function.

Referring to FIG. 5, the recognition data "apple" (hereinafter, recognition data (A)), the recognition data image_apple (hereinafter, recognition data (B)), and the recognition data word_apple (hereinafter, recognition data (C)) are combined (A+B+C).

Although FIG. 5 combines them in the order A+B+C, the order of combination may be changed freely.
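One way to read the combination A+B+C is as a concatenation of per-modality feature vectors. The Python sketch below assumes exactly that; the function name `combine` and the example feature values are hypothetical, and the original text does not state whether "combining" means concatenation.

```python
def combine(speech_features, image_features, word_features):
    """Concatenate per-modality feature lists into one combined input X."""
    return list(speech_features) + list(image_features) + list(word_features)


# Hypothetical feature values for recognition data (A), (B), and (C).
A = [0.2, 0.7]
B = [0.1, 0.9, 0.4]
C = [1.0]

X = combine(A, B, C)          # order A+B+C
X_alt = combine(B, A, C)      # a different order holds the same values
```

Under this reading, any order carries the same information, but whichever order is chosen would presumably need to stay fixed between learning and recognition so that each position of X keeps its meaning.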

After combining the recognition data recognized through each of the speech, image, and natural language recognition functions, the multimodal embedding processing unit 130 embeds the combined recognition data (A+B+C).

The embedding performed by the multimodal embedding processing unit 130 may adopt one, or two or more, of various existing embedding methods, including word embedding; a detailed description is omitted.

For example, the multimodal embedding processing unit 130 may embed the combined recognition data (A+B+C) based on the embedding function of Equation 1 below.

Equation 1

$$f(W \cdot X + b)$$

Hereinafter, the embedding performed by the multimodal embedding processing unit 130 is referred to as multimodal embedding.

Here, W denotes the weight matrix learned for the multimodal embedding, X denotes the combined recognition data (A+B+C), and b denotes the bias learned for the multimodal embedding.
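The embedding function can be sketched in plain Python as an affine map followed by an element-wise function f. The choice of `math.tanh` for f, the function name `multimodal_embed`, and the example values of W, X, and b are assumptions for illustration; the original text does not define f or give any numeric values.

```python
import math


def multimodal_embed(W, X, b, f=math.tanh):
    """Compute f(W . X + b) for a weight matrix W (list of rows),
    a combined input vector X, and a bias vector b."""
    return [
        f(sum(w * x for w, x in zip(row, X)) + bias)
        for row, bias in zip(W, b)
    ]


# Hypothetical learned parameters mapping a 3-dimensional combined input
# to a 2-dimensional multimodal learning value.
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 1.0]]
b = [0.0, -0.5]
X = [0.5, 0.25, 0.25]

learning_value = multimodal_embed(W, X, b)
```

The result is a single vector in one space regardless of how many modalities contributed to X, which is the property the text relies on.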

By applying the multimodal embedding to the combined recognition data (A+B+C) as described above, the multimodal embedding processing unit 130 determines a multimodal learning value for the current object, the apple, in a single vector space.

That is, the multimodal embedding processing unit 130 can take the result of applying the multimodal embedding to the combined recognition data (A+B+C) as the multimodal learning value for the current object, the apple, which is a single value in a single vector space.

Based on the multimodal learning value of the current object, the apple, the multimodal learning unit 140 learns, for each of the two or more recognition functions, that is, the speech, image, and natural language recognition functions, a recognition vector as the recognition result for the object.

In other words, the multimodal learning unit 140 learns a recognition vector as the recognition result for the apple for each of the speech, image, and natural language recognition functions, based on the single multimodal learning value obtained by applying the multimodal embedding to the combined recognition data (A+B+C).

For convenience of description, it is assumed below that recognition vectors (A), (B), and (C) have been learned for the speech, image, and natural language recognition functions, respectively.

In this case, recognition vectors (A), (B), and (C) are vectors learned from a single multimodal learning value expressed in a single vector space, and are preferably vector values that are identical to one another in that space or vector values whose differences fall within a predefined range in which they are regarded as identical.
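The condition on recognition vectors (A), (B), and (C) can be checked with a short Python sketch. Measuring the difference as Euclidean distance and the name `within_same_range` are assumptions for illustration; the original text does not say how the difference or the predefined range is defined.

```python
import math


def within_same_range(vectors, tolerance):
    """Return True if every pair of vectors differs by at most `tolerance`
    (Euclidean distance), i.e. they can be regarded as the same value."""
    return all(
        math.dist(u, v) <= tolerance
        for i, u in enumerate(vectors)
        for v in vectors[i + 1:]
    )


# Hypothetical recognition vectors for speech (A), image (B), and word (C).
vec_a = [1.0, 0.0]
vec_b = [1.0, 0.0]
vec_c = [1.0, 0.05]

same = within_same_range([vec_a, vec_b, vec_c], tolerance=0.1)
```

A tolerance of zero corresponds to the "identical vector values" case, and a positive tolerance to the "within a predefined range" case.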

In other words, recognition vectors (A), (B), and (C) are learned as the recognition result for the current object, the apple, based on the single multimodal learning value obtained by applying the new embedding scheme proposed by the present invention, the multimodal embedding scheme, to the combined recognition data (A+B+C). They are therefore identical vector values (or vector values that can be regarded as identical) of the same form (vectors) mapped onto a single vector space.

Here, the multimodal learning unit 140 may learn the recognition vectors (A), (B), and (C) for the voice, image, and natural language recognition functions by vectorizing the multimodal learning value as it is, or by vectorizing it through a specific learning algorithm.
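The two options can be sketched as follows. This is only an illustration under stated assumptions: the text does not name the learning algorithm, so plain gradient descent on squared error stands in for it here, and the function names, learning rate, and step count are hypothetical.

```python
def vectorize_as_is(learning_value, modalities):
    """Option 1: use the multimodal learning value directly as each recognition vector."""
    return {m: list(learning_value) for m in modalities}

def vectorize_by_training(learning_value, initial, lr=0.25, steps=20):
    """Option 2 (assumed algorithm): pull each modality's vector toward the
    shared learning value by gradient descent on the squared error."""
    vectors = {m: list(v) for m, v in initial.items()}
    for _ in range(steps):
        for vec in vectors.values():
            for i, target in enumerate(learning_value):
                # Gradient of (vec[i] - target)^2 is 2 * (vec[i] - target).
                vec[i] -= lr * 2 * (vec[i] - target)
    return vectors

learning_value = [0.5, 0.2]
print(vectorize_as_is(learning_value, ["voice", "image", "text"]))
print(vectorize_by_training(
    learning_value,
    {"voice": [0.0, 0.0], "image": [1.0, 1.0], "text": [0.3, 0.9]},
))
```

With the first option the three vectors are exactly equal; with the second they converge toward the same value and end up within a small difference of one another.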

That is, any method may be used as long as it can learn, from one multimodal learning value, recognition vectors (A), (B), and (C) that have the same vector value (or values that can be regarded as the same) in one vector space for the voice, image, and natural language recognition functions.

The multimodal learning apparatus 100 then stores the learning result of the learning task for the current object, the apple.

As described above, the present invention obtains a single multimodal learning value through a new multimodal embedding scheme that combines and embeds the recognition data recognized in each modality (voice, image, natural language), and on that basis learns, as the recognition result of each modality, a recognition vector mapped onto one vector space. The work of linking the modalities to one another, that is, the alignment/synchronization work between modalities, therefore becomes unnecessary.

The multimodal learning apparatus 100 of the present invention may further include a data collection unit 150 and a recognition control unit 160.

The data collection unit 150 collects association recognition data related to at least one of the two or more recognition functions of the multimodal recognition unit 110, that is, the voice, image, and natural language recognition functions.

Here, the at least one recognition function means the recognition function, among the modalities (voice, image, natural language), for which association recognition data is collected.

Accordingly, when related to the image recognition function, the association recognition data may include associated images retrieved on the basis of a word (e.g., apple) recognized through the natural language recognition function, for example various image_apple instances.

Alternatively, when related to the voice recognition function, the association recognition data may include associated voices retrieved on the basis of the word (e.g., apple), for example the word as spoken in various foreign languages (e.g., "apple" in English).

Alternatively, when related to the natural language recognition function, the association recognition data may include associated words retrieved on the basis of the word (e.g., apple), for example the word in various foreign languages (e.g., Apple).

The data collection unit 150 may search for and collect the association recognition data directly from web data on the basis of the word (e.g., apple) recognized through the natural language recognition function, or may collect it through a separate search source.

Hereinafter, for convenience of explanation, it is assumed, as in FIG. 4, that association recognition data related to the image recognition function, namely various image_apple instances, has been collected on the basis of the recognition data (C), i.e., the word_apple, recognized through the natural language recognition function.

That is, the data collection unit 150 collects association recognition data related to the at least one recognition function, i.e., the image recognition function, namely various image_apple instances (hereinafter, recognition data (B'), (B"), ...).
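The collection step can be pictured as a lookup keyed by the recognized word and the target recognition function. The sketch below is an assumption-laden illustration: the in-memory dictionary stands in for web data or another search source, and the file names and the function `collect_association_data` are hypothetical.

```python
from typing import Dict, List

# Hypothetical stand-in for web data or a separate search source.
SEARCH_SOURCE: Dict[str, Dict[str, List[str]]] = {
    "apple": {
        "image": ["apple_red.jpg", "apple_green.jpg", "apple_sliced.jpg"],
        "voice": ["apple_en.wav"],
        "text": ["Apple"],
    }
}

def collect_association_data(word: str, modality: str,
                             source: Dict[str, Dict[str, List[str]]] = SEARCH_SOURCE) -> List[str]:
    """Return the items associated with `word` for one recognition function.

    An unknown word or modality yields an empty list rather than an error.
    """
    return list(source.get(word, {}).get(modality, []))

# Recognition data (B'), (B"), ... for the image recognition function.
print(collect_association_data("apple", "image"))
```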

When association recognition data has been collected in this way, the multimodal embedding processing unit 130 determines the multimodal learning value for the current object, the apple, on the basis of a deep learning scheme that repeats the recognition data combining and embedding process described above while sequentially replacing the recognition data of the at least one recognition function, i.e., the image recognition function, with the collected association recognition data (the various image_apple instances).

That is, as shown in FIG. 4, the multimodal embedding processing unit 130 determines the multimodal learning value for the apple by performing deep learning that repeats the multimodal embedding process proposed in the present invention while sequentially replacing the recognition data (B) of the image recognition function with the collected association recognition data, i.e., recognition data (B'), (B"), and so on.
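The replace-and-repeat loop can be sketched as below. Several parts are assumptions and are not taken from the patent: concatenation is used for combining the data, `embed` is a placeholder and is not the embedding function of Equation 1, and a simple average stands in for the deep learning step that produces one multimodal learning value.

```python
import math

def combine(a, b, c):
    """Combine recognition data (A+B+C); concatenation is assumed here."""
    return a + b + c

def embed(combined, dim=2):
    """Placeholder embedding (NOT Equation 1): fold the combined vector
    into `dim` slots and squash each slot with tanh."""
    slots = [0.0] * dim
    for i, x in enumerate(combined):
        slots[i % dim] += x
    return [math.tanh(s) for s in slots]

def multimodal_learning_value(a, b_variants, c):
    """Repeat combine + embed while swapping the image data (B, B', B", ...),
    then average the embeddings as a stand-in for training."""
    embeddings = [embed(combine(a, b, c)) for b in b_variants]
    n = len(embeddings)
    return [sum(e[i] for e in embeddings) / n for i in range(len(embeddings[0]))]

# Hypothetical feature vectors for the voice (A), image (B, B', B") and word (C).
a, c = [0.2, 0.1], [0.6, 0.0]
b_variants = [[0.4, 0.3], [0.5, 0.3], [0.3, 0.2]]
print(multimodal_learning_value(a, b_variants, c))
```

Only the image data changes between iterations, so the voice and word data anchor every embedding to the same object while the varied images broaden what that object can look like.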

Thereafter, as described above, the multimodal learning unit 140 learns the recognition vectors (A), (B), and (C) as the recognition results for the apple for the voice, image, and natural language recognition functions, based on the multimodal learning value of the apple.

When recognition data in which the aforementioned object, the apple, has been recognized through a specific recognition function among the two or more recognition functions (voice, image, natural language) is identified, the recognition control unit 160 may extract the recognition vector corresponding to the recognition data of the specific recognition function on the basis of the previously stored learning result, and output, from the remaining recognition functions other than the specific recognition function, the recognition result corresponding to the extracted recognition vector.

Here, the specific recognition function may be one or more of the voice, image, and natural language recognition functions; for convenience of explanation, it is assumed below to be the image recognition function.

For example, suppose the user points to an object such as an apple and says "What is this?" The multimodal recognition unit 110 then recognizes the user's voice "What is this?" through the voice recognition function, and recognizes the image_apple the user is pointing to in the captured image through the image recognition function.

The task confirmation unit 120 will, of course, analyze the sentence ("What is this?") processed by the natural language recognition function and confirm that the current recognition of the object is a recognition task rather than a learning task.

The recognition control unit 160 can then, on the basis of the previously stored learning result, extract the recognition vector (B) corresponding to the recognition data image_apple currently recognized by the specific recognition function, i.e., the image recognition function, and output, from the remaining recognition functions, i.e., the voice and natural language recognition functions, the recognition result corresponding to the extracted recognition vector (B).

In other words, in a recognition task, once the recognition control unit 160 extracts the recognition vector (B) corresponding to the recognition data image_apple recognized for the object through the image recognition function, it can identify the apple by directly outputting, from the other recognition functions (voice and natural language), the recognition result corresponding to the recognition vector (B), i.e., the multimodal recognition result (e.g., the voice "apple" and the word apple). This contrasts with the conventional approach, which had to search separate reference information for the recognition vector of another recognition function that is expressed in a different space from the recognition vector (B) and linked to it by a reference relationship.
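Because every modality shares one vector space, the recognition task reduces to a direct lookup with no cross-reference table. The following is a minimal sketch under assumptions: the stored learning results, the second object (banana), the nearest-neighbour matching, and the function `recognize` are all hypothetical illustrations rather than the patented procedure.

```python
import math

# Hypothetical stored learning results: one shared-space vector per object,
# plus the result each of the other recognition functions outputs for it.
LEARNED = {
    "apple": {"vector": [0.9, 0.1], "voice": 'spoken "apple"', "text": "apple"},
    "banana": {"vector": [0.1, 0.8], "voice": 'spoken "banana"', "text": "banana"},
}

def recognize(query_vector, source_modality, learned=LEARNED):
    """Find the stored object closest to the query vector and return the
    results of every recognition function other than the source one."""
    name = min(learned, key=lambda k: math.dist(query_vector, learned[k]["vector"]))
    entry = learned[name]
    return {m: out for m, out in entry.items()
            if m not in ("vector", source_modality)}

# Recognition vector (B) extracted from an image of an apple of a different shape.
print(recognize([0.85, 0.15], source_modality="image"))
```

The query vector does not have to equal the stored one; a nearby vector still resolves to the same object, which mirrors the "regarded as the same" condition on recognition vectors.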

This is possible because, owing to the multimodal embedding scheme proposed in the present invention, recognition vectors mapped onto one vector space (the same vector value, or values that can be regarded as the same) have been learned as the recognition results of the respective modalities (voice, image, natural language).

Furthermore, in the present invention, regardless of whether the object the user points to while saying "What is this?" is the very apple that was taught earlier or an apple of a different shape, the apple will be identified in the same way as described above: the recognition vector (B) corresponding to the recognition data image_apple recognized through the image recognition function is extracted, and the multimodal recognition result corresponding to the recognition vector (B) (e.g., the voice "apple" and the word apple) is output from the voice and natural language recognition functions.

This is possible because the deep-learning-based multimodal embedding scheme proposed in the present invention uses knowledge (information) that the apparatus has collected by itself in the course of learning the recognition vectors mapped onto one vector space (the same vector value, or values that can be regarded as the same) as the multimodal (voice, image, natural language) recognition results. It thereby obtains, directly or indirectly, an effect equivalent to self-learning, in which the apparatus goes beyond the limited learning results and learns on its own.

Accordingly, by proposing a multimodal embedding in which the recognition results of the respective modalities are expressed consistently in one space, the present invention makes the conventional alignment/synchronization work between modalities (voice, image, natural language) unnecessary during learning based on multimodal recognition (the learning task), and makes the conventional step of going through reference information unnecessary during multimodal recognition (the recognition task). As a result, the drawbacks of existing learning methods (extensive reference information and a high computational load) are mitigated.

In short, according to the present invention, proposing a new learning (self-learning) scheme that performs multimodal embedding so that the recognition results of the respective modalities are expressed consistently in one space has the effect of creating an environment in which reasonable and smart multimodal recognition is possible, with the drawbacks of the existing learning methods described above mitigated.

Hereinafter, a multimodal learning method according to a preferred embodiment of the present invention will be described with reference to FIG. 6.

For convenience of explanation, the multimodal learning method of the present invention is described as the operation method of the multimodal learning apparatus 100.

In the multimodal learning method of the present invention, i.e., the operation method of the multimodal learning apparatus 100 according to the present invention, an object is recognized through two or more different recognition functions (S110).

As in the example given above, suppose the user points to an object such as an apple and says "This is called an apple."

When the user points to an object such as an apple and says "This is called an apple" (S100), the operation method of the multimodal learning apparatus 100 according to the present invention recognizes the user's voice "This is called an apple" through the voice recognition function, recognizes the image_apple the user is pointing to in the captured image through the image recognition function, and, after processing the recognized voice ("This is called an apple") as natural language through the natural language recognition function, recognizes the word_apple that is the target of learning or recognition (S110).

Accordingly, as shown in FIG. 3, the recognition data recognized through the natural language recognition function is taken to be the word_apple, the recognition data recognized through the voice recognition function is taken to be the voice "apple" synchronized with the word_apple, and the recognition data recognized through the image recognition function is taken to be the image_apple.

The recognition data recognized through the voice, image, and natural language recognition functions are referred to below as recognition data (A) for the voice "apple", recognition data (B) for the image_apple, and recognition data (C) for the word_apple.

Meanwhile, the operation method of the multimodal learning apparatus 100 according to the present invention checks whether the current recognition of the object is a learning task (S120).

Specifically, the operation method analyzes the sentence ("This is called an apple") processed by the natural language recognition function and, if it is determined to be a sentence of a form intended to teach some object, confirms that the current recognition of the object is a learning task (S120: Yes).
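The S120 decision can be illustrated with a toy rule-based check. The text does not specify how the sentence is analyzed, so the patterns below are purely hypothetical, operate on English renderings of the example utterances, and would not generalize beyond them.

```python
def is_learning_task(sentence: str) -> bool:
    """Return True when the sentence looks like it teaches an object (S120: Yes).

    Assumed heuristic: questions are recognition tasks; declarative
    sentences of the form "This is ..." or "... is called ..." are
    learning tasks.
    """
    s = sentence.strip().lower()
    if s.endswith("?"):
        return False
    return s.startswith("this is ") or " is called " in s

print(is_learning_task("This is called an apple"))  # learning task
print(is_learning_task("What is this?"))            # recognition task
```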

The operation method of the multimodal learning apparatus 100 according to the present invention collects association recognition data related to at least one of the two or more recognition functions, i.e., the voice, image, and natural language recognition functions (S130).

That is, the operation method collects association recognition data related to the at least one recognition function, i.e., the image recognition function, namely various image_apple instances (hereinafter, recognition data (B'), (B"), ...) (S130).

The operation method then determines the multimodal learning value for the current object, the apple, on the basis of a deep learning scheme that repeats the multimodal embedding process proposed in the present invention while sequentially replacing the recognition data (B) of the at least one recognition function, i.e., the image recognition function, with the collected association recognition data, i.e., recognition data (B'), (B"), ... (S140).

More specifically, the operation method of the multimodal learning apparatus 100 according to the present invention first performs multimodal embedding on the recognition data (A), (B), and (C) recognized through the modalities, i.e., the voice, image, and natural language recognition functions.

That is, the operation method may combine the recognition data (A), (B), and (C) recognized through the voice, image, and natural language recognition functions, and embed the combined recognition data (A+B+C) using the embedding function according to Equation 1 above.

The operation method then replaces the recognition data (B) of the image recognition function with recognition data (B'), which is association recognition data, and performs multimodal embedding on the recognition data (A), (B'), and (C).

That is, the operation method may combine the recognition data (A), (B'), and (C), and embed the combined recognition data (A+B'+C) using the embedding function according to Equation 1 above.

The operation method then replaces the recognition data (B') of the image recognition function with recognition data (B"), which is association recognition data, and performs multimodal embedding on the recognition data (A), (B"), and (C).

That is, the operation method may combine the recognition data (A), (B"), and (C), and embed the combined recognition data (A+B"+C) using the embedding function according to Equation 1 above.

In this manner, the operation method of the multimodal learning apparatus 100 according to the present invention can determine the multimodal learning value for the current object, the apple, on the basis of a deep learning scheme that repeats the multimodal embedding process proposed in the present invention while sequentially replacing the recognition data (B) of the image recognition function with the collected association recognition data, i.e., recognition data (B'), (B"), ... (S140).

Thereafter, the operation method of the multimodal learning apparatus 100 according to the present invention learns, for each of the voice, image, and natural language recognition functions, a recognition vector as the recognition result for the current object, the apple, based on the multimodal learning value of the apple (S150).

That is, the operation method learns the recognition vector for the apple for each of the voice, image, and natural language recognition functions based on the single multimodal learning value obtained through the multimodal embedding process.

In this case, the recognition vectors (A), (B), and (C) are vectors learned from a single multimodal learning value expressed in one vector space, and they have identical vector values in that space or vector values that differ from one another only within a predefined range.

In other words, the recognition vectors (A), (B), and (C) are learned, as the recognition result for the current object (the apple), from the single multimodal learning value obtained through the new embedding scheme proposed in the present invention, i.e., the multimodal embedding scheme. They are therefore the same vector value (or vector values that can be regarded as the same), having the same form (vector) and mapped onto one vector space.

The operation method of the multimodal learning apparatus 100 according to the present invention then stores the learning result of the learning task for the current object, the apple (S160).

Meanwhile, the case in which a learning task is not confirmed in step S120 is described below as the case of a recognition task.

For example, it may be assumed that in step S100 the user points to an object such as an apple and says "What is this?"

In that case, in step S110 the operation method of the multimodal learning apparatus 100 according to the present invention recognizes the user's voice "What is this?" through the voice recognition function and recognizes the image_apple the user is pointing to in the captured image through the image recognition function.

The operation method then analyzes the sentence ("What is this?") processed by the natural language recognition function and confirms that the current recognition of the object is a recognition task rather than a learning task (S120: No).

When recognition data in which the aforementioned object, the apple, has been recognized through a specific recognition function among the voice, image, and natural language recognition functions is then identified, the operation method of the multimodal learning apparatus 100 according to the present invention may extract the recognition vector corresponding to the recognition data of the specific recognition function on the basis of the previously stored learning result, and output, from the remaining recognition functions other than the specific recognition function, the recognition result corresponding to the extracted recognition vector.

That is, on the basis of the previously stored learning result, the operation method may extract the recognition vector (B) corresponding to the recognition data image_apple currently recognized by the specific recognition function, i.e., the image recognition function (S180), and output, from the remaining recognition functions, i.e., the voice and natural language recognition functions, the recognition result corresponding to the extracted recognition vector (B) (S190).

In other words, in a recognition task, once the operation method of the multimodal learning apparatus 100 according to the present invention extracts the recognition vector (B) corresponding to the recognition data image_apple recognized for the object through the image recognition function, it can identify the apple by directly outputting, from the other recognition functions (voice and natural language), the recognition result corresponding to the recognition vector (B), i.e., the multimodal recognition result (e.g., the voice "apple" and the word apple). This contrasts with the conventional approach, which had to search separate reference information for the recognition vector of another recognition function that is expressed in a different space from the recognition vector (B) and linked to it by a reference relationship.

As described above, by proposing a new learning (self-learning) scheme that performs multimodal embedding so that the recognition results of the respective modalities are expressed consistently in one space, the multimodal learning apparatus and multimodal learning method of the present invention have the effect of creating an environment in which reasonable and smart multimodal recognition is possible, with the drawbacks of the existing learning methods described above mitigated.

Meanwhile, the steps of the methods, algorithms, or control functions described in connection with the embodiments presented herein may be implemented directly in hardware, or may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the present invention, or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules to perform the operations of the present invention, and vice versa.

Although the present invention has been described in detail with reference to preferred embodiments, the present invention is not limited to the embodiments described above. The technical idea of the present invention extends to the range within which anyone with ordinary skill in the art to which the invention pertains can make various changes or modifications without departing from the gist of the invention as claimed in the following claims.

The multimodal learning apparatus and multimodal learning method of the present invention propose a multimodal embedding in which, in learning based on multimodal recognition, the recognition results of the respective modalities are expressed consistently in one space. Because this goes beyond the limits of existing technology, there is sufficient possibility not only of using the related technology but also of marketing or selling the devices to which it is applied, and the invention can clearly be carried out in practice. It is therefore an industrially applicable invention.

100: multimodal learning apparatus
110: multimodal recognition unit 120: task confirmation unit
130: multimodal embedding processing unit 140: multimodal learning unit
150: data collection unit 160: recognition control unit

Claims

1. A multimodal learning device comprising:
a multimodal recognition unit that recognizes an object through two or more different recognition functions;
a task confirmation unit that confirms whether the recognition of the object is a learning task;
a multimodal embedding processing unit that combines and embeds the recognition data recognized through each of the two or more recognition functions to determine a multimodal learning value for the object in a single vector space; and
a multimodal learning unit that learns, for each of the two or more recognition functions, a recognition vector serving as the recognition result for the object, based on the multimodal learning value of the object.
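
By way of non-limiting illustration only, and not as part of the claim, the combining, embedding, and learning recited above might be sketched in Python as follows. The element-wise mean used as the combining rule, the squared-distance update, and the names `combine_and_embed`, `learn_recognition_vectors`, `lr`, and `steps` are assumptions made for this sketch; the source does not prescribe them, and an actual embodiment may instead rely on trained deep-learning encoders.

```python
from typing import Dict, List

Vector = List[float]


def combine_and_embed(recognition_data: Dict[str, Vector]) -> Vector:
    """Combine per-function recognition data into one point of a shared space.

    Assumption: the combining rule is an element-wise mean of equal-length
    feature vectors. The source does not fix a particular rule.
    """
    vectors = list(recognition_data.values())
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def learn_recognition_vectors(
    recognition_data: Dict[str, Vector],
    learning_value: Vector,
    lr: float = 0.5,
    steps: int = 20,
) -> Dict[str, Vector]:
    """Move each function's recognition vector toward the shared learning value.

    Each update is a gradient step on 0.5 * ||vector - learning_value||^2.
    """
    learned = {name: list(vec) for name, vec in recognition_data.items()}
    for _ in range(steps):
        for vec in learned.values():
            for i, target in enumerate(learning_value):
                vec[i] += lr * (target - vec[i])
    return learned


# Hypothetical recognition data for one object from three recognition functions.
data = {"image": [1.0, 0.0], "speech": [0.0, 1.0], "text": [1.0, 1.0]}
learning_value = combine_and_embed(data)
learned = learn_recognition_vectors(data, learning_value)
```

In this sketch the three learned vectors end up near the same point of the shared space, which is the property the multimodal embedding is intended to provide.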

2. The multimodal learning device of claim 1,
wherein the recognition vectors determined for the respective two or more recognition functions
are vector values that are identical to one another or that differ from one another only within a range defined in the vector space.
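
As a hedged illustration, and not as part of the claim, the condition on the recognition vectors could be checked as in the following Python sketch. The use of Euclidean distance and the names `vectors_agree` and `tolerance` are assumptions; the source only requires that the vectors be identical or differ within a defined range.

```python
import math
from itertools import combinations
from typing import Dict, List


def vectors_agree(recognition_vectors: Dict[str, List[float]], tolerance: float) -> bool:
    """Return True if every pair of recognition vectors is within `tolerance`.

    Assumption: the "range defined in the vector space" is modelled as a
    maximum Euclidean distance between any two recognition vectors.
    """
    return all(
        math.dist(a, b) <= tolerance
        for a, b in combinations(recognition_vectors.values(), 2)
    )


# Hypothetical learned vectors for one object.
close = {"image": [1.0, 2.0], "speech": [1.0, 2.05], "text": [1.02, 2.0]}
far = {"image": [1.0, 2.0], "speech": [2.0, 2.0]}
agree_close = vectors_agree(close, tolerance=0.1)
agree_far = vectors_agree(far, tolerance=0.1)
```

A tolerance of zero corresponds to the case in which the vectors are identical.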

3. The multimodal learning device of claim 1,
further comprising a data collection unit that collects association recognition data related to at least one of the two or more recognition functions.

4. The multimodal learning device of claim 3,
wherein the multimodal embedding processing unit
determines the multimodal learning value for the object based on a deep learning method that repeats the combining and embedding of the recognition data while sequentially replacing the recognition data of the at least one recognition function with the collected association recognition data.
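
The repeated replace, combine, and embed loop of this claim might look like the following Python sketch, offered only as a non-limiting illustration. It is not a deep learning method: a running mean stands in for the training update, the element-wise mean stands in for the embedding, and the names `embed`, `determine_learning_value`, `base`, `modality`, and `associated` are hypothetical.

```python
from typing import Dict, List, Optional

Vector = List[float]


def embed(recognition_data: Dict[str, Vector]) -> Vector:
    """Stand-in embedding (assumption): element-wise mean of the inputs."""
    vectors = list(recognition_data.values())
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def determine_learning_value(
    base: Dict[str, Vector], modality: str, associated: List[Vector]
) -> Vector:
    """Repeat combine-and-embed while swapping in associated data one by one.

    The original recognition data of `modality` is used first, then each
    collected associated item replaces it in turn. The multimodal learning
    value is kept as a running mean of the embeddings (assumption).
    """
    value: Optional[Vector] = None
    for n, candidate in enumerate([base[modality]] + associated, start=1):
        trial = dict(base)
        trial[modality] = candidate  # sequential replacement
        embedded = embed(trial)
        if value is None:
            value = embedded
        else:
            value = [v + (e - v) / n for v, e in zip(value, embedded)]
    return value


# Hypothetical data: one image feature, one text feature, two associated images.
base = {"image": [0.0, 0.0], "text": [2.0, 2.0]}
associated_images = [[2.0, 2.0], [4.0, 4.0]]
learning_value = determine_learning_value(base, "image", associated_images)
```

The input `base` is left unmodified, so the same recognition data can be reused when another recognition function's data is replaced in a later pass.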

5. The multimodal learning device of claim 3,
wherein the two or more recognition functions include an image recognition function, a speech recognition function, and a natural language recognition function, and
the association recognition data includes:
when related to the image recognition function, an associated image retrieved based on a word recognized through the natural language recognition function;
when related to the speech recognition function, an associated speech sample retrieved based on the word; and
when related to the natural language recognition function, an associated word retrieved based on the word.

6. The multimodal learning device of claim 1 or 4,
further comprising a recognition control unit that stores a learning result of the learning task and, when recognition data in which the object is recognized through a specific recognition function among the two or more recognition functions is identified, extracts, based on the learning result, the recognition vector corresponding to the recognition data of the specific recognition function and causes the remaining recognition functions other than the specific recognition function to output, based on the learning result, a recognition result corresponding to the extracted recognition vector.
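
For illustration only, and not as part of the claim, the recognition control described here can be sketched as a nearest-neighbour lookup over a stored learning result. The layout of the stored result (one dictionary of item-to-vector entries per recognition function), the nearest-neighbour rule, and the names `nearest` and `cross_modal_recognize` are assumptions of this sketch.

```python
import math
from typing import Dict, List

Vector = List[float]
LearningResult = Dict[str, Dict[str, Vector]]  # function -> item -> learned vector


def nearest(store: Dict[str, Vector], query: Vector) -> str:
    """Return the stored item whose learned vector is closest to `query`."""
    return min(store, key=lambda item: math.dist(store[item], query))


def cross_modal_recognize(
    learned: LearningResult, modality: str, query: Vector
) -> Dict[str, str]:
    """Recognize in one function, then output results in the remaining ones.

    1. Extract the recognition vector for the query from the learning result
       of the specific recognition function.
    2. For every other recognition function, output the stored item whose
       learned vector is nearest to that recognition vector.
    """
    key = nearest(learned[modality], query)
    recognition_vector = learned[modality][key]
    return {
        other: nearest(items, recognition_vector)
        for other, items in learned.items()
        if other != modality
    }


# Hypothetical stored learning result for two objects.
learned = {
    "image": {"cat.jpg": [0.9, 0.1], "dog.jpg": [0.1, 0.9]},
    "speech": {"cat.wav": [1.0, 0.0], "dog.wav": [0.0, 1.0]},
    "text": {"cat": [0.95, 0.05], "dog": [0.05, 0.95]},
}
result = cross_modal_recognize(learned, "image", [0.8, 0.2])
```

Because the learned vectors of the different recognition functions lie in one shared space, a query arriving through one function can be answered by the others without a separate alignment step.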

7. A multimodal learning method comprising:
a multimodal recognition step of recognizing an object through two or more different recognition functions;
a task confirmation step of confirming whether the recognition of the object is a learning task;
a multimodal embedding processing step of combining and embedding the recognition data recognized through each of the two or more recognition functions to determine a multimodal learning value for the object in a single vector space; and
a multimodal learning step of learning, for each of the two or more recognition functions, a recognition vector serving as the recognition result for the object, based on the multimodal learning value of the object.

8. The multimodal learning method of claim 7,
wherein the recognition vectors determined for the respective two or more recognition functions
are vector values that are identical to one another or that differ from one another only within a predetermined range defined in the vector space.

9. The multimodal learning method of claim 7,
further comprising a data collection step of collecting association recognition data related to at least one of the two or more recognition functions,
wherein the multimodal embedding processing step
determines the multimodal learning value for the object based on a deep learning method that repeats the combining and embedding of the recognition data while sequentially replacing the recognition data of the at least one recognition function with the collected association recognition data.

10. The multimodal learning method of claim 7 or 9, further comprising:
storing a learning result of the learning task; and
when recognition data in which the object is recognized through a specific recognition function among the two or more recognition functions is identified, extracting, based on the learning result, the recognition vector corresponding to the recognition data of the specific recognition function, and outputting, from the remaining recognition functions other than the specific recognition function and based on the learning result, a recognition result corresponding to the extracted recognition vector.