KR102122918B1

KR102122918B1 - Interactive question-answering apparatus and method thereof

Info

Publication number: KR102122918B1
Application number: KR1020170142242A
Authority: KR
Inventors: 왕지현; 김현기; 이충희; 임수종; 최미란; 박상규; 배용진; 이형직; 임준호; 장명길; 허정
Original assignee: 한국전자통신연구원
Priority date: 2016-11-25
Filing date: 2017-10-30
Publication date: 2020-06-26
Also published as: KR20180059347A

Abstract

The present invention proposes a multimodal question-answering scheme that improves on the unnatural query forms of conventional systems and removes cases where a question cannot be asked without additional knowledge, so that a user can ask questions as if talking to a person. Because content such as images, videos, and audio can be exposed to potential product buyers in the form of a question-answering service, the invention also provides an environment that can be used in the advertising market.

Description

INTERACTIVE QUESTION-ANSWERING APPARATUS AND METHOD THEREOF

The present invention relates to an interactive question-answering apparatus and method that provide a correct answer to a question.

In a conventional question-answering apparatus, an accurate response often cannot be obtained when the input query lacks the hint information needed to find the response. A conventional apparatus therefore requires, before the response is obtained, a process of composing the query so that it includes the hint information. For example, when a questioner viewing a particular work in an art museum wants to know the year the work was produced, the questioner must compose and enter a query that includes a hint such as the name of the painter or the title of the work. If the questioner does not know the painter's name or the title exactly, an appropriate query cannot be composed.

To address this problem, conventional question-answering apparatuses guide the questioner toward an appropriate query through a dialogue spanning several turns. For example, when the questioner enters a query containing a short product name, the apparatus asks the questioner to confirm the product whose name is most similar to the entered name, and this confirmation is repeated over several turns of dialogue between the questioner and the apparatus until the correct product is found.

In this way, the conventional query input scheme completes the query through a natural dialogue between the questioner and the apparatus even when the questioner does not know exact information about the target of the question.

However, this interactive query input scheme is inconvenient because of the dialogue that must be repeated over several turns between the questioner and the apparatus.

An object of the present invention, which addresses the conventional problems described above, is to provide an interactive question-answering apparatus and method that provide an accurate response using a natural query as it is, without composing the query through a repeated dialogue between the questioner and the apparatus.

An interactive question-answering method in a user terminal according to one aspect of the present invention for achieving the above object includes: receiving, by a computer processor, from a server, multimedia content and meta information about a plurality of entities in the multimedia content that are of interest to a user; receiving, by the computer processor, from a multimodal interface, a query about an entity selected by the user from among the plurality of entities; extracting, by the computer processor, identification information of the entity selected by the user from the meta information; and transmitting, by the computer processor, the identification information of the entity and the query to the server, and receiving from the server, as a response to the query, a response that includes an answer candidate constrained by the identification information of the entity.

An interactive question-answering method in a server according to another aspect of the present invention includes: generating, by a computer processor, meta information that includes identification information for a plurality of entities of interest to a user in multimedia content, and attribute information assigned to the identification information; transmitting, by the computer processor, the multimedia content and the meta information to a user terminal; receiving, by the computer processor, from the user terminal, a query about an entity selected by the user from among the plurality of entities and the identification information of the selected entity extracted from the meta information; and generating, by the computer processor, a response to the query based on the attribute information assigned to the identification information of the selected entity, and transmitting the response to the user terminal.

In an interactive question-answering apparatus including a server according to yet another aspect of the present invention, the server includes: a storage unit that stores multimedia content, meta information including identification information for a plurality of entities of interest to a user in the multimedia content, and attribute information assigned to the identification information; and a computer processor that transmits the multimedia content and the meta information to a user terminal, receives from the user terminal a query about an entity selected by the user from among the plurality of entities together with the identification information of the selected entity extracted from the meta information, generates a response to the query based on the attribute information assigned to the identification information of the selected entity, and transmits the response to the user terminal.

According to the present invention, an accurate response is provided directly for a natural query, which eliminates the cumbersome work of converting the natural query, through a repeated dialogue between the questioner and the apparatus, into an unnatural query that contains hints related to the response. Furthermore, an accurate response can be provided even for a natural query that contains no hint related to the response or very little such information.

FIG. 1 is a block diagram of an interactive question-answering apparatus according to an embodiment of the present invention.
FIG. 2 is a configuration diagram of the user terminal shown in FIG. 1.
FIG. 3 is a configuration diagram of the multimodal interface shown in FIG. 2.
FIG. 4 is a configuration diagram of the content server shown in FIG. 1.
FIG. 5 is an example of the information stored in the storage unit shown in FIG. 4.
FIG. 6 is a configuration diagram of the question-answering server shown in FIG. 1.
FIGS. 7 to 9 are diagrams showing data structures of meta information according to embodiments of the present invention.
FIG. 10 is a diagram for explaining the screen coordinates shown in FIG. 7.
FIG. 11 is a diagram showing a data structure of attribute information according to an embodiment of the present invention.
FIG. 12 is a flowchart of a question-answering method in a user terminal according to an embodiment of the present invention.
FIG. 13 is a flowchart of a question-answering method in a server according to an embodiment of the present invention.
FIG. 14 is a detailed flowchart of step S240 shown in FIG. 13.

The description below focuses on the parts needed to understand the operation and effects of the present invention. In describing the embodiments, technical content that is well known in the art to which the invention pertains and is not directly related to the invention is omitted. Omitting unnecessary description conveys the gist of the invention more clearly without obscuring it.

In describing the components of the present invention, components with the same name may be given different reference numerals in different drawings, and the same reference numeral may be used across different drawings. Even so, this does not mean that the component has different functions in different embodiments, or that it has the same function in different embodiments; the function of each component should be determined from the description of that component in the corresponding embodiment.

Before the embodiments are described, the term 'entity', which is used throughout the specification, is defined.

An 'entity' is information that a user is expected to be interested in within multimedia content including images, audio, and video, and it is information that is not contained in the multimedia content itself.

Example entity types include a place name, an event, the production year of a work, a person, a home appliance, clothing, the name of a cast actor, a producer, the clothing, shoes, or bag worn by a cast actor, a price, and a color. The types are not limited to these, and further example types may be included depending on the kind of content.

Hereinafter, an interactive question-answering apparatus according to an embodiment of the present invention is described in detail with reference to the drawings.

FIG. 1 is a block diagram of an interactive question-answering apparatus according to an embodiment of the present invention.

Referring to FIG. 1, the interactive question-answering apparatus according to an embodiment of the present invention includes a user terminal 100, a communication network 200, and a server 300.

The user terminal 100 has a communication function that allows it to connect to the communication network 200, and can communicate with the server 300 through the communication network 200.

The user terminal 100 may receive content and meta information about the content from the server 300 through the communication network 200. The content may be multimedia content in various fields such as politics, economy, society, education, broadcasting, entertainment, sports, and home shopping. The multimedia content may include image content, video content, and audio content. The meta information is information about a plurality of entities in the content that a user is predicted to be interested in, and may also be referred to as 'metadata'. The meta information may include identification information that identifies the plurality of entities. The identification information may be configured to have the attributes of the multimodal information output from the multimodal interface provided in the user terminal 100.

The user terminal 100 recognizes a multimodal input entered through the multimodal interface and generates a natural query about the entity that the user selected from among the plurality of entities.

Example types of multimodal input include voice input, keyboard (or pen) input, mouse input, pen input, touchscreen input, and gesture input. For convenience, the present invention assumes that a natural query is generated from the result of recognizing voice input or keyboard (or pen) input. The invention is not limited to this, however, and may be implemented using two or more different inputs.

The user terminal 100 extracts, from the meta information received from the server 300, the identification information of the entity selected through the multimodal interface.

The user terminal 100 transmits the extracted identification information of the entity and the natural query to the server 300 through the communication network 200, and receives a response to the query from the server 300 through the communication network 200.
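
The exchange described above can be pictured as a single request that carries both pieces of information. The following Python sketch is illustrative only: the JSON encoding and the field names `entity_id` and `query` are assumptions, since the specification does not define a message format.

```python
import json


def build_qa_request(entity_id: str, query: str) -> str:
    """Bundle the extracted entity ID with the unmodified natural query.

    The JSON layout and field names are hypothetical; the specification
    only states that both items are sent to the server together.
    """
    return json.dumps({"entity_id": entity_id, "query": query}, ensure_ascii=False)


# The query stays natural: it carries no title, artist, or product name.
request_body = build_qa_request("E001", "When was this made?")
```

The point of the sketch is that the query text is passed through untouched, while the disambiguating hint travels separately as the entity identifier.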

The identification information of the entity serves to greatly increase the probability that the response produced by the server 300 for the query is correct. That is, the server 300 constrains the answer candidates by the attributes assigned to the identification information of the entity and generates a response that includes such an answer candidate. Accordingly, unlike the conventional approach, the user terminal 100 of the present invention can omit the unnecessary work of converting a natural query into an unnatural query that contains a hint about the entity in order to raise the probability of a correct response.
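
One way to picture the constraint step is as a filter over scored answer candidates. The sketch below is a simplified assumption, not the server's actual procedure: the `Candidate` type, its fields, the scores, and the sample values are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    text: str        # candidate answer text
    entity_id: str   # entity the candidate belongs to (hypothetical field)
    score: float     # ranking score from some upstream QA model (assumed)


def constrain_candidates(candidates, entity_id):
    """Keep only candidates tied to the selected entity, best score first."""
    kept = [c for c in candidates if c.entity_id == entity_id]
    return sorted(kept, key=lambda c: c.score, reverse=True)


# Invented sample data: without the entity ID, "year B" would rank first.
candidates = [
    Candidate("year A", "E001", 0.41),
    Candidate("year B", "E002", 0.47),
    Candidate("year C", "E003", 0.39),
]
best = constrain_candidates(candidates, "E001")
```

With the entity identifier applied, a vague query such as "When was this made?" resolves to the single candidate attached to the selected entity, even though another candidate had a higher unconstrained score.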

The server 300 includes a content server 310 and a question-answering server 330.

The content server 310 may be a server operated by a content provider.

The content server 310 transmits multimedia content produced by the content provider to the user terminal 100 through the communication network 200.

The content server 310 generates meta information about a plurality of entities in the multimedia content that a user is predicted to be interested in, and transmits the meta information to the user terminal 100 together with the multimedia content.

The content server 310 receives, from the user terminal 100, the identification information of the entity that the user selected from among the plurality of entities through the multimodal interface, together with the query about the selected entity.

The question-answering server 330 receives the query and the identification information of the entity via the content server 310, and generates a response to the query based on the attribute information assigned to the identification information of the entity.

The question-answering server 330 provides the response to the user terminal 100 through the content server 310.

FIG. 2 is a configuration diagram of the user terminal shown in FIG. 1.

Referring to FIG. 2, the user terminal 100 may be implemented as a computing device having a communication function. The computing device may be, for example, a smartphone, a tablet, a laptop, a desktop PC, a wearable device, or a home appliance with a communication function such as a smart TV, a smart washing machine, or a smart refrigerator, and may also be a kiosk in a shopping mall, tourist attraction, museum, or the like.

The user terminal 100, which may be implemented as such a computing device, may include a computer processor 110, a multimodal interface 120, a communication interface 130, a memory 140, a storage unit 150, a voice output unit 160, and a bus 170 connecting them.

The computer processor 110 controls the overall operation of the user terminal 100.

The computer processor 110 may include at least one general-purpose processor that executes a number of algorithms. The general-purpose processor may include a graphics processor specialized for graphics computation. The algorithms may include, for example, algorithms related to speech recognition, speech synthesis, and image recognition. Because the present invention is not characterized by limiting these algorithms, their description is left to the known art.

The computer processor 110 analyzes the multimodal input information output from the multimodal interface 120 to recognize the multimodal input.

Based on the recognized multimodal input, the computer processor 110 generates a natural query about an entity of interest to the user in the multimedia content transmitted from the server 300.

To generate the query, the computer processor 110 may use a multimodal input that includes voice input and/or keyboard (pen) input.

As one example, the computer processor 110 may recognize the user's speech using a speech recognition algorithm and generate a query in text form from the recognition result.

As another example, the computer processor 110 may recognize keyboard (pen) input and generate a query in text form from the recognition result. To recognize keyboard input, the computer processor 110 may provide an input window for entering the query on the display screen of the user terminal 100.

Based on the result of recognizing the multimodal input, the computer processor 110 extracts the identification information of the entity of interest to the user from the meta information transmitted from the server 300.

As one example of extracting the identification information of an entity from the meta information, the computer processor 110 may receive from the multimodal interface 120 the touch coordinates of the entity selected by the user, and extract from the meta information the identification information of the entity corresponding to the received touch coordinates.
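
The coordinate-based lookup can be sketched as a hit test against per-entity screen regions. This is a minimal illustration under stated assumptions: the rectangular `EntityRegion` layout and its field names are hypothetical, since the actual meta information structure is defined in figures not reproduced here.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class EntityRegion:
    entity_id: str
    x: int        # left edge in screen coordinates (assumed layout)
    y: int        # top edge
    width: int
    height: int

    def contains(self, tx: int, ty: int) -> bool:
        """True when the touch point falls inside this rectangle."""
        return (self.x <= tx < self.x + self.width
                and self.y <= ty < self.y + self.height)


def entity_at(regions: Sequence[EntityRegion], tx: int, ty: int) -> Optional[str]:
    """Return the ID of the first region containing the touch point, if any."""
    for region in regions:
        if region.contains(tx, ty):
            return region.entity_id
    return None


# Invented meta information for one frame of content.
regions = [
    EntityRegion("E001", x=100, y=50, width=80, height=60),
    EntityRegion("E002", x=300, y=200, width=40, height=40),
]
touched = entity_at(regions, 120, 80)
missed = entity_at(regions, 5, 5)
```

A touch that lands outside every region yields no identifier, in which case a real terminal would presumably need some fallback; the specification does not describe one.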

As another example, the computer processor 110 may calculate the input time at which the user's speech corresponding to the query is entered through the multimodal interface 120, and extract from the meta information the identification information of the entity corresponding to that input time. Here, the input time of the user's speech may be the time counted from the playback start time of the multimedia content (video content or audio content). As yet another example, the computer processor 110 may calculate the input time at which the keyboard (or pen) input corresponding to the query is entered through the multimodal interface 120, and extract from the meta information the identification information of the entity corresponding to that input time.
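
The time-based lookup can be sketched as an interval search over the meta information. The `TimedEntity` layout, the half-open interval convention, and the sample timeline are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class TimedEntity:
    entity_id: str
    start_s: float   # seconds from playback start (assumed unit)
    end_s: float     # exclusive end of the interval


def input_time(playback_start_s: float, input_at_s: float) -> float:
    """Time of the user's input, counted from the start of playback."""
    return input_at_s - playback_start_s


def entity_at_time(entries: Sequence[TimedEntity], t_s: float) -> Optional[str]:
    """Return the ID of the entity whose interval covers t_s, if any."""
    for entry in entries:
        if entry.start_s <= t_s < entry.end_s:
            return entry.entity_id
    return None


# Invented timeline: E001 is on screen first, then E002.
timeline = [TimedEntity("E001", 0.0, 12.5), TimedEntity("E002", 12.5, 30.0)]
elapsed = input_time(playback_start_s=100.0, input_at_s=115.0)
selected = entity_at_time(timeline, elapsed)
```

The same lookup serves both the speech case and the keyboard (or pen) case, because only the source of the input timestamp differs.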

The multimodal interface 120 generates multiple kinds of multimodal input information about the entity that the user selected from among the plurality of entities in the multimedia content.

To generate this multimodal input information, the multimodal interface 120 includes, as shown in FIG. 3, a voice input unit 120-1, a keyboard input unit 120-3, a pen input unit 120-5, a mouse input unit 120-7, a touchscreen input unit 120-9, and a gesture input unit 120-11.

The voice input unit 120-1 converts the user's speech corresponding to the query into voice input information in digital form. Although not illustrated, it may include a voice collector such as a microphone and an audio processor that converts the speech collected by the voice collector into voice input information.

The keyboard input unit 120-3 may include a keyboard with which the user can directly type a query about the selected entity.

The pen input unit 120-5 may include an electronic pen with which the user can write a query about the selected entity directly in an input window provided on the display screen.

The mouse input unit 120-7 may include a mouse with which the user can click the desired query in a list of queries about the selected entity. The query list may be provided by the server 300. The query list may be a list of expected queries about the entity selected by the user in the multimedia content, generated through prior learning.

The touchscreen input unit 120-9 may include a touchscreen, or a display device equipped with a touch panel, that can provide the touch coordinates of the entity selected by the user.

The gesture input unit 120-11 may include a wearable device that can provide gesture input for the entity selected by the user, an acceleration sensor and a gyro sensor attached to the body, and a camera sensor that detects user movement. The gesture may be, for example, a finger gesture with which the user points to a specific entity in the content displayed on the display screen.

Referring again to FIG. 2, the communication interface 130 interfaces the communication network 200 with the user terminal 100.

The communication interface 130 converts the data or information generated by the computer processor 110 according to the communication protocol defined by the communication network 200, and transmits the converted data to the server 300 by wired or wireless communication.

The memory 140 provides a working space, that is, a memory space, in which the computer processor 110 processes the information received from the multimodal interface 120 and the information received from the server 300. The memory 140 includes volatile and non-volatile memory.

The storage unit 150 stores the multimedia content and the meta information received from the server 300.

The voice output unit 160 converts the response received from the server 300 into speech and outputs it. This conversion may be performed using a known speech synthesis algorithm.

FIG. 4 is a configuration diagram of the content server shown in FIG. 1.

Referring to FIG. 4, the content server 310 includes a computer processor 311, a memory 313, a storage unit 315, a communication interface 317, and an output unit 319.

The computer processor 311 controls the overall operation of the content server 310.

The computer processor 311 may include at least one general-purpose processor that executes a number of algorithms. The general-purpose processor may include a graphics processor specialized for graphics computation.

The computer processor 311 generates meta information about a plurality of entities of interest to the user in the multimedia content, and stores the generated meta information in the storage unit 315.

The computer processor 311 generates attribute information about the plurality of entities of interest to the user in the multimedia content, and stores the generated attribute information in the storage unit 315.

To generate the meta information, the computer processor 311 classifies the plurality of entities of interest to the user in the multimedia content. Entity classification may be performed by an entity classification model. The entity classification model is a learned model trained on the correlation between multimedia content and the entities in that content that a user is expected to be interested in. Deep learning, a kind of machine learning, may be used to learn this correlation.

The computer processor 311 composes the meta information by assigning identification information to each of the entities classified by the entity classification model.

The computer processor 311 composes attribute information, which includes an attribute name and an attribute value, for the identification information assigned to each of the entities classified by the entity classification model.

컴퓨터 프로세서(311)는 메타정보 및 속성정보를 멀티미디어 컨텐츠와 함께 저장유닛(315)에 저장한다. The computer processor 311 stores meta information and attribute information together with the multimedia content in the storage unit 315.
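The assignment of identification information to classified entities can be pictured with a minimal Python sketch. It assumes the entity classification model has already produced a list of detections (label plus bounding box); the function name `build_meta`, the record keys, the `URI_<n>` numbering scheme, and the sample file path are all hypothetical and are not taken from the specification.

```python
from itertools import count
from typing import Dict, List, Tuple

# One detection as a hypothetical classifier might report it:
# (label, top-left corner, bottom-right corner), in screen pixels.
Detection = Tuple[str, Tuple[int, int], Tuple[int, int]]


def build_meta(file_path: str, detections: List[Detection], first_id: int = 1) -> List[Dict]:
    """Assign a unique identifier to each classified entity and
    collect the results as meta information records."""
    ids = count(first_id)
    return [
        {
            "file_path": file_path,
            "top_left": top_left,
            "bottom_right": bottom_right,
            "uri": f"URI_{next(ids)}",   # hypothetical numbering scheme
            "attribute_name": label,
        }
        for label, top_left, bottom_right in detections
    ]


meta = build_meta(
    "images/shelf.jpg",  # hypothetical path
    [("bag", (10, 40), (110, 160)), ("bag", (130, 40), (230, 160))],
    first_id=49,
)
```

Each record pairs one entity with one identifier, so later lookups only need the identifier rather than a textual description of the entity.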

As shown in FIG. 5, the storage unit 315 includes a storage 315-1 in which the multimedia content is stored, a storage 315-3 in which the entity classification model is stored, and a storage 315-5 in which the meta information and the attribute information are stored.

The computer processor 311 controls the communication interface 317 to transmit the multimedia content stored in the storage unit 315 and the meta information on the multimedia content to the user terminal 100.

The computer processor 311 also controls the communication interface 340 to transmit the multimedia content stored in the storage unit 315, the meta information on the multimedia content, and the attribute information corresponding to the meta information to the question-answering server 330. Accordingly, the content server 310 and the question-answering server 330 share the multimedia content, its meta information, and the corresponding attribute information.

The communication interface 317 interfaces the content server 310 with the communication network 200. Under the control of the computer processor 311, the communication interface 317 converts the multimedia content and the meta information on the multimedia content according to the communication protocol defined by the communication network 200 and transmits the converted data to the user terminal 100 by wired or wireless communication.

The computer processor 311 receives, from the user terminal 100, the identification information of the entity selected by the user and the query about that entity, and forwards them to the question-answering server 330.

Meanwhile, the memory 313 provides a workspace in which the programs, instructions, and the like used by the computer processor 311 to generate the meta information can be executed.

The output unit 319 may include a display device that displays the meta information generated by the computer processor 311 to a server administrator and an audio device that outputs audio.

FIG. 6 is a configuration diagram of the question-answering server shown in FIG. 1.

Referring to FIG. 6, the question-answering server 330 receives, from the content server 310, the identification information of the entity that the user selected in the multimedia content and the query about that entity, generates a response to the received query based on the attribute information assigned to the entity, and transmits the generated response to the user terminal 100 through the content server 310. Alternatively, the question-answering server 330 may transmit the generated response directly to the user terminal 100 without going through the content server 310.

To this end, the question-answering server 330 includes a computer processor 331, a memory 333, a storage unit 335, a communication interface 337, and an output unit 339.

The computer processor 331 controls the overall operation of the question-answering server 330 and generates a response to the query received from the content server 310. The computer processor 331 generates the response based on the attribute information corresponding to the identification information received from the content server 310 together with the query.

To generate the response, the computer processor 331 executes a question-answering algorithm. That is, the question-answering algorithm executed by the computer processor 331 generates the response based on a database that stores the attribute information assigned to the identification information received from the content server 310. The question-answering process performed by this algorithm is described in detail below.

The memory 333 provides an execution space for the question-answering algorithm executed by the computer processor 331 and includes volatile and nonvolatile memory.

The storage unit 335 stores the meta information provided by the content server 310 and the attribute information corresponding to the meta information.

The communication interface 337 interfaces the question-answering server 330 with the content server 310.

The output unit 339 includes a display device that displays the response generated by the computer processor 331 and an audio device that outputs audio.

Although the content server 310 and the question-answering server 330 are described as separate servers in this embodiment, they may be integrated into a single server.

FIGS. 7 to 9 are diagrams illustrating data structures of the meta information generated by the content server shown in FIG. 1.

FIG. 7 shows the data structure of the meta information when the multimedia content is image content. The meta information for an image includes an image file path 71, screen coordinates 73 and 75 at which an entity included in the image is located on the display screen, a unique identifier 77 (URI) of the entity, and an entity attribute name 79. When a virtual rectangular area surrounding the entity is defined, the screen coordinates 73 and 75 consist of an upper-left coordinate 73 corresponding to the upper-left corner of the rectangular area and a lower-right coordinate 75 corresponding to its lower-right corner. FIG. 10 shows an example of the screen coordinates. The image shown on the display screen 10 of the user terminal 100 depicts three bags displayed on a display stand 12, and the entity in this image is a bag. If the touch coordinates at which the user touches the display screen 17 to select the middle bag fall within the rectangular area defined by the upper-left coordinate 73 and the lower-right coordinate 75, the user terminal recognizes the middle bag as the entity selected by the user and extracts the unique identifier (URI_50) of the recognized entity from the meta information. The user terminal 100 transmits the extracted unique identifier (URI_50) and the query about the entity to the server 300. The query may be, for example, 'How much is that bag?'. This query contains no hint information such as the product number, size, or color of the bag, but the unique identifier (URI_50) takes the place of that hint information. The user terminal 100 can therefore skip the process of converting the natural query into an unnatural one that includes hint information about the product number, size, and color of the bag.
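The touch-to-entity lookup described for FIG. 7 amounts to a point-in-rectangle test over the meta information records. The following Python sketch shows one way it could be done; the class name `ImageEntityMeta`, the function `find_entity_at`, the pixel coordinates, the file path, and the neighbouring identifiers `URI_49` and `URI_51` are hypothetical. Only the record fields and `URI_50` come from the description.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class ImageEntityMeta:
    """One meta information record for image content (fields per FIG. 7)."""
    file_path: str
    top_left: Tuple[int, int]       # upper-left coordinate (73)
    bottom_right: Tuple[int, int]   # lower-right coordinate (75)
    uri: str                        # unique identifier (77)
    attribute_name: str             # entity attribute name (79)


def find_entity_at(meta: List[ImageEntityMeta], touch: Tuple[int, int]) -> Optional[str]:
    """Return the URI of the first entity whose rectangle contains the touch point."""
    x, y = touch
    for record in meta:
        (x1, y1), (x2, y2) = record.top_left, record.bottom_right
        # Screen coordinates: x grows rightward, y grows downward.
        if x1 <= x <= x2 and y1 <= y <= y2:
            return record.uri
    return None


# Three bags on a shelf; all coordinates are made-up sample values.
shelf = [
    ImageEntityMeta("images/shelf.jpg", (10, 40), (110, 160), "URI_49", "bag"),
    ImageEntityMeta("images/shelf.jpg", (130, 40), (230, 160), "URI_50", "bag"),
    ImageEntityMeta("images/shelf.jpg", (250, 40), (350, 160), "URI_51", "bag"),
]

selected = find_entity_at(shelf, (180, 100))  # a touch on the middle bag
```

A touch that lands between rectangles returns `None`, in which case a terminal would have no identifier to send with the query.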

FIG. 8 shows the data structure of the meta information when the multimedia content is video content. The meta information for video content includes a video file path 81, a time section 83 and 85 in which an entity of interest to the user is played within the video content, a unique identifier 87 (URI) of the time section, and an entity attribute name 89 assigned to the unique identifier 87. The time section 83 and 85 consists of a playback start time 83 and a playback end time 85. If the user wants to know the name of a place appearing in the video currently being played, the query may be "Where is that place?". If the input time at which the user's utterance corresponding to the query is received falls between the playback start time 83 and the playback end time 85, the user terminal 100 extracts, from the meta information shown in FIG. 8, the unique identifier (URI_100) assigned to the time section 83 and 85 defined by the playback start time 83 and the playback end time 85. The user terminal 100 transmits the query and the unique identifier (URI_100) to the content server 310.
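For time-based content, the lookup reduces to computing the utterance time relative to the start of playback and finding the time section that contains it. The Python sketch below illustrates this under stated assumptions: times are in seconds, the names `TimedEntityMeta`, `input_time`, and `find_entity_at_time` are hypothetical, and the sample sections and file path are invented. Only the record fields and `URI_100` come from the description.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class TimedEntityMeta:
    """One meta information record for video or audio content (fields per FIG. 8)."""
    file_path: str
    start_sec: float      # playback start time (83)
    end_sec: float        # playback end time (85)
    uri: str              # unique identifier of the time section (87)
    attribute_name: str   # entity attribute name (89)


def input_time(playback_started_at: float, utterance_at: float) -> float:
    """Offset of the user's utterance from the start of playback, in seconds."""
    return utterance_at - playback_started_at


def find_entity_at_time(meta: List[TimedEntityMeta], t: float) -> Optional[str]:
    """Return the URI of the time section that contains offset t, if any."""
    for record in meta:
        if record.start_sec <= t <= record.end_sec:
            return record.uri
    return None


# Sample sections; all times and the file path are made-up values.
video_meta = [
    TimedEntityMeta("videos/trip.mp4", 0.0, 45.0, "URI_99", "person name"),
    TimedEntityMeta("videos/trip.mp4", 60.0, 95.0, "URI_100", "place name"),
]

# Playback began at clock time 1000.0 s and the user spoke at 1072.5 s.
offset = input_time(1000.0, 1072.5)
selected = find_entity_at_time(video_meta, offset)
```

The same lookup applies unchanged to audio content, since FIG. 9 uses the same time-section layout.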

FIG. 9 shows the data structure of the meta information when the multimedia content is audio content. The meta information for audio content includes an audio file path 91, a time section 93 and 95 in which an entity of interest to the user is played within the audio content, a unique identifier 97 (URI) of the time section, and an entity attribute name 99 assigned to the unique identifier 97. In the same manner as for video content, the user terminal 100 can obtain the query about the entity selected by the user in the audio content and extract the unique identifier of that entity from the meta information. For audio content, an example entity type is 'song title', and an example query is "What is the title of the song playing now?".

FIG. 11 is a diagram illustrating the data structure of the attribute information generated by the content server shown in FIG. 1.

Referring to FIG. 11, the attribute information is shared by the content server 310 and the question-answering server 330 and includes an attribute name 23 and an attribute value 25 assigned to a unique identifier 21. Two or more attribute names 23 and attribute values 25 may be assigned to the same unique identifier (URI_50). For example, the entity whose URI is 'URI_50' may be assigned three attribute names (name, creator, and year of creation) with the corresponding attribute values "The Last Judgment", "Michelangelo", and "circa 16th century", and the entity whose URI is 'URI_300' may be assigned two attribute names (name and singer) with the corresponding attribute values "LET IT BE" and "The Beatles".
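The attribute information of FIG. 11 is essentially a mapping from a unique identifier to a set of attribute name and attribute value pairs. A minimal Python sketch of such a table follows; the variable `ATTRIBUTE_INFO` and the helper `attributes_of` are hypothetical names, the in-memory dictionary is only a stand-in for whatever database an implementation would use, and the values are English renderings of the FIG. 11 example.

```python
from typing import Dict

# Unique identifier -> {attribute name: attribute value}.
ATTRIBUTE_INFO: Dict[str, Dict[str, str]] = {
    "URI_50": {
        "name": "The Last Judgment",
        "creator": "Michelangelo",
        "year of creation": "circa 16th century",
    },
    "URI_300": {
        "name": "LET IT BE",
        "singer": "The Beatles",
    },
}


def attributes_of(uri: str) -> Dict[str, str]:
    """Return a copy of the attributes assigned to a URI, or an empty dict if unknown."""
    return dict(ATTRIBUTE_INFO.get(uri, {}))


song = attributes_of("URI_300")
```

Because both servers share this information, either one can resolve an identifier received from the user terminal to the same attributes.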

In the embodiment described above, the content server 310 generates the attribute information and provides it to the question-answering server 330. Conversely, the question-answering server 330 may generate the attribute information and provide it to the content server 310. In that case, the question-answering server 330 may also generate the meta information and provide it to the content server 310.

FIG. 12 is a flowchart illustrating a question-answering method in a user terminal according to an embodiment of the present invention. Each step below is assumed to be performed by the computer processor 110 provided in the user terminal 100.

Referring to FIG. 12, first, in step S110, the computer processor 110 receives multimedia content and meta information on the multimedia content from the server 300. The meta information is information on a plurality of entities of interest to the user in the multimedia content and makes it possible to identify those entities.

Next, in step S120, the computer processor 110 receives, from the multimodal interface 120, a query about the entity that the user selected from among the plurality of entities. The query may be provided by the voice input, keyboard input, and pen input included in the multimodal interface.

Next, in step S130, the computer processor extracts the identification information of the entity selected by the user from the meta information. In one example, the extraction includes receiving, from the multimodal interface, multimodal input information that identifies the entity selected by the user; searching the meta information for the identification information of the entity corresponding to the multimodal input information; and extracting the retrieved identification information from the meta information. In another example, when the multimedia content is video content, the extraction includes calculating, relative to the playback start time of the video content, the input time at which the user's utterance corresponding to the query is received, and extracting from the meta information the identifier of the entity played at the calculated input time. In yet another example, when the multimedia content is audio content, the extraction includes calculating, relative to the playback start time of the audio content, the input time at which the user's voice corresponding to the query is received from the multimodal interface, and extracting from the meta information the identifier of the entity included in the audio content played at that input time.

Next, in step S140, the query and the identification information of the entity are transmitted to the server.

Next, in step S150, the computer processor receives a response to the query from the server. The response includes a plurality of answer candidates constrained by the identification information of the entity or by the attribute information assigned to that identification information.

FIG. 13 is a flowchart illustrating a question-answering method in a server according to an embodiment of the present invention. For convenience of description, the steps below are assumed to be performed by a single server in which the content server 310 and the question-answering server 330 are integrated. If the performers of the steps are distinguished, however, steps S210 to S230 may be performed by the content server 310 shown in FIG. 1, and steps S240 and S250 may be performed by the question-answering server 330 shown in FIG. 1.

Referring to FIG. 13, first, in step S210, the server 300 generates meta information and attribute information for the multimedia content. The meta information includes identification information for a plurality of entities of interest to the user in the multimedia content. Generating the meta information includes classifying, using a pre-trained entity classification model, the plurality of entities in the multimedia content that the user is expected to be interested in; generating the identification information for each classified entity so that it has the attributes of the multimodal input information of the multimodal interface, which allows the identification information to be compared with that multimodal input information; and generating the meta information so that it includes the generated identification information. Examples of the identification information included in the meta information are a unique identifier of each of the plurality of entities, the screen coordinates at which each entity is displayed on the display screen of the user terminal, a time section in which video content including the entities is played, a time section in which audio content including the entities is played, and an attribute name indicating the attribute of each entity. The attribute information may include the attribute name and attribute value assigned to the unique identifier (URI). For an example of attribute names and attribute values, see the description given with reference to FIG. 11.

Next, in step S220, the server 300 transmits the multimedia content and the meta information to the user terminal 100.

Next, in step S230, the server 300 receives, from the user terminal, the query about the entity that the user selected from among the plurality of entities and the identification information of that entity extracted from the meta information.

Next, in step S240, the server 300 generates a response to the query based on the attribute information (attribute name and attribute value) assigned to the identification information of the entity selected by the user. That is, the server 300 generates a response that includes answer candidates constrained by the attribute information assigned to the identification information of the selected entity.

Next, in step S250, the server 300 transmits the response to the user terminal 100.

FIG. 14 is a detailed flowchart of step S240 shown in FIG. 13.

Referring to FIG. 14, in step S240-1, the server 300 analyzes the query transmitted from the user terminal 100 and, based on the analysis result, recognizes the question focus word in the query.

The query analysis may be performed, for example, with language processing algorithms that include morphological analysis, syntax analysis, semantic analysis, and pragmatic analysis. Because the present invention is not characterized by these language processing algorithms, their description is left to the known art.

The question focus word can be defined as the word that indicates the target of the query. For example, the question focus word is 'painting' in the query "What year was this painting made?", 'bag' in the query "How much is that red bag?", 'here' in the query "Where is here?", and 'song' in the query "Who is the singer of the song playing now?".

Next, in step S240-3, the server 300 determines the attribute of the question focus word based on the identification information transmitted from the user terminal 100. For example, the server 300 queries the database in which the attribute information shown in FIG. 11 is stored and searches for an identifier identical to the identifier (URI) included in the identification information transmitted from the user terminal 100. If the identical identifier is found in the database, the attribute information assigned to it, that is, the attribute value (25 in FIG. 11), is determined as the attribute of the question focus word.

If a plurality of attribute values are assigned to the found identifier, all of them are determined as attributes of the question focus word. For example, if the identifier transmitted from the user terminal 100 is URI_50 and, as shown in FIG. 11, three attributes ('name', 'creator', and 'year of creation') are assigned to URI_50, all three attributes are determined as attributes of the question focus word.

Next, in step S240-5, the server 300 selects the word representing the determined attribute as an answer candidate and generates a response that includes the selected answer candidate. If a plurality of words represent the determined attributes, a plurality of answer candidates may be selected, and a plurality of responses, each including one of the answer candidates, may be transmitted to the user terminal 100. The user terminal presents the responses to the user on the display screen, and the user selects the desired answer from the displayed responses.
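Steps S240-1 to S240-5 can be summarized in a short Python sketch. It is only an illustration under stated assumptions: the focus-word recognizer is a trivial keyword match standing in for the language processing that the description leaves to the known art, the attribute database is an in-memory dictionary, and the names `ATTRIBUTE_DB`, `recognize_focus_word`, `generate_responses`, and the response wording are all hypothetical.

```python
from typing import Dict, List, Optional, Set

# Stand-in for the attribute database of FIG. 11 (sample values in English).
ATTRIBUTE_DB: Dict[str, Dict[str, str]] = {
    "URI_50": {
        "name": "The Last Judgment",
        "creator": "Michelangelo",
        "year of creation": "circa 16th century",
    },
}


def recognize_focus_word(query: str, vocabulary: Set[str]) -> Optional[str]:
    """S240-1 (simplified): pick the first query word found in a known vocabulary.
    A real system would use morphological, syntactic, and semantic analysis."""
    for word in query.lower().replace("?", " ").split():
        if word in vocabulary:
            return word
    return None


def generate_responses(query: str, uri: str, vocabulary: Set[str]) -> List[str]:
    """Build one response per attribute assigned to the received identifier."""
    focus = recognize_focus_word(query, vocabulary)   # S240-1
    attributes = ATTRIBUTE_DB.get(uri, {})            # S240-3: look up by URI
    if focus is None or not attributes:
        return []
    # S240-5: each attribute value becomes an answer candidate in its own response.
    return [
        f"The {name} of this {focus} is {value}."
        for name, value in attributes.items()
    ]


responses = generate_responses(
    "What year was this painting made?", "URI_50", {"painting", "bag", "song"}
)
```

With three attributes assigned to the identifier, the sketch returns three candidate responses, matching the case in which the user picks the desired answer from several displayed responses.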

The embodiments described above are merely examples, and those of ordinary skill in the art to which the present invention pertains will be able to make various modifications and variations without departing from the essential characteristics of the present invention. Accordingly, the embodiments disclosed herein are intended to describe, not to limit, the technical idea of the present invention, and the scope of that technical idea is not limited by these embodiments. The scope of protection of the present invention should be construed according to the claims below, and all technical ideas within a scope equivalent thereto should be construed as falling within the scope of the present invention.

Claims

A question-answering method in a user terminal including a computer processor that communicates with a server through a communication network, the method comprising:
receiving, by the computer processor, from the server, multimedia content and meta information on a plurality of entities in the multimedia content that the user is expected to be interested in;
receiving, by the computer processor, from a multimodal interface, a query about an entity selected by the user from among the plurality of entities;
extracting, by the computer processor, identification information of the entity selected by the user from the meta information; and
transmitting, by the computer processor, the identification information of the entity and the query to the server, and receiving from the server, as a response to the query, a response including an answer candidate constrained by the identification information of the entity,
wherein the meta information is generated by
the server classifying the plurality of entities using a pre-trained entity classification model and generating the identification information for each of the classified entities based on the multimodal interface.

The interactive question-answering method in a user terminal of claim 1, wherein the extracting comprises:
receiving, from the multimodal interface, multimodal input information identifying the entity selected by the user;
searching the meta information for the identification information of the entity corresponding to the multimodal input information; and
extracting the retrieved identification information from the meta information.

The interactive question-answering method in a user terminal of claim 1, wherein, when the multimedia content is image content displayed on a display screen of the user terminal, the extracting comprises:
receiving, from the multimodal interface, touch coordinates of the entity selected by the user; and
extracting, from the meta information, the identification information of the entity corresponding to the touch coordinates.

The interactive question-answering method in a user terminal of claim 1, wherein, when the multimedia content is video content, the extracting comprises:
calculating, relative to a playback start time of the video content, an input time at which a user voice corresponding to the query is input from the multimodal interface; and
extracting, from the meta information, the identification information of the entity played at the calculated input time.

The interactive question-answering method in a user terminal of claim 1, wherein, when the multimedia content is audio content, the extracting comprises:
calculating, relative to a playback start time of the audio content, an input time at which a user voice corresponding to the query is input from the multimodal interface; and
extracting, from the meta information, the identification information of the entity included in the audio content played at the input time.

The interactive question-answering method in a user terminal of claim 1, wherein the identification information
includes a unique identifier of each of the plurality of entities, coordinates of a region in which each of the plurality of entities is displayed on the display screen of the user terminal, a time section in which video content including the plurality of entities is played, a time section in which audio content including the plurality of entities is played, and an attribute name indicating the attribute of each of the plurality of entities.

An interactive question-and-answer method in a server including a computer processor communicating with a user terminal through a communication network, the method comprising:
generating, by the computer processor, meta information including identification information for a plurality of entities of interest to the user in multimedia content and attribute information assigned to the identification information;
transmitting, by the computer processor, the multimedia content and the meta information to the user terminal;
receiving, from the user terminal, a query statement about an entity selected by the user from among the plurality of entities and identification information of the selected entity extracted from the meta information; and
generating, by the computer processor, a response to the query statement based on the attribute information assigned to the identification information of the entity selected by the user, and transmitting the response to the user terminal,
wherein the generating of the meta information includes:
classifying, using a previously trained entity classification model, a plurality of entities in the multimedia content that are expected to be of interest to the user;
generating the identification information for each of the classified entities based on a multi-modal interface; and
generating the meta information including the generated identification information.
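The meta-information generation steps recited above might be sketched as follows. This is a rough illustration under stated assumptions: the trained entity classification model is stood in for by an arbitrary callable (`is_of_interest`), and the `Candidate` record, the dictionary keys, the identifier scheme, and the use of a label as the attribute name are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in screen pixels


@dataclass(frozen=True)
class Candidate:
    """Hypothetical candidate entity found in an image."""
    label: str
    box: Box


def build_meta_info(
    candidates: List[Candidate],
    is_of_interest: Callable[[Candidate], bool],
) -> List[Dict[str, object]]:
    """Keep the candidates the classifier accepts and assign each an identifier."""
    meta: List[Dict[str, object]] = []
    for candidate in candidates:
        if not is_of_interest(candidate):
            continue  # classified as unlikely to interest the user
        meta.append({
            "id": f"entity-{len(meta) + 1}",    # unique identifier
            "region": candidate.box,            # matches touch input
            "attribute_name": candidate.label,  # assumed attribute name
        })
    return meta


# Stand-in for a trained model: accept only two labels.
def demo_classifier(candidate: Candidate) -> bool:
    return candidate.label in {"painting", "sculpture"}


demo_candidates = [
    Candidate("painting", (100, 50, 400, 300)),
    Candidate("wall", (0, 0, 800, 600)),
    Candidate("sculpture", (450, 80, 600, 350)),
]
demo_meta = build_meta_info(demo_candidates, demo_classifier)
```

Storing a screen region in each entry reflects the idea that the identification information should be directly comparable with what the multi-modal interface reports, here a touch position.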

delete

The method of claim 7, wherein the generating of the identification information based on the multi-modal interface includes:
generating the identification information to have an attribute of the multi-modal input information output from the multi-modal interface, so that the multi-modal input information and the identification information can be compared; and
generating the meta information to include the generated identification information.

The method of claim 7, wherein the identification information includes a unique identifier of each of the plurality of entities, screen coordinates at which the plurality of entities are located on a display screen, a time period in which entities included in video content are played, a time period in which entities included in audio content are played, and an attribute name indicating an attribute of each of the plurality of entities.

The method of claim 7, wherein the transmitting of the response to the user terminal includes:
analyzing the query statement transmitted from the user terminal and recognizing a question-centered vocabulary in the query statement;
determining attribute information of the question-centered vocabulary based on the identification information transmitted from the user terminal; and
selecting the determined attribute information as a correct-answer candidate and generating the response including the selected correct-answer candidate.

The method of claim 11, wherein the determining of the attribute information includes:
searching a database in which the attribute information is stored for an identifier identical to the identifier included in the identification information transmitted from the user terminal; and
when the identical identifier is found in the database, determining the attribute information assigned to the found identifier as the attribute information of the question-centered vocabulary.
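The server-side answer generation recited in the two preceding claims might be sketched as follows. This is a simplified illustration: the keyword table that stands in for recognition of the question-centered vocabulary, the in-memory `ATTRIBUTE_DB`, and all names and values are placeholders assumed for this sketch, not part of the claimed method.

```python
from typing import Dict, Optional

# Hypothetical attribute database: identifier -> attribute name -> value.
ATTRIBUTE_DB: Dict[str, Dict[str, str]] = {
    "painting-01": {"artist": "Example Artist", "year": "1900"},
}

# Assumed mapping from a question-centered word to an attribute name.
FOCUS_TO_ATTRIBUTE: Dict[str, str] = {"who": "artist", "when": "year"}


def recognize_focus(query: str) -> Optional[str]:
    """Return the first known question-centered word in the query, if any."""
    for word in query.lower().split():
        token = word.strip("?.,!")
        if token in FOCUS_TO_ATTRIBUTE:
            return token
    return None


def answer(query: str, entity_id: str) -> Optional[str]:
    """Look up the attribute that the question asks about for the given entity."""
    focus = recognize_focus(query)
    attributes = ATTRIBUTE_DB.get(entity_id)  # search for the same identifier
    if focus is None or attributes is None:
        return None
    name = FOCUS_TO_ATTRIBUTE[focus]
    value = attributes.get(name)  # correct-answer candidate
    if value is None:
        return None
    return f"The {name} is {value}."
```

Because the identifier arrives with the query, the question itself can stay as short as "Who painted this?" without naming the work.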

An interactive question-and-answer apparatus including a server that includes a computer processor communicating with a user terminal through a communication network, wherein the server includes:
a storage unit that stores multimedia content, meta information including identification information for a plurality of entities of interest to the user in the multimedia content, and attribute information assigned to the identification information; and
the computer processor, which transmits the multimedia content and the meta information to the user terminal, receives, from the user terminal, a query statement about an entity selected by the user from among the plurality of entities and identification information of the selected entity extracted from the meta information, generates a response to the query statement based on the attribute information assigned to the identification information of the selected entity, and transmits the response to the user terminal,
wherein the computer processor classifies, using a pre-trained entity classification model, a plurality of entities in the multimedia content that are expected to be of interest to the user, generates the identification information for each of the classified entities based on a multi-modal interface, and generates the meta information including the generated identification information and stores the meta information in the storage unit.

delete

The apparatus of claim 13, wherein the computer processor generates the identification information to have an attribute of the multi-modal input information output from the multi-modal interface, so that the multi-modal input information and the identification information can be compared.

The apparatus of claim 13, wherein the computer processor generates the meta information to include the identification information and stores the meta information in the storage unit, the identification information including a unique identifier of each of the plurality of entities, screen coordinates at which the plurality of entities are located on a display screen, a time period in which entities included in video content are played, a time period in which entities included in audio content are played, and an attribute name indicating an attribute of each of the plurality of entities.

The apparatus of claim 13, wherein the computer processor analyzes the query statement transmitted from the user terminal, recognizes a question-centered vocabulary in the query statement, determines attribute information of the question-centered vocabulary based on the identification information transmitted from the user terminal, selects the determined attribute information as a correct-answer candidate, and generates the response including the selected correct-answer candidate.

The apparatus of claim 17, wherein the computer processor searches the storage unit in which the attribute information is stored for an identifier identical to the identifier included in the identification information transmitted from the user terminal and, when the identical identifier is found in the storage unit, determines the attribute information assigned to the found identifier as the attribute information of the question-centered vocabulary.