JP2022135734A

JP2022135734A - Program, device, and method for interacting in small-talk style by using multi-modal knowledge graph

Info

Publication number: JP2022135734A
Application number: JP2021035724A
Authority: JP
Inventors: 博楊; Hiroshi Yo; 剣明呉; Jiangming Wu; 元服部; Hajime Hattori
Original assignee: KDDI Corp
Current assignee: KDDI Corp
Priority date: 2021-03-05
Filing date: 2021-03-05
Publication date: 2022-09-15
Anticipated expiration: 2041-03-05
Also published as: JP7486263B2

Abstract

To provide a device, a method, and a program for interacting in a small-talk style by using a multi-modal knowledge graph. SOLUTION: An interaction device uses, as teacher data, a knowledge graph in which entity words associated with entity object images are linked by relational words, and has, during training: an utterance feature vector generation unit that generates an utterance feature vector from an utterance sentence of the teacher data; a knowledge graph searching unit that detects one or more utterance entity words included in the utterance sentence of the teacher data and searches the knowledge graph for the entity words linked from the utterance entity words by the relational words; a knowledge feature vector generation unit that generates a knowledge feature vector from the entity words, the entity object images, and the relational words; a coupling layer that couples the utterance feature vector and the knowledge feature vector to generate a coupled utterance feature vector; a response feature vector generation unit that generates a response feature vector from a response sentence and a response object image of the teacher data corresponding to the utterance sentence of the teacher data; and an encoder-decoder that receives input of the coupled utterance feature vector and outputs the response feature vector. SELECTED DRAWING: Figure 1

Description

The present invention relates to dialogue agent technology for realizing natural dialogue with a user.

Dialogue systems that interact with users are generally text-based. A terminal functions as the user interface and transmits the user's speech to the dialogue system. The dialogue system generates a response sentence that forms a natural dialogue with the utterance sentence and returns it to the terminal. The terminal then presents the response sentence to the user as voice or text. Examples of such dialogue systems include "Siri (registered trademark)" and "Shabette Concier (registered trademark)".

By contrast, multimodal dialogue systems are anticipated. Such a dialogue system can exchange dialogue with a user in multiple communication modes, such as text, voice, and images. In particular, a chat dialogue system using AI (Artificial Intelligence) can return natural response sentences in accordance with multimodal information, and is expected to increase the user's motivation for dialogue.

There are also dialogue system technologies that make use of a knowledge graph in order to realize dialogue exchanges rich in knowledge. A "knowledge graph" is a graph created by describing the relationships between entities. That is, it is created by treating entity words as "nodes" and the relational words between entity words as "links".
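As an illustration only, the node-and-link structure described above can be sketched in Python as a list of (entity word, relational word, entity word) triples indexed by the entity word at the start of each link. The example triples and the name `build_graph` are hypothetical and are not taken from the specification or the cited documents.

```python
from collections import defaultdict

# Hypothetical triples (head entity word, relational word, tail entity word).
TRIPLES = [
    ("elephant", "feature", "long trunk"),
    ("elephant", "attribute", "mammal"),
    ("mammal", "feature", "warm-blooded"),
]


def build_graph(triples):
    """Map each entity word (node) to its outgoing (relational word, entity word) links."""
    graph = defaultdict(list)
    for head, relation, tail in triples:
        graph[head].append((relation, tail))
    return dict(graph)


graph = build_graph(TRIPLES)
```

In this sketch an entity word that only appears at the end of a link (such as "long trunk") has no entry of its own, which keeps the lookup table small at the cost of requiring `graph.get(word, [])` when traversing.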

Conventionally, there is a technique for automatically generating response sentences in text-based dialogue by taking into account concept transitions in a knowledge graph (see, for example, Non-Patent Document 1).
There is also a technique for constructing a knowledge graph from multi-domain topics (movies, music, travel) (see, for example, Non-Patent Document 2). This technique uses the chat dialogue corpus KdConv to generate response sentences that incorporate knowledge.
Furthermore, there is a technique for generating dialogue response sentences using a knowledge graph for a specific task (see, for example, Non-Patent Document 3). With this technique, the service center of an online mall uses a product knowledge graph consisting of dialogue sentences and product photographs to generate response sentences for multimodal dialogue between a user and a sales operator.
Furthermore, there is a technique for generating main concepts from a user's utterance sentence and generating a response sentence by referring to both a task knowledge base and a general knowledge base (see, for example, Patent Document 1).

Patent Document 1: JP 2017-224204 A

Non-Patent Document 1: Houyu Zhang, Zhenghao Liu, Chenyan Xiong, Zhiyuan Liu, "Grounded Conversation Generation as Guided Traverses in Commonsense Knowledge Graphs" (2020), [online], [retrieved February 21, 2021], Internet <URL: https://arxiv.org/pdf/1911.02707.pdf>
Non-Patent Document 2: Hao Zhou, Chujie Zheng, Kaili Huang, Minlie Huang, Xiaoyan Zhu, "KdConv: A Chinese Multi-domain Dialogue Dataset Towards Multi-turn Knowledge-driven Conversation" (2020), [online], [retrieved February 21, 2021], Internet <URL: https://www.aclweb.org/anthology/2020.acl-main.635.pdf>
Non-Patent Document 3: Lizi Liao, Yunshan Ma, Xiangnan He, Richang Hong, Tat-Seng Chua, "Knowledge-aware Multimodal Dialogue Systems" (2020), [online], [retrieved February 21, 2021], Internet <URL: https://nextcenter.org/wp-content/uploads/2020/04/Knowledge-Aware.pdf>

In the technique described in Non-Patent Document 1, the concept transitions of the knowledge graph merely connect a limited vocabulary of words such as book, bag, hope, based, and future. Related topics (related explanatory text) are therefore not described as knowledge, and applying such a knowledge graph does not enable a chat-like dialogue that includes knowledge.
In the technique described in Non-Patent Document 2, the knowledge graph is limited to text, so a multimodal chat dialogue including images is not possible.
The technique described in Non-Patent Document 3 applies a knowledge graph intended for a predetermined task, such as the sale of goods and services, and cannot conduct a multimodal chat dialogue based on abundant knowledge.
The technique described in Patent Document 1 is a rule-based response generation method and does not automatically generate response sentences from a large amount of teacher data. Moreover, both the task knowledge base and the general knowledge base consist only of words such as soda, code, tea, hot, and soup, and do not go so far as to describe related topics.

In view of this, the inventors of the present application considered whether a multimodal chat dialogue including images could be realized by constructing a knowledge graph that includes related topics and related images.

Accordingly, an object of the present invention is to provide a program, a device, and a method for interacting in a small-talk style using a multimodal knowledge graph.

According to the present invention, a program causes a computer to function so as to interact with a user, wherein,
using, as teacher data,
a dialogue history including a plurality of sets of an utterance sentence, and a response sentence and a response target image, and
a multimodal knowledge graph in which entity words associated with entity target images are linked to one another by relational words,
the program causes the computer to function, during training, as:
utterance feature vector generation means for generating an utterance feature vector from an utterance sentence of the teacher data;
knowledge graph search means for detecting one or more utterance entity words contained in the utterance sentence of the teacher data and searching, using the knowledge graph, for entity words linked from the utterance entity words by relational words;
knowledge feature vector generation means for generating a knowledge feature vector from the entity words, the entity target images, and the relational words;
a coupling layer that couples the utterance feature vector and the knowledge feature vector to generate a coupled utterance feature vector;
response feature vector generation means for generating a response feature vector from the response sentence and the response target image of the teacher data corresponding to the utterance sentence of the teacher data; and
an encoder-decoder trained to receive the coupled utterance feature vector as input and to output the response feature vector.

According to another embodiment of the program of the present invention, it is also preferable that
the knowledge graph accumulation means retrieves images from a search site using the entity words and relational words of the knowledge graph as keys, and associates the retrieved images with the corresponding entity words.

According to another embodiment of the program of the present invention, it is also preferable that, during dialogue,
an utterance sentence serving as target data is input,
the utterance feature vector generation means generates an utterance feature vector from the utterance sentence of the target data,
the knowledge graph search means detects one or more utterance entity words contained in the utterance sentence of the target data and searches, using the knowledge graph, for entity words and entity target images linked from the utterance entity words by relational words,
the knowledge feature vector generation means generates a knowledge feature vector from the entity words, entity target images, and relational words retrieved by the knowledge graph search means,
the encoder-decoder receives the coupled utterance feature vector as input and outputs a response feature vector, and
the response feature vector generation means receives the response feature vector as input and outputs a response sentence and a response target image.

According to another embodiment of the program of the present invention, it is also preferable that
an utterance target image is associated with the utterance sentence.

According to another embodiment of the program of the present invention, it is also preferable that
the knowledge graph search means searches, using the knowledge graph, for entity words and entity target images linked by relational words within a predetermined number of hops, being one or more, from the utterance entity word.

According to another embodiment of the program of the present invention, it is also preferable that
the encoder-decoder is trained so as to minimize the loss between the response feature vector output from the encoder-decoder and the response feature vector generated by the response feature vector generation means.
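As a minimal sketch, the loss between the response feature vector output by the encoder-decoder and the response feature vector generated from the teacher data could be computed as a mean squared error. The specification does not name a loss function, so the choice of mean squared error and the name `mse_loss` are assumptions made purely for illustration.

```python
from typing import Sequence


def mse_loss(predicted: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two feature vectors of equal length."""
    if len(predicted) != len(target):
        raise ValueError("feature vectors must have the same length")
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted)


# Hypothetical vectors: encoder-decoder output versus teacher-data response vector.
loss = mse_loss([1.0, 2.0], [1.0, 4.0])
```

Training would then adjust the encoder-decoder parameters so that this value decreases over the teacher data.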

According to another embodiment of the program of the present invention, it is also preferable that
the knowledge feature vector generation means is a GNN (Graph Neural Network).

According to the present invention, a dialogue device that interacts with a user,
using, as teacher data,
a dialogue history including a plurality of sets of an utterance sentence, and a response sentence and a response target image, and
a multimodal knowledge graph in which entity words associated with entity target images are linked to one another by relational words,
has, during training:
utterance feature vector generation means for generating an utterance feature vector from an utterance sentence of the teacher data;
knowledge graph search means for detecting one or more utterance entity words contained in the utterance sentence of the teacher data and searching, using the knowledge graph, for entity words linked from the utterance entity words by relational words;
knowledge feature vector generation means for generating a knowledge feature vector from the entity words, the entity target images, and the relational words;
a coupling layer that couples the utterance feature vector and the knowledge feature vector to generate a coupled utterance feature vector;
response feature vector generation means for generating a response feature vector from the response sentence and the response target image of the teacher data corresponding to the utterance sentence of the teacher data; and
an encoder-decoder trained to receive the coupled utterance feature vector as input and to output the response feature vector.

According to the present invention, in a dialogue method of a device that interacts with a user,
using, as teacher data,
a dialogue history including a plurality of sets of an utterance sentence, and a response sentence and a response target image, and
a multimodal knowledge graph in which entity words associated with entity target images are linked to one another by relational words,
the device executes, during training:
a first step of generating an utterance feature vector from an utterance sentence of the teacher data;
a second step of detecting one or more utterance entity words contained in the utterance sentence of the teacher data and searching, using the knowledge graph, for entity words linked from the utterance entity words by relational words;
a third step of generating a knowledge feature vector from the entity words, the entity target images, and the relational words;
a fourth step of coupling the utterance feature vector and the knowledge feature vector to generate a coupled utterance feature vector;
a fifth step of generating a response feature vector from the response sentence and the response target image of the teacher data corresponding to the utterance sentence of the teacher data; and
a sixth step of training an encoder-decoder to receive the coupled utterance feature vector as input and to output the response feature vector.

According to the program, device, and method of the present invention, it is possible to interact in a small-talk style using a multimodal knowledge graph.

FIG. 1 is a functional configuration diagram of the dialogue device of the present invention during training.
FIG. 2 is an explanatory diagram showing a dialogue history as teacher data during training.
FIG. 3 is a first explanatory diagram showing a knowledge graph as teacher data during training.
FIG. 4 is a second explanatory diagram showing a knowledge graph as teacher data during training.
FIG. 5 is an explanatory diagram showing the training of feature vectors in the dialogue device of the present invention.
FIG. 6 is a functional configuration diagram of the dialogue device of the present invention during dialogue.
FIG. 7 is an explanatory diagram showing a first dialogue example in the present invention.
FIG. 8 is an explanatory diagram showing a second dialogue example in the present invention.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.

FIG. 1 is a functional configuration diagram of the dialogue device of the present invention during training.

According to FIG. 1, the dialogue device 1 can realize natural, chat-like dialogue with a user by using a multimodal knowledge graph. The dialogue device 1 is equipped with a plurality of machine learning engines, and its operation is divided into <during training> and <during dialogue>. During training of the machine learning engines, the dialogue device 1 builds a learning model from <teacher data>.

The dialogue device 1 of the present invention trains on both text and images in a cross-modal manner, using a distributed representation generator (embedder) and an encoder-decoder in a deep learning model. This makes it possible to generate a multimodal response, consisting of a response sentence and a response target image, to an utterance sentence (and an utterance target image).

<Teacher data>
According to FIG. 1, the dialogue device 1 has, for the teacher data, a dialogue history accumulation unit 100 and a knowledge graph accumulation unit 101.

[Dialogue history accumulation unit 100]
The dialogue history accumulation unit 100 accumulates, as teacher data, a "dialogue history" including a plurality of sets of at least an "utterance sentence", and a "response sentence" and a "response target image". An "utterance target image" may also be associated with the utterance sentence. That is, the dialogue history consists of multimodal information in which users also exchange images.
The dialogue history is a series of dialogue sentences exchanged in large quantities between users in the past. According to the present invention, a "response target image" is associated with at least the response sentence, and an "utterance target image" may be associated with the utterance sentence.
Of course, the dialogue device 1 need not hold the dialogue history accumulation unit 100 itself, but in that case the dialogue history must be input from outside during training.

FIG. 2 is an explanatory diagram showing a dialogue history as teacher data during training.

According to FIG. 2, users A and B exchange dialogue sentences accompanied by images. An image here may be an image clipped from video being viewed during the dialogue, an image captured by a camera, or an image found by searching the Internet.
According to FIG. 2, the dialogue proceeds as follows.
・・・・・・・・・・・・・・・・・・・・
User B: What kind of TV programs do you like?
User A: I guess I like wild animals more than dogs and cats.
User B: Lions? (lion image)
User A: No, something like an elephant parent and child is cute, isn't it? (image of an elephant parent and child)
・・・・・・・・・・・・・・・・・・・・
According to the present invention, a dialogue history between users, which is an exchange of multimodal information including not only text but also images, is used as teacher data.
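The following is a sketch, under stated assumptions, of how such a dialogue history could be turned into training sets of (utterance, response). It assumes that each turn is treated as the utterance for the turn that follows it, and that only pairs whose response carries an image are kept, since the teacher data requires a response target image. The `Turn` class, the file names, and `to_training_pairs` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Turn:
    speaker: str
    text: str
    image: Optional[str] = None  # hypothetical image file name, if any


def to_training_pairs(history: List[Turn]) -> List[Tuple[Turn, Turn]]:
    """Pair each turn with the next one, keeping pairs whose response has an image."""
    pairs = []
    for utterance, response in zip(history, history[1:]):
        if response.image is not None:
            pairs.append((utterance, response))
    return pairs


# The dialogue of FIG. 2, with made-up file names for the images.
history = [
    Turn("B", "What kind of TV programs do you like?"),
    Turn("A", "I guess I like wild animals more than dogs and cats."),
    Turn("B", "Lions?", image="lion.jpg"),
    Turn("A", "No, something like an elephant parent and child is cute, isn't it?",
         image="elephant_parent_and_child.jpg"),
]
pairs = to_training_pairs(history)
```

With this history, two pairs remain: user A's second line paired with the lion response, and the lion utterance (with its image) paired with the elephant response.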

[Knowledge graph accumulation unit 101]
The knowledge graph accumulation unit 101 accumulates a multimodal "knowledge graph" in which entity words associated with entity target images are linked to one another by relational words. A sentence serving as a related topic may also be associated with an entity word.

One general knowledge graph is the chat dialogue corpus "KdConv" described in Non-Patent Document 2. However, KdConv does not associate images with entity words and is therefore not multimodal.
In contrast, the knowledge graph of the present invention associates images with entity words and is thus constructed as a multimodal knowledge graph.

The knowledge graph accumulation unit 101 may retrieve images from a search site using the entity words and relational words of the knowledge graph as keys, and associate the retrieved images with the corresponding entity words. For example, for a chat dialogue corpus such as KdConv, images retrieved using an entity word and a relational word as keys can be associated with that entity word.
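One possible reading of this step is sketched below. It assumes that the search key is the entity word followed by the relational word, and that the top hit is attached to the entity word used in the key. The search site itself is replaced by a stand-in function, `fake_search`, which returns placeholder identifiers; `attach_images`, the triples, and the `img://` identifiers are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Triple = Tuple[str, str, str]  # (entity word, relational word, entity word)


def attach_images(
    triples: List[Triple],
    search_images: Callable[[str], List[str]],
) -> Dict[str, List[str]]:
    """Search once per (entity word, relational word) key and attach the top hit
    to the entity word used in that key."""
    images: Dict[str, List[str]] = {}
    for head, relation, _tail in triples:
        hits = search_images(f"{head} {relation}")
        bucket = images.setdefault(head, [])
        if hits and hits[0] not in bucket:
            bucket.append(hits[0])
    return images


def fake_search(query: str) -> List[str]:
    # Stand-in for a real search site; returns a placeholder identifier only.
    return [f"img://{query.replace(' ', '_')}"]


triples = [
    ("elephant", "feature", "long trunk"),
    ("elephant", "origin", "example origin text"),
]
entity_images = attach_images(triples, fake_search)
```

Passing the search function in as a parameter keeps the sketch independent of any particular search service.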

In another embodiment, Wikipedia (registered trademark), for example, can be used as a large-scale knowledge graph. Wikipedia is searched, and a part or a summary of the retrieved text is linked as a node.

FIG. 3 is a first explanatory diagram showing a knowledge graph as teacher data during training.
According to FIG. 3, a plurality of entity words linked by relational words are shown, as seen from the entity word "elephant". "Elephant" is linked, through the relational words "total length", "feature", "origin", "creative work", and "attribute", to an entity word (including a related topic) at the end of each link.

FIG. 4 is a second explanatory diagram showing a knowledge graph as teacher data during training.
According to FIG. 4, a plurality of entity words linked by relational words are shown, as seen from the entity word "Road to Heaven". "Road to Heaven" is linked, through the relational words "total length", "feature", "origin", "nearby sightseeing spots", and "location", to an entity word (including a related topic) at the end of each link.

<During training>
According to FIG. 1, the dialogue device 1 has an utterance feature vector generation unit 11, a knowledge graph search unit 12, a knowledge feature vector generation unit 13, a coupling layer 14, a response feature vector generation unit 15, and an encoder-decoder 16. These functional components are realized by executing a program that causes a computer mounted on the device to function. The processing flow of these functional components can also be understood as a training method for the dialogue device.

FIG. 5 is an explanatory diagram showing the training of feature vectors in the dialogue device of the present invention.

[Utterance feature vector generation unit 11]
The utterance feature vector generation unit 11 generates an utterance feature vector from an utterance sentence of the teacher data. The unit receives the utterance sentence and the utterance target image accompanying it, and generates an utterance feature vector from each. The generated utterance feature vectors are input to the coupling layer 14.

Specifically, the utterance feature vector for the utterance sentence is obtained by applying a distributed representation generation algorithm (embedding) such as BERT (registered trademark) or GPT-2 (registered trademark) to convert the sentence into a high-dimensional vector. The utterance feature vector for the utterance target image is obtained, specifically, by applying VisualBERT (registered trademark).

According to the dialogue history of FIG. 2 described above, the utterance feature vector generation unit 11 receives user A's utterance sentence in the dialogue history, "Something like an elephant parent and child is cute, isn't it?", together with the utterance target image "elephant parent and child", and generates an utterance feature vector.

[Knowledge graph search unit 12]
The knowledge graph search unit 12 detects one or more utterance entity words contained in the utterance sentence of the teacher data and searches, using the knowledge graph, for entity words linked from the utterance entity words by relational words. The retrieved entity words, entity target images, and relational words are output to the knowledge feature vector generation unit 13.
Here, only the entity words linked within one hop (the predetermined number of hops) of relational words from the utterance entity word may be retrieved. Other entity words within one hop can be regarded as knowledge that is highly relevant to the utterance entity word.

According to FIG. 2 described above, the knowledge graph search unit 12 derives the utterance entity word "elephant" from user A's utterance sentence in the dialogue history, "Something like an elephant parent and child is cute, isn't it?". The knowledge graph search unit 12 then uses the knowledge graph accumulation unit 101 to search for other entity words linked from the utterance entity word "elephant" by relational words. In this example, only the entity words linked by relational words within one hop of the utterance entity word are retrieved.
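The detection and hop-limited search can be sketched as follows. The sketch assumes a naive substring match for detecting utterance entity words and a breadth-first traversal bounded by a hop count; the specification does not prescribe either. The graph contents and the names `detect_entities` and `search_within_hops` are hypothetical.

```python
from collections import deque

# Hypothetical graph: entity word -> list of (relational word, entity word).
graph = {
    "elephant": [("feature", "long trunk"), ("attribute", "mammal")],
    "mammal": [("feature", "warm-blooded")],
}


def detect_entities(utterance, graph):
    """Naive detection: entity words of the graph that occur in the utterance."""
    lowered = utterance.lower()
    return [entity for entity in graph if entity in lowered]


def search_within_hops(graph, start, max_hops=1):
    """Breadth-first search collecting (head, relation, tail) links within max_hops."""
    found = []
    visited = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the hop limit
        for relation, tail in graph.get(node, []):
            found.append((node, relation, tail))
            if tail not in visited:
                visited.add(tail)
                queue.append((tail, depth + 1))
    return found


entities = detect_entities("Elephant families are cute", graph)
one_hop = search_within_hops(graph, "elephant", max_hops=1)
two_hop = search_within_hops(graph, "elephant", max_hops=2)
```

With `max_hops=1` only the links leaving "elephant" are returned; raising the limit to 2 also returns the link from "mammal" to "warm-blooded".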

[Knowledge feature vector generation unit 13]
The knowledge feature vector generation unit 13 generates a knowledge feature vector from the entity words, entity target images, and relational words retrieved by the knowledge graph search unit 12. The knowledge feature vector is input to the coupling layer 14.

The knowledge feature vector generation unit 13 may be a GNN (Graph Neural Network).
Whereas a CNN (Convolutional Neural Network) convolves, for example, information from the eight directions around a position in an image (up, down, left, right, and the diagonals), a GNN convolves the information of a node with that of the other nodes linked to it.
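To make the idea of aggregating a node with its linked nodes concrete, the following is a deliberately simplified sketch of one aggregation round. It averages each node's vector with its neighbours' vectors and omits the learned weight matrices and non-linearities that a real GNN layer would apply. The feature values, the edge list, and the name `message_passing_step` are hypothetical.

```python
def message_passing_step(features, edges):
    """One round: each node's new vector is the mean of its own and its neighbours' vectors."""
    neighbours = {node: [] for node in features}
    for a, b in edges:  # links are treated as undirected in this sketch
        neighbours[a].append(b)
        neighbours[b].append(a)
    updated = {}
    for node, vec in features.items():
        group = [vec] + [features[n] for n in neighbours[node]]
        updated[node] = [sum(column) / len(group) for column in zip(*group)]
    return updated


# Hypothetical two-dimensional node features and links.
features = {
    "elephant": [1.0, 0.0],
    "trunk": [0.0, 1.0],
    "mammal": [1.0, 1.0],
}
edges = [("elephant", "trunk"), ("elephant", "mammal")]
updated = message_passing_step(features, edges)
```

After one round, "trunk" becomes `[0.5, 0.5]` because it is averaged with "elephant", showing how information from linked nodes flows into each node's representation.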

[Coupling layer 14]
The coupling layer 14 couples the utterance feature vector and the knowledge feature vector to generate a coupled utterance feature vector. The generated coupled utterance feature vector is input to the encoder side of the encoder-decoder 16.
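One common way to join two feature vectors is simple concatenation, sketched below. Whether the layer in the specification concatenates or combines the vectors in some other way is not stated, so concatenation, the example values, and the name `couple` are assumptions for illustration.

```python
from typing import List, Sequence


def couple(utterance_vec: Sequence[float], knowledge_vec: Sequence[float]) -> List[float]:
    """Concatenate the utterance and knowledge feature vectors into one vector."""
    return list(utterance_vec) + list(knowledge_vec)


# Hypothetical low-dimensional vectors; real embeddings would be much longer.
coupled = couple([0.1, 0.2, 0.3], [0.9, 0.8])
```

Under this assumption, the dimension of the joined vector is the sum of the two input dimensions, which fixes the input size expected by the encoder side.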

[Response feature vector generation unit 15]
The response feature vector generation unit 15 generates a response feature vector from the response sentence and the response target image of the teacher data corresponding to the utterance sentence of the teacher data.
The unit receives the response sentence and the response target image, and generates a response feature vector from each. The generated response feature vectors are input to the decoder side of the encoder-decoder 16.
As with the utterance sentence, the response feature vector for the response sentence is obtained, specifically, by applying a distributed representation generation algorithm (embedding) such as BERT (registered trademark) or GPT-2 (registered trademark) to convert the sentence into a high-dimensional vector. The response feature vector for the response target image is obtained, specifically, by applying VisualBERT (registered trademark).

BERT (Bidirectional Encoder Representations from Transformers) is a natural language processing model from Google (registered trademark) that produces bidirectionally learned encoder representations using the Transformer architecture. For images, there is VisualBERT. BERT is a pre-trained model that learns by processing unlabeled feature vectors (distributed representations) with a Transformer. Rather than merely predicting the next word in running text, it predicts masked words from the surrounding context in both directions. In this way it learns the contextual information associated with each word.
GPT-2 (Generative Pre-Training 2) is a model from OpenAI. When such a model is trained on pixels instead of natural language, it can predict the second half of an image (or the whole image) from a sequence covering the first half (or a part) of the image, much as a human would intuitively imagine it.

Here, the combined utterance feature vector (the utterance feature vector and the knowledge feature vector) and the response feature vector implicitly involve an attention mechanism. The attention mechanism keeps natural language processing from giving excessive priority to how natural a sentence sounds. It thereby identifies the words and phrases that deserve emphasis, which enables appropriate natural language processing in the encoder-decoder 16.
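The specification does not give a formula for the attention mechanism. As one plausible reading, the sketch below shows scaled dot-product attention as used in the Transformer, in which each word receives a weight reflecting how strongly it should be emphasized. The query, key, and value vectors are hypothetical values chosen for the example.

```python
import math

def softmax(scores):
    """Convert raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output

# Three hypothetical word vectors serving as keys and values.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
weights, output = attention([2.0, 0.0], keys, values)
```

Words whose key vectors align with the query receive larger weights, so they contribute more to the output vector.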

[Encoder-Decoder 16]
The encoder-decoder 16 is trained to receive the combined utterance feature vector as input and to output the response feature vector.
In the encoder-decoder 16, the encoder receives the combined utterance feature vector, which is based on the utterance sentence (and the utterance target image) and the knowledge graph, and outputs a latent vector. The decoder, in turn, receives the latent vector output from the encoder and outputs the response feature vector.
Here, the encoder-decoder 16 is trained so as to minimize the loss between the response feature vector output from the encoder-decoder 16 and the response feature vector generated by the response feature vector generation unit 15.
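The loss-minimizing training described above can be illustrated with a deliberately small model. The sketch below assumes a linear encoder and a linear decoder, a mean squared error loss, and plain gradient descent on a single training pair; none of these choices is specified in the text, and all weights, vectors, and names are hypothetical.

```python
def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def mse(pred, target):
    """Mean squared error between two vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)

def train_step(enc, dec, x, target, lr=0.1):
    """One gradient-descent step; returns the loss measured before the update."""
    latent = matvec(enc, x)      # encoder: combined vector -> latent vector
    pred = matvec(dec, latent)   # decoder: latent vector -> response vector
    n = len(target)
    err = [2.0 * (p - t) / n for p, t in zip(pred, target)]  # dLoss/dpred
    # Gradient w.r.t. the latent vector, taken before the decoder is updated.
    grad_latent = [
        sum(dec[o][h] * err[o] for o in range(len(dec))) for h in range(len(latent))
    ]
    for o in range(len(dec)):
        for h in range(len(latent)):
            dec[o][h] -= lr * err[o] * latent[h]
    for h in range(len(enc)):
        for i in range(len(x)):
            enc[h][i] -= lr * grad_latent[h] * x[i]
    return mse(pred, target)

enc = [[0.1, 0.2, 0.0], [0.0, 0.1, 0.3]]  # 3-dim input -> 2-dim latent
dec = [[0.2, 0.1], [0.1, 0.3]]            # 2-dim latent -> 2-dim output
x = [1.0, 0.5, -0.5]                      # hypothetical combined utterance vector
target = [0.5, -0.2]                      # hypothetical response feature vector
losses = [train_step(enc, dec, x, target) for _ in range(200)]
```

The recorded losses shrink as the encoder and decoder weights are adjusted, which is the behavior the training objective calls for. A Transformer-based encoder-decoder would follow the same loop with a far larger model and many training pairs.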

The encoder-decoder 16 may be based on the Transformer. As described above, the encoder-decoder 16 is trained cross-modally on unlabeled language-based feature vectors and image-based feature vectors. This amounts to training the associations among the utterance sentence (and utterance target image), the response sentence and response target image, and the knowledge graph.

<During dialogue>
FIG. 6 is a functional configuration diagram of the dialogue device of the present invention during dialogue.
According to FIG. 6, the functional configuration of the dialogue device 1 during dialogue is the same as the functional configuration during training described above with reference to FIG. 1.

The dialogue device 1 further has a communication interface 102, receives an utterance sentence (and an utterance target image) from the terminal 2 serving as a user interface, and transmits a response sentence and a response target image to the terminal 2.
The communication interface 102 has a speech recognition function for the user's spoken utterances and a speech synthesis function for response sentences to the user. The speech recognition function converts the user's speech, captured by the microphone of the terminal 2, into a text-based utterance sentence. The speech synthesis function converts the generated response sentence into a speech signal. The history of these pairs of utterance sentences and response sentences forms the dialogue history.
The speech recognition function and the speech synthesis function may instead be installed in the terminal 2. In that case, the dialogue device 1 receives a text-based "utterance sentence" from the terminal 2 and transmits a "response sentence" to the terminal 2.

During training in FIG. 1 described above, teacher data is processed, whereas during dialogue in FIG. 6, target data received in real time through the communication interface 102 is processed.

The terminal 2 is equipped with devices capable of acquiring multimodal information from the user and presenting multimodal information to the user. It has at least a display for showing images to the user, a microphone capable of picking up the user's speech, and a camera capable of capturing the image the user is viewing. Examples of such a terminal 2 include robots such as "SOTA (registered trademark)" and "Unibo (registered trademark)" (hereinafter referred to as "terminals"). The terminal 2 may also be a tablet such as "Google Home (registered trademark)" or "Amazon Echo (registered trademark)" equipped with a display, a microphone, and a camera.

According to FIG. 6, the utterance feature vector generation unit 11 receives the user's utterance sentence (and utterance target image) and outputs an utterance feature vector to the encoder-decoder 16.
The utterance feature vector generation unit 11 generates an utterance feature vector from the utterance sentence of the target data and inputs that utterance feature vector to the connection layer 14.
The knowledge graph search unit 12 detects one or more utterance entity words contained in the utterance sentence of the target data and uses the knowledge graph to retrieve entity words linked to those utterance entity words by relational words.
The knowledge feature vector generation unit 13 generates a knowledge feature vector from the entity words and relational words retrieved by the knowledge graph search unit 12 and inputs that knowledge feature vector to the connection layer 14.
The connection layer 14 combines the utterance feature vector and the knowledge feature vector to generate a combined utterance feature vector and outputs it to the encoder side of the encoder-decoder 16.
The encoder-decoder 16 receives the combined utterance feature vector and outputs a response feature vector to the response feature vector generation unit 15.
The response feature vector generation unit 15 receives the response feature vector, generates a response sentence and a response target image, and transmits them to the terminal 2 through the communication interface 102.
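The dialogue-time data flow listed above can be summarized as a chain of function calls. The sketch below only fixes the order in which data passes between the units; every component is a stand-in supplied by the caller, and the dictionary keys, the function name `respond`, and the stub behaviors are hypothetical.

```python
def respond(utterance, components):
    """Run the dialogue-time data flow; each component is a caller-supplied callable."""
    utterance_vec = components["embed_utterance"](utterance)   # unit 11
    links = components["search_graph"](utterance)              # unit 12
    knowledge_vec = components["embed_knowledge"](links)       # unit 13
    combined_vec = components["combine"](utterance_vec, knowledge_vec)  # layer 14
    response_vec = components["encode_decode"](combined_vec)   # encoder-decoder 16
    return components["decode_response"](response_vec)         # unit 15

# Stand-in components, purely illustrative; real units would be trained models.
stubs = {
    "embed_utterance": lambda text: [float(len(text))],
    "search_graph": lambda text: (
        [("elephant", "is_a", "mammal")] if "elephant" in text else []
    ),
    "embed_knowledge": lambda links: [float(len(links))],
    "combine": lambda u, k: u + k,
    "encode_decode": lambda vec: [sum(vec)],
    "decode_response": lambda vec: ("response text", "response image", vec),
}
result = respond("This elephant is cute", stubs)
```

Keeping each unit behind its own callable mirrors the functional blocks of FIG. 6 and makes it possible to swap any one of them without touching the others.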

FIG. 7 is an explanatory diagram showing a first dialogue example in the present invention.

According to FIG. 7, suppose, for example, that the user is chatting with the dialogue device 1 while watching television. The dialogue device 1 may recognize the television picture that the user is watching.
For example, the dialogue proceeds as follows.
・・・・・・・・・・・・・・・・・・・・
S: There is a wildlife program on right now. (wildlife footage, e.g., on television)
U: This elephant is cute.
S: They are parent and child, aren't they?
U: By the way, what is the origin of "象" (elephant)?
S: It is said to be "a pictograph modeled on the shape of the elephants that once lived in ancient China as well." (image of the pictograph)
U: I see.

According to FIG. 7, the dialogue device 1 receives the utterance sentence "the origin of 象 (elephant)" from the user. From the utterance feature vector generated from that utterance sentence and the knowledge feature vector for the knowledge graph containing "elephant" and "origin", the dialogue device 1 can output the response sentence "a pictograph modeled on the shape of the elephants that once lived in ancient China as well" and the response target image "pictograph". The pictograph and its image are knowledge that cannot be obtained from the history of past dialogues between users. The dialogue device 1 can carry on a casual conversation about such knowledge as well.

FIG. 8 is an explanatory diagram showing a second dialogue example in the present invention.

According to FIG. 8, suppose, for example, that the user is chatting with the dialogue device 1 while driving a car. The scene in the user's line of sight is captured by the camera of the terminal 2, and that image is transmitted to the dialogue device 1 as the utterance target image.
For example, the dialogue proceeds as follows.
・・・・・・・・・・・・・・・・・・・・
U: Why is this road called "the road leading to heaven"? (image of the scene in the line of sight)
S: Because it "looks as if it continues to the horizon." (image of the road)

According to FIG. 8, the dialogue device 1 receives the utterance sentence "the road leading to heaven" and the utterance target image from the user. From the utterance feature vector generated from that utterance sentence and utterance target image, and the knowledge feature vector for the knowledge graph containing "the road leading to heaven", the dialogue device 1 can output the response sentence "looks as if it continues to the horizon" and the response target image (the road leading to heaven). The origin and images of the road leading to heaven are knowledge that cannot be obtained from the history of past dialogues between users. The dialogue device 1 can carry on a casual conversation about such knowledge as well.

As described in detail above, the dialogue program, device, and method of the present invention make it possible to converse in a small-talk style using a multimodal knowledge graph.

The prior art of Non-Patent Documents 1, 2, and 4 has the problem that "a multimodal chat dialogue including images cannot be developed." In contrast, the present invention uses a knowledge graph that includes images and can therefore realize chat dialogue that is multimodal rather than text-only.
The prior art of Non-Patent Documents 1 and 3 has the problem of being "limited to task-oriented dialogue." In contrast, the present invention generates the response sentence and the response target image with a deep learning model, so it is not limited to a specific task and can realize natural chat dialogue.
Furthermore, the prior art of Non-Patent Documents 1 and 4 has the problem of being "limited to conceptual knowledge graphs." In contrast, the present invention constructs a knowledge graph based on topics and stores in it all information and images that are highly relevant to those topics. Using such a knowledge graph, multimodal chat responses centered on a topic can be expected. This makes it possible to realize chat dialogue that draws on rich knowledge.

Accordingly, because, for example, "user support and business contact can be carried out through chat-style dialogue using a multimodal knowledge graph," the present invention can contribute to Goal 8 of the United Nations-led Sustainable Development Goals (SDGs), "Promote inclusive and sustainable economic growth, employment and decent work for all."

With respect to the various embodiments of the present invention described above, those skilled in the art can readily make various changes, modifications, and omissions within the scope of the technical idea and viewpoint of the present invention. The foregoing description is merely an example and is not intended to be limiting in any way. The present invention is limited only by the claims and their equivalents.

1 Dialogue device
100 Dialogue history storage unit
101 Knowledge graph storage unit
102 Communication interface
11 Utterance feature vector generation unit
12 Knowledge graph search unit
13 Knowledge feature vector generation unit
14 Connection layer
15 Response feature vector generation unit
16 Encoder-decoder
2 Terminal

Claims

1. A program that causes a computer to function so as to interact with a user, the program
using, as teacher data, a dialogue history including a plurality of sets of an utterance sentence, a response sentence, and a response target image, and
using a multimodal knowledge graph in which entity words associated with entity target images are linked by relational words,
the program causing the computer to function, during training, as:
utterance feature vector generation means for generating an utterance feature vector from an utterance sentence of the teacher data;
knowledge graph search means for detecting one or more utterance entity words contained in the utterance sentence of the teacher data and searching, using the knowledge graph, for entity words linked to the utterance entity words by relational words;
knowledge feature vector generation means for generating a knowledge feature vector from the entity words, the entity target images, and the relational words;
a connection layer that combines the utterance feature vector and the knowledge feature vector to generate a combined utterance feature vector;
response feature vector generation means for generating a response feature vector from the response sentence of the teacher data corresponding to the utterance sentence of the teacher data and from the response target image; and
an encoder-decoder trained to receive the combined utterance feature vector as input and to output the response feature vector.

2. The program according to claim 1, wherein knowledge graph storage means retrieves images from a search site using the entity words and relational words in the knowledge graph as keys, and the computer is caused to function so that the retrieved images are associated with the entity words.

3. The program according to claim 1, wherein, during dialogue,
an utterance sentence serving as target data is input,
the utterance feature vector generation means generates an utterance feature vector from the utterance sentence of the target data,
the knowledge graph search means detects one or more utterance entity words contained in the utterance sentence of the target data and retrieves, using the knowledge graph, the entity words and entity target images linked to the utterance entity words by relational words,
the knowledge feature vector generation means generates a knowledge feature vector from the entity words, entity target images, and relational words retrieved by the knowledge graph search means,
the encoder-decoder receives the combined utterance feature vector and outputs a response feature vector, and
the response feature vector generation means receives the response feature vector, and the computer is caused to function so as to output a response sentence and a response target image.

4. The program according to any one of claims 1 to 3, causing a computer to function such that an utterance target image is associated with an utterance sentence.

5. The program according to any one of claims 1 to 4, wherein the knowledge graph search means causes the computer to function so as to retrieve, using the knowledge graph, the entity words and entity target images linked by relational words within a predetermined number of hops, being one or more, from the utterance entity word.

6. The program according to any one of claims 1 to 5, wherein the encoder-decoder causes the computer to function so as to train such that the loss between the response feature vector output from the encoder-decoder and the response feature vector generated by the response feature vector generation means is minimized.

7. The program according to any one of claims 1 to 6, wherein the knowledge feature vector generation means causes the computer to function as a GNN (Graph Neural Network).

8. A dialogue device that interacts with a user, the dialogue device
using, as teacher data, a dialogue history including a plurality of sets of an utterance sentence, a response sentence, and a response target image, and
using a multimodal knowledge graph in which entity words associated with entity target images are linked by relational words,
the dialogue device comprising, during training:
utterance feature vector generation means for generating an utterance feature vector from an utterance sentence of the teacher data;
knowledge graph search means for detecting one or more utterance entity words contained in the utterance sentence of the teacher data and searching, using the knowledge graph, for entity words linked to the utterance entity words by relational words;
knowledge feature vector generation means for generating a knowledge feature vector from the entity words, the entity target images, and the relational words;
a connection layer that combines the utterance feature vector and the knowledge feature vector to generate a combined utterance feature vector;
response feature vector generation means for generating a response feature vector from the response sentence of the teacher data corresponding to the utterance sentence of the teacher data and from the response target image; and
an encoder-decoder trained to receive the combined utterance feature vector as input and to output the response feature vector.

9. A dialogue method for a device that interacts with a user, the method
using, as teacher data, a dialogue history including a plurality of sets of an utterance sentence, a response sentence, and a response target image, and
using a multimodal knowledge graph in which entity words associated with entity target images are linked by relational words,
the device executing, during training:
a first step of generating an utterance feature vector from an utterance sentence of the teacher data;
a second step of detecting one or more utterance entity words contained in the utterance sentence of the teacher data and searching, using the knowledge graph, for entity words linked to the utterance entity words by relational words;
a third step of generating a knowledge feature vector from the entity words, the entity target images, and the relational words;
a fourth step of combining the utterance feature vector and the knowledge feature vector to generate a combined utterance feature vector;
a fifth step of generating a response feature vector from the response sentence of the teacher data corresponding to the utterance sentence of the teacher data and from the response target image; and
a sixth step of training an encoder-decoder to receive the combined utterance feature vector as input and to output the response feature vector.