JP7426919B2

JP7426919B2 - Program, device and method for estimating causal terms from images

Info

Publication number: JP7426919B2
Application number: JP2020183065A
Authority: JP
Inventors: 博楊; 剣明呉; 元服部
Original assignee: KDDI Corp
Current assignee: KDDI Corp
Priority date: 2020-10-30
Filing date: 2020-10-30
Publication date: 2024-02-02
Anticipated expiration: 2040-10-30
Also published as: JP2022073219A

Description

The present invention relates to a technique for estimating, for an image, causal relation words consisting of a cause word and a result word. The technique can be applied to realizing natural dialogue between a dialogue agent and a user who is viewing a video.

Dialogue systems that interact with users are generally text-based. The terminal functions as a user interface and transmits the user's uttered speech to the dialogue system. The dialogue system estimates a response sentence that forms a natural dialogue with the uttered sentence and returns it to the terminal. The terminal then presents the response sentence to the user as speech or text. Examples of such dialogue systems include "Siri (registered trademark)" and "Shabette Concierge (registered trademark)".

In recent years, dialogue system technology that responds to multimodal information surrounding the user (video, images, captions, subtitles, audio, natural language text, and so on) has been anticipated. With this technology, a natural response sentence to a user's uttered sentence can be estimated according to the various multimodal information surrounding the user. In particular, AI (Artificial Intelligence) can be used to hold a natural dialogue that follows the surrounding context, such as a TV program, movie, or online video, which is expected to increase the user's motivation to converse.

Conventionally, there is dialogue system technology in which a user and a robot converse based on the content of the video the user is viewing (see, for example, Non-Patent Document 1). With this technology, by inputting video with audio and subtitles, the system can answer a user's question with a response sentence that reflects the video the user is viewing.

There is also a technology that learns feature vectors of video with audio and subtitles and generates a response sentence to the immediately preceding question (see, for example, Non-Patent Document 2). With this technology, the dialogue system fine-tunes the pre-trained model GPT-2 (registered trademark) and can thereby improve the accuracy of response sentences conditioned on multimodal information.

Furthermore, there is a technology that generates a response sentence having a causal relationship with an uttered sentence in order to realize natural dialogue (see, for example, Non-Patent Document 3). With this technology, a dictionary of word pairs having a causal relationship is built in advance, and a response sentence having a causal relationship with the uttered sentence is preferentially selected (re-ranking response generation). Specifically, a word in the user's uttered sentence and a word in the response sentence are paired and looked up in the word-pair dictionary. When the pair matches an entry, a causal relationship is judged to exist and that response sentence is preferentially selected.
For example, causal relation words can be extracted and learned from the following sentence:
"Because the yen has weakened, Japan's economy can be expected to improve from a trade perspective."
: Causal relation words {(yen weakens) -> (economy rises)}
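
The dictionary lookup described above can be illustrated with a small sketch. This is not the implementation of Non-Patent Document 3; the dictionary contents, the function names `has_causal_pair` and `rerank`, and the use of plain substring matching are all assumptions made for illustration.

```python
# Hypothetical causal word-pair dictionary: (cause word, result word).
CAUSAL_PAIRS = {("yen weakens", "economy rises")}

def has_causal_pair(utterance, response, pairs=CAUSAL_PAIRS):
    """True if some pair matches exactly: the cause word appears in the
    utterance and the result word appears in the response."""
    return any(c in utterance and r in response for c, r in pairs)

def rerank(utterance, candidates, pairs=CAUSAL_PAIRS):
    """Stable sort that moves candidates with a matching pair to the front."""
    return sorted(candidates,
                  key=lambda resp: not has_causal_pair(utterance, resp, pairs))

if __name__ == "__main__":
    candidates = ["nice weather", "then the economy rises"]
    print(rerank("the yen weakens again", candidates))
    # A paraphrased utterance no longer matches, so the order is unchanged.
    print(rerank("the yen got cheaper", candidates))
```

Because the lookup relies on exact matches, a paraphrase of the cause word leaves the candidate order untouched, which is the limitation the description goes on to discuss.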

Furthermore, there is also a technology that creates a learning model using only dialogue data (uttered sentences and response sentences) having a causal relationship, taken from a large-scale dialogue corpus (see, for example, Non-Patent Document 4).

Non-Patent Document 1: Hung Le, Doyen Sahoo, Nancy F. Chen, Steven C.H. Hoi, "Multimodal Transformer Networks for End-to-End Video-Grounded Dialogue Systems" (2019), [online], [retrieved October 11, 2020], Internet <URL: https://arxiv.org/abs/1907.01166>
Non-Patent Document 2: Hung Le, Steven C.H. Hoi, "Video-Grounded Dialogues with Pretrained Generation Language Models" (2020), [online], [retrieved October 11, 2020], Internet <URL: https://www.aclweb.org/anthology/2020.acl-main.518/>
Non-Patent Document 3: Shohei Tanaka, Koichiro Yoshino, Katsuhito Sudoh, and Satoshi Nakamura, "Conversational Response Re-ranking Based on Event Causality and Role Factored Tensor Event Embedding", 1st Workshop NLP for Conversational AI, ACL 2019 Workshop (ConvAI)
Non-Patent Document 4: Shota Sato, Kentaro Inui, "Chat response learning using data sampling based on causal relationships", Proceedings of the 24th Annual Conference on Language Processing (March 2018)
Non-Patent Document 5: "Explanation of the Transformer paper, a major premise of the deep learning world!", [online], [retrieved October 11, 2020], Internet <URL: https://qiita.com/omiita/items/07e69aef6c156d23c538>

However, although the technologies described in Non-Patent Documents 1 and 2 use multimodal information such as video with audio and subtitles, they merely generate a response sentence to the user's immediately preceding uttered sentence. It is therefore difficult for them to generate natural dialogue, such as small talk, on open-domain topics beyond the uttered sentence and response sentence. In the end, they capture nothing more than the relationship between the user's immediately preceding question (uttered sentence) and its answer (response sentence).

In addition, the technology described in Non-Patent Document 3 matches only word pairs at training time and therefore takes no account of contextual features outside the word pair. In the example above, the causal relation words {(yen weakens) -> (economy rises)} contain none of the qualifying features such as "from a trade perspective" or "Japan". There is also the problem that, in operation, re-ranking cannot be performed unless the pre-learned causal relation words match the actual user's uttered sentence exactly.

Furthermore, because the technology described in Non-Patent Document 4 creates its learning model using only dialogue data (uttered sentences and response sentences) having a causal relationship, it depends too heavily on the dialogue data used as training data. There is also the problem that the response sentences to user utterances in that training data lack diversity and generality.

In contrast, the inventors of the present application considered that, by using video with subtitles as multimodal information and training with the subtitles associated with the video, linguistic features corresponding to the video could be extracted. They further considered that, when multiple candidate response sentences are estimated for a user's uttered sentence, the response sentence that matches the linguistic features of the video could then be selected.
To realize this, they considered that if causal relation words (a cause word and a result word) can at least be estimated from an image (a frame in the video), a response sentence that accords with those causal relation words can be returned.

Accordingly, an object of the present invention is to provide a program, a device, and a method capable of estimating causal relation words (a cause word and a result word) from an image. A further object is to converse with the user as naturally as possible by estimating causal relation words from multimodal information surrounding the user and replying to the user's uttered sentence with a response sentence that accords with the causal relationship.

According to the present invention, there is provided a program that causes a computer to function so as to estimate, from an image, causal relation words consisting of a cause word and a result word, wherein
as training data, an image is associated with a subtitle sentence linked to that image,
in the training stage, the program causes the computer to function as
a causal relationship learning engine that receives feature vectors of subtitle sentences and estimates a cause word and a result word from each subtitle sentence estimated to have a causal relationship, and
an image learning engine that receives the feature vector of an image and is trained to output the cause word and result word estimated by the causal relationship learning engine, and
in the estimation stage,
the image learning engine receives an image as target data and outputs a cause word and a result word.
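
The data flow between the two engines can be sketched as follows. This is only an illustration of the training-stage and estimation-stage flow, not the claimed implementation: the function `caption_engine`, the class `ImageEngine`, the marker word " so ", and the use of a 1-nearest-neighbour lookup in place of a trained neural network are all assumptions introduced here.

```python
import math

def caption_engine(caption):
    """Stand-in for the causal relationship learning engine.
    Returns (cause, result) for a caption judged causal, else None.
    Assumption: a caption is 'causal' only if it contains ' so '."""
    if " so " not in caption:
        return None
    cause, result = caption.split(" so ", 1)
    return cause.strip(), result.strip()

class ImageEngine:
    """Stand-in for the image learning engine (1-nearest-neighbour)."""

    def __init__(self):
        self.memory = []  # list of (image_vector, (cause, result))

    def train(self, pairs):
        """pairs: iterable of (image_vector, caption) training data."""
        for vec, caption in pairs:
            label = caption_engine(caption)
            if label is not None:  # captions without causality are skipped
                self.memory.append((vec, label))

    def estimate(self, vec):
        """Return the (cause, result) stored for the closest training vector."""
        if not self.memory:
            raise ValueError("engine has not been trained")
        return min(self.memory, key=lambda m: math.dist(m[0], vec))[1]

if __name__ == "__main__":
    engine = ImageEngine()
    engine.train([
        ([0.0, 1.0], "I cut my hand so it bled"),
        ([1.0, 0.0], "a sunny day"),
        ([5.0, 5.0], "it rained so the road is wet"),
    ])
    print(engine.estimate([0.1, 0.9]))
```

The point of the sketch is that the image side never sees hand-written labels: its targets are whatever the caption side extracted, and at estimation time only the image vector is needed.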

According to another embodiment of the program of the present invention, it is also preferable that,
in the training stage,
the image learning engine is configured as a generative adversarial network and is trained as
a generator that receives the feature vector of an image, and
a discriminator that receives the cause word and result word output from the generator and the cause word and result word output from the causal relationship estimation means.

According to another embodiment of the program of the present invention, it is also preferable that
the generator is based on a Transformer, and
the discriminator is based on a classification-type convolutional neural network.

According to another embodiment of the program of the present invention, it is also preferable that
the causal relationship learning engine causes the computer to function, at training time, as
subtitle sentence selection means in which conjunctive particles that connect the preceding and following parts of a sentence in a causal relationship are registered in advance, and which receives the subtitle sentences of the training data and selects the subtitle sentences containing a conjunctive particle, and
causal relation word estimation means that is trained as a multi-task deep learning model such that, when a selected subtitle sentence is input to the input layer, the cause word is output from a first output layer and the result word is output from a second output layer.

According to another embodiment of the program of the present invention, it is also preferable that
the causal relation word estimation means causes the computer to function as
an input layer,
an embedding layer,
a first recurrent network layer, a first discrimination layer, and a first output layer branching from the embedding layer, and
a second recurrent network layer, a second discrimination layer, and a second output layer branching from the embedding layer.

According to another embodiment of the program of the present invention, it is also preferable that
the feature vectors are generated by a distributed representation generation algorithm.

According to another embodiment of the program of the present invention, it is also preferable that,
at training time,
the training data consists of a video and a series of dialogue history containing multiple pairs of uttered sentences and response sentences exchanged between persons viewing the video, and the program causes the computer to function as
multimodal information extraction means that extracts an image from the video for each pair of uttered sentence and response sentence in the dialogue history, and
a response sentence estimation engine that is trained, for each pair of uttered sentence and response sentence, to receive the uttered sentence on the encoder side and output the response sentence from the decoder side, and,
at estimation time,
the response sentence estimation engine receives the user's uttered sentence and outputs multiple candidate response sentences, and the program causes the computer to function as
response sentence re-ranking means that selects, from among the multiple candidate response sentences, a response sentence that contains or is similar to the result word output by the image learning engine.
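
The re-ranking step can be sketched in a few lines. The text only says that the selected response "contains or is similar to" the result word; the function name `rerank_by_result_word`, the rule that containment outranks similarity, and the use of a character-level ratio from Python's `difflib` as the similarity measure are assumptions made for this illustration.

```python
from difflib import SequenceMatcher

def rerank_by_result_word(candidates, result_word):
    """Pick the candidate that contains result_word; otherwise the one
    whose text is most similar to it (character-level ratio)."""
    def score(resp):
        if result_word in resp:
            return 2.0  # containment outranks any similarity ratio (<= 1.0)
        return SequenceMatcher(None, result_word, resp).ratio()
    return max(candidates, key=score)

if __name__ == "__main__":
    candidates = ["That looks painful, it bled a lot", "What a nice day"]
    print(rerank_by_result_word(candidates, "bled"))
```

A real system would more plausibly compare embedding vectors than characters; the string ratio is used here only to keep the sketch self-contained.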

According to another embodiment of the program of the present invention, it is also preferable that
the response sentence estimation engine is a Seq2Seq model capable of extracting features between general-purpose uttered sentences and response sentences.

According to the present invention, there is provided an estimation device that estimates, from an image, causal relation words consisting of a cause word and a result word, wherein
as training data, an image is associated with a subtitle sentence linked to that image,
for the training stage, the device has
a causal relationship learning engine that receives feature vectors of subtitle sentences and estimates a cause word and a result word from each subtitle sentence estimated to have a causal relationship, and
an image learning engine that receives the feature vector of an image and is trained to output the cause word and result word estimated by the causal relationship learning engine, and
in the estimation stage,
the image learning engine receives an image as target data and outputs a cause word and a result word.

According to the present invention, there is provided an estimation method for a device that estimates, from an image, causal relation words consisting of a cause word and a result word, wherein
as training data, an image is associated with a subtitle sentence linked to that image,
the device has, for the training stage,
a causal relationship learning engine that receives feature vectors of subtitle sentences and estimates a cause word and a result word from each subtitle sentence estimated to have a causal relationship, and
an image learning engine that receives the feature vector of an image and is trained to output the cause word and result word estimated by the causal relationship learning engine, and
in the estimation stage,
the image learning engine receives an image as target data and outputs a cause word and a result word.

According to the program, device, and method of the present invention, causal relation words (a cause word and a result word) can be estimated from an image. In addition, by estimating causal relation words from multimodal information surrounding the user and replying to the user's uttered sentence with a response sentence that accords with the causal relationship, the system can converse with the user as naturally as possible.

FIG. 1 is a functional configuration diagram of the estimation device at training time, for estimating causal relation words from an image.
FIG. 2 is an explanatory diagram of the subtitle sentence selection unit.
FIG. 3 is an explanatory diagram of the causal relation word estimation unit.
FIG. 4 is a functional configuration diagram of the estimation device at estimation time, for estimating causal relation words from an image.
FIG. 5 is a functional configuration diagram of the estimation device at training time, for estimating causal relation words from video with subtitles as training data.
FIG. 6 is a functional configuration diagram of the estimation device at estimation time, for estimating causal relation words from a video.
FIG. 7 is a functional configuration diagram of the estimation device at estimation time, which replies with a response sentence to an uttered sentence according to a video.
FIG. 8 is an explanatory diagram showing the flow of text as a specific example for FIG. 7.
FIG. 9 is an explanatory diagram of the response sentence estimation engine.

Embodiments of the present invention are described in detail below with reference to the drawings.

For the purposes of explanation, the drawings are grouped as follows.
First, FIGS. 1 to 4 describe an estimation device that estimates causal relation words from an image. FIGS. 1 to 3 are described under <training time>, and FIG. 4 under <estimation time>.
Next, FIGS. 5 and 6 describe an estimation device that estimates causal relation words from a video. FIG. 5 is described under <training time>, and FIG. 6 under <estimation time>.
Finally, FIGS. 7 to 9 describe an estimation device that replies with a response sentence to an uttered sentence according to a video.

<Estimation device that estimates causal relation words from an image: training time>
FIG. 1 is a functional configuration diagram of the estimation device at training time, for estimating causal relation words from an image.

According to FIG. 1, at training time the estimation device 1 has an image feature vector generation unit 101, a language feature vector generation unit 102, a causal relationship learning engine 11, and an image learning engine 12. These functional components are realized by executing a program that causes a computer mounted on the device to function. The processing flow of these functional components can also be understood as a training method for the estimation device.
Images and subtitle sentences are input to the estimation device 1 as training data. The training data must be supplied from outside at training time.

[Image feature vector generation unit 101]
The image feature vector generation unit 101 receives an image and generates an image feature vector (a random vector in latent space) from it. The image feature vector is output to the image learning engine 12.
Specifically, the image feature vector is a high-dimensional vector obtained by applying a distributed representation generation (embedding) algorithm such as VisualBERT (registered trademark).

[Language feature vector generation unit 102 (for subtitle sentences)]
The language feature vector generation unit 102 receives the subtitle sentence linked to an image, breaks it into morphemes by morphological analysis, and generates a language feature vector (a random vector in latent space) for each morpheme. The language feature vectors are output to the causal relationship learning engine 11.
Likewise, the language feature vector is a high-dimensional vector obtained by applying a distributed representation generation algorithm such as BERT (registered trademark) or GPT-2 (registered trademark).

BERT (Bidirectional Encoder Representations from Transformers) is an encoded representation obtained by bidirectional training with the Transformer architecture (see, for example, Non-Patent Document 3), and is a natural language processing model from Google (registered trademark). VideoBERT and VisualBERT exist for video and images. BERT is a Seq2seq-based pre-trained model that learns by processing unlabeled feature vectors (distributed representations) with a Transformer. It does not merely predict the next word in a running text; it predicts masked words bidirectionally from the surrounding context. In this way it learns the contextual information associated with each word.
GPT-2 (Generative Pre-Training 2) is based on OpenAI's work and learns from pixels in place of natural language. This makes it possible to predict, from the first half of a video sequence (or part of an image), the second half of the video (or the whole image) as a human would intuitively imagine it.

[Causal relationship learning engine 11]
The causal relationship learning engine 11 receives, from the language feature vector generation unit 102, the feature vectors of the subtitle sentences serving as training data, and estimates a cause word and a result word (causal relation words) from each subtitle sentence estimated to have a causal relationship.
According to FIG. 1, the causal relationship learning engine 11 has a subtitle sentence selection unit 111 and a causal relation word estimation unit 112.

[Subtitle sentence selection unit 111]
The subtitle sentence selection unit 111 has conjunctive particles registered in advance that connect the preceding and following parts of a sentence in a causal relationship. On that basis, it receives the subtitle sentences of the training data and selects the subtitle sentences that contain a conjunctive particle.

FIG. 2 is an explanatory diagram of the subtitle sentence selection unit.

According to FIG. 2, the subtitle sentence selection unit 111 is a classification-type neural network trained in advance on corpus data. Its operation is divided into training on corpus data and estimation, in which subtitle sentences are classified.

(At training time, on corpus data)
The subtitle sentence selection unit 111 receives corpus data and builds a deep learning model that comprehensively extracts the features of whole-sentence expressions that carry a causal relationship.
The corpus data is a large "corpus" of natural language sentences that have been structured and accumulated at scale on the Internet. One example is an encyclopedia such as Wikipedia (registered trademark), whose text is a body of well-formed natural language sentences. It may of course also be the text of natural language knowledge content on a website.

From the group of sentences contained in a large-scale corpus, training sentences containing a "conjunctive particle" registered in the conjunctive particle table are selected. A sentence containing a conjunctive particle often contains a cause word and a result word that stand in a causal relationship on either side of the particle.
The conjunctive particle table registers conjunctive particles that connect the preceding and following parts of a sentence in a causal relationship. A "conjunctive particle" is a particle that establishes a causal relationship between the preceding clause and the following clause, and serves as a clue to the causal relationship.
Examples include the following particles:
「～ため、～」 (because of ...), 「～から、～」 (since ...), 「～により、～」 (due to ...), 「～によって、～」 (by ...)
「～を背景に、～」 (against the background of ...), 「～を受け、～」 (in response to ...), 「～の結果、～」 (as a result of ...), 「～をきっかけに、～」 (triggered by ...)
「～の影響、～」 (the effect of ...), 「～の原因、～」 (the cause of ...), 「～を行うと、～」 (if ... is done), 「～すれば、～」 (if one does ...)
「～しないと、～」 (if one does not ...), 「～に伴い、～」 (along with ...), 「～を反映し、～」 (reflecting ...)

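The table-based selection described above can be sketched as a simple filter. The constant `CONJUNCTIVE_PARTICLES` holds only a few of the particles listed in the text, and the function name `select_causal_candidates` and the use of plain substring matching are assumptions for illustration.

```python
# A few entries from the conjunctive particle table in the text.
CONJUNCTIVE_PARTICLES = ["ため", "から", "により", "によって", "の結果", "に伴い"]

def select_causal_candidates(sentences, particles=CONJUNCTIVE_PARTICLES):
    """Keep sentences that contain at least one registered particle."""
    return [s for s in sentences if any(p in s for p in particles)]

if __name__ == "__main__":
    corpus = ["手を切ったため血が出た", "今日は晴れです"]
    print(select_causal_candidates(corpus))
```

Substring matching over-selects (for example, 「から」 also marks a starting point, as in "from Tokyo"), so a practical version would likely match particles on the output of a morphological analyzer instead.
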
Suppose the subtitle sentence selection unit 111 receives, for example, the following corpus data.
Corpus data: 「手を切った｛ため｝血が出た」 ("Because I cut my hand, it bled")
This corpus data is judged to contain the conjunctive particle 「ため」; the particle is deleted and the remaining parts are joined to create the following sentence.
Causal relation words: 「手を切った、血が出た」 ("I cut my hand, it bled")
These causal relation words are then used, via data preprocessing, a convolution layer, a pooling layer, a fully connected layer, and a discrimination layer, to train the model with the label "causal relationship present".
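
The particle-removal step in this example can be reproduced with a short sketch. The function name `strip_particle` and the choice to split at the first occurrence of the particle are assumptions; the input and output strings are the ones given in the text.

```python
def strip_particle(sentence, particle):
    """Split at the first occurrence of the particle and rejoin the two
    clauses with a Japanese comma, dropping the particle itself.
    Returns None when the particle is absent."""
    before, found, after = sentence.partition(particle)
    if not found:
        return None
    return before + "、" + after

if __name__ == "__main__":
    print(strip_particle("手を切ったため血が出た", "ため"))
```

Removing the particle before training means the classifier cannot simply key on the particle itself and has to rely on the surrounding clauses.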

(At estimation time, when classifying subtitle sentences)
When the subtitle sentence selection unit 111 receives a subtitle sentence whose causal relationship is unknown, it segments the sentence into morphemes by morphological analysis and outputs a classification result of causal relationship present or absent.
Suppose the subtitle sentence selection unit 111 receives from the language feature vector generation unit 102 the feature vectors of the following subtitle sentence.
Subtitle sentence: 「手を切って、血が出た」 ("I cut my hand and it bled")
This subtitle sentence is judged as "causal relationship present" through data preprocessing, the convolution layer, the pooling layer, the fully connected layer, and the discrimination layer.
The subtitle sentence selection unit 111 then outputs only the subtitle sentences estimated to have a causal relationship to the causal relation word estimation unit 112.
Subtitle sentence: 「手を切って、血が出た」 ("I cut my hand and it bled")

[Causal relation word estimation unit 112]
The causal relation word estimation unit 112 is trained as a multi-task deep learning model such that, when a selected subtitle sentence is input to the input layer, the cause word is output from the first output layer and the result word is output from the second output layer.
The causal relation word estimation unit 112 receives from the subtitle sentence selection unit 111 only the subtitle sentences estimated to have a causal relationship, and outputs the cause word and the result word (causal relation words) of each such subtitle sentence.

FIG. 3 is an explanatory diagram of the causal relation word estimation unit.

According to FIG. 3, the causal relation word estimation unit 112 branches into the following two series.
{(cause word) -> (result word)}
-> first recurrent network layer -> first discrimination layer -> cause word output layer
-> second recurrent network layer -> second discrimination layer -> result word output layer

The first recurrent network layer and the second recurrent network layer are the same RNN (Recurrent Neural Network). An RNN is a model that can learn temporal dependencies by taking the time-series data of the training sentences as input as-is. For example, an LSTM (Long Short-Term Memory) or a GRU (Gated Recurrent Unit) can be used as the RNN. An LSTM is an arrangement of multiple blocks, each consisting of a cell that retains the error internally to prevent vanishing gradients, and an input gate, an output gate, and a forget gate that retain and discard information at the required timing. A GRU is a simplified LSTM consisting of a reset gate and an update gate.
The first discrimination layer and the second discrimination layer are the same discriminator; the first discrimination layer identifies the cause word, and the second discrimination layer identifies the result word.
Finally, the cause word output layer outputs the cause word, and the result word output layer outputs the result word.
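The two-branch structure can be sketched in plain Python as a forward pass only. This is an illustration under stated assumptions, not the patented model: it uses a simple tanh recurrent cell instead of an LSTM or GRU, random untrained weights, a shared embedding table ahead of the branch point, and a softmax over a toy vocabulary as each discrimination layer, so each branch predicts a single token. All class and function names are hypothetical.

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is reproducible


def matrix(rows, cols):
    """Small random weight matrix (untrained, for illustration only)."""
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def softmax(z):
    peak = max(z)
    exps = [math.exp(x - peak) for x in z]
    total = sum(exps)
    return [e / total for e in exps]


class Branch:
    """One series: recurrent layer -> discrimination layer (softmax)."""

    def __init__(self, embed_dim, hidden_dim, vocab_size):
        self.w_in = matrix(hidden_dim, embed_dim)
        self.w_rec = matrix(hidden_dim, hidden_dim)
        self.w_out = matrix(vocab_size, hidden_dim)
        self.hidden_dim = hidden_dim

    def forward(self, embedded):
        h = [0.0] * self.hidden_dim
        for x in embedded:  # consume the morpheme sequence in order
            a = matvec(self.w_in, x)
            b = matvec(self.w_rec, h)
            h = [math.tanh(p + q) for p, q in zip(a, b)]
        return softmax(matvec(self.w_out, h))


class CausalWordEstimator:
    """Shared embedding, then separate cause and result branches."""

    def __init__(self, vocab_size, embed_dim=4, hidden_dim=5):
        self.embedding = matrix(vocab_size, embed_dim)
        self.cause_branch = Branch(embed_dim, hidden_dim, vocab_size)
        self.result_branch = Branch(embed_dim, hidden_dim, vocab_size)

    def forward(self, token_ids):
        embedded = [self.embedding[t] for t in token_ids]
        return (self.cause_branch.forward(embedded),
                self.result_branch.forward(embedded))


model = CausalWordEstimator(vocab_size=6)
cause_probs, result_probs = model.forward([0, 3, 2, 5])
```

In a multi-task setup of this kind, training would typically minimize the sum of the two branches' losses so that the shared layers serve both outputs; the patent text here does not specify the loss.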

[Image learning engine 12]
The image learning engine 12 is trained to take the feature vector of an image as input and to output the cause word and result word estimated by the causal relationship learning engine 11.
The image learning engine 12 is configured as, for example, a GAN (generative adversarial network), and consists of a generator and a discriminator.
The generator 121 takes the feature vector of an image as input and outputs a cause word and a result word. The generator is based on the Transformer.
The discriminator 122 receives the cause word and result word output from the generator 121 and the cause word and result word output from the causal relation word estimation unit 112. The discriminator is based on a classification-type convolutional neural network. The discriminator 122 predicts whether the cause word and result word output from the generator 121 are real or fake, and outputs the prediction. The prediction result of the discriminator 122 is fed back to the generator 121. As a result, the generator 121 learns to generate fake cause and result words that the discriminator 122 misjudges, while the discriminator 122 learns to tell those fake cause and result words apart.
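The feedback loop between the generator and the discriminator can be illustrated with the standard binary cross-entropy GAN objective. The patent does not state which loss is used, so this choice is an assumption. Here `p_real` is the discriminator's probability that a word pair from the causal relation word estimation unit 112 is real, and `p_fake` is its probability that a word pair from the generator 121 is real; both names, and the function names, are hypothetical.

```python
import math


def discriminator_loss(p_real, p_fake):
    """Binary cross-entropy: real pairs should score 1, generated pairs 0."""
    return -(math.log(p_real) + math.log(1.0 - p_fake))


def generator_loss(p_fake):
    """Non-saturating generator loss: low when the discriminator is fooled."""
    return -math.log(p_fake)


# A discriminator that separates real from generated pairs has a low loss ...
confident_d = discriminator_loss(p_real=0.9, p_fake=0.1)
# ... while one that cannot tell them apart has a higher loss.
confused_d = discriminator_loss(p_real=0.5, p_fake=0.5)

# The generator's loss falls as its outputs are more often judged real.
fooled = generator_loss(p_fake=0.9)
caught = generator_loss(p_fake=0.1)
```

Alternating updates that lower each loss in turn reproduce the behaviour described above: the generator improves at producing word pairs the discriminator misjudges, and the discriminator improves at detecting them.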

<Estimation device that estimates causal relation words from an image <at estimation time>>
FIG. 4 is a functional configuration diagram of the estimation device at estimation time, when estimating causal relation words from an image.

According to FIG. 4, the estimation device consists of the image feature vector generation unit 101 and the generator 121 of the image learning engine 12.
The image feature vector generation unit 101 takes an arbitrary image as input and generates an image feature vector from it. The image feature vector is output to the generator 121 of the image learning engine 12.
The generator 121 takes the feature vector of the image as input and outputs a cause word and a result word (causal relation words).
According to FIG. 4, the following causal relation words are output from an arbitrary image.
Causal relation words: "cut one's hand, bleed" (「手を切る、血が出る」)

<Estimation device that estimates causal relation words from video <at training time>>
FIG. 5 is a functional configuration diagram of the estimation device at training time, when estimating causal relation words from subtitled video serving as training data.

The estimation device 1 realizes a natural dialogue with a user and generates a response sentence to the user's utterance.
According to FIG. 5, the estimation device 1, which has a server function, communicates with a terminal 2, which has a user interface function. As an input/output device for the user, the terminal 2 may acquire the user's voice through a microphone and speak to the user through a speaker, or it may accept a text-based utterance from the user and display the response sentence.
The voice recognition function may be installed in either the estimation device 1 or the terminal 2.

According to FIG. 5, compared with FIG. 1, the estimation device further includes a multimodal information extraction unit 103 and receives training data at training time. Here, the training data is "video with subtitle sentences", a large amount of multimodal information recorded in the past.

[Multimodal information extraction unit 103]
The multimodal information extraction unit 103 has an image extraction function and a subtitle sentence extraction function for multimodal information.
For each subtitle sentence in the sequence of training-data video, the multimodal information extraction unit 103 extracts one sampled image from the video. For example, for one subtitle sentence, the image of a single frame of the video at that point in time is extracted.
The image extracted from the subtitled video is output to the image feature vector generation unit 101.
The subtitle sentence extracted from the subtitled video is output to the language feature vector generation unit 102.
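One way to pair each subtitle sentence with a single frame is sketched below. It assumes each subtitle carries a start and end time in seconds and that the frame at the midpoint of that interval is sampled; the patent only says that one frame "at that point in time" is extracted, so the midpoint rule, the `Subtitle` type, and the function names are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Subtitle:
    start_sec: float  # time at which the subtitle appears
    end_sec: float    # time at which the subtitle disappears
    text: str


def sample_frame_index(subtitle, fps):
    """Index of the frame at the midpoint of the subtitle's display interval."""
    midpoint = (subtitle.start_sec + subtitle.end_sec) / 2.0
    return int(midpoint * fps)


def pair_frames_with_subtitles(subtitles, fps):
    """Return (frame_index, subtitle_text) pairs, one per subtitle sentence."""
    return [(sample_frame_index(s, fps), s.text) for s in subtitles]


subtitles = [
    Subtitle(2.0, 4.0, "手を切って、血が出た"),
    Subtitle(10.0, 11.0, "気をつけて"),
]
pairs = pair_frames_with_subtitles(subtitles, fps=30.0)
```

Each pair then splits into the two outputs described above: the frame goes to the image feature vector generation unit 101 and the text goes to the language feature vector generation unit 102.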

<Estimation device that estimates causal relation words from video <at estimation time>>
FIG. 6 is a functional configuration diagram of the estimation device at estimation time, when estimating causal relation words from video.

According to FIG. 6, the estimation device 1 communicates with a terminal 2 that serves as the user interface. The terminal 2 is equipped with a device capable of acquiring multimodal information around the user; at a minimum, it has a camera (or a connection interface to a television or display) that can capture the video the user is watching. Such a terminal 2 may be an ordinary smartphone or a robot such as "SOTA (registered trademark)" or "Unibo (registered trademark)". It may also be a camera-equipped smart speaker such as "Google Home (registered trademark)" or "Amazon Echo (registered trademark)".

According to FIG. 6, compared with FIG. 4, the estimation device further includes the multimodal information extraction unit 103 and receives the video to be estimated from the terminal 2. This video is input to the multimodal information extraction unit 103, which extracts an image. As in FIG. 4 described above, the image is input to the image feature vector generation unit 101.

That is, at training time in FIG. 5 described above, the subtitled video of the training data is processed, whereas at estimation time in FIG. 6, the estimation-target images received in real time through the communication interface are processed.

<Estimation device that returns a response sentence to an utterance according to video>
FIG. 7 is a functional configuration diagram of the estimation device at estimation time, when returning a response sentence to an utterance according to video.

According to FIG. 7, compared with FIG. 6 described above, the estimation device further includes the language feature vector generation unit 102, a response sentence estimation engine 13, a response sentence reranking unit 14, and a language conversion unit 15. These functional components are realized by executing a program that causes a computer installed in the device to function. The processing flow of these functional components can also be understood as the estimation method of the estimation device.

According to FIG. 7, compared with FIG. 6, the estimation device receives from the terminal 2 the user's utterance together with video as multimodal information around the user, and returns a response sentence to the user.
The terminal 2 is equipped with a microphone that picks up the user's speech and a speaker that outputs the response speech to the user. The voice recognition function, which converts the user's speech into a text-based utterance, and the voice conversion function, which converts a text-based response sentence into speech for the user, may be installed in either the terminal 2 or the estimation device 1. Of course, the utterance and the response sentence are not limited to speech and may instead be based on key input and on-screen display.
Furthermore, the multimodal information transmitted from the terminal 2 may have the user's speech mixed into the video. In that case, the multimodal information extraction unit 103 may separate and extract the video and the utterance.

FIG. 8 is an explanatory diagram showing the flow of text in a specific example of FIG. 7.

According to FIG. 8, character X, the dialogue agent of the estimation device 1, and user Y are having a dialogue. The video that user Y is watching and user Y's utterance are transmitted from the terminal 2 to the estimation device 1. The estimation device 1 in turn transmits to the terminal 2 the response sentence with which character X should reply. This allows user Y to converse by voice with character X displayed on the terminal 2.

The video is multimodal information that user Y and character X perceive in common. The dialogue between user Y and character X thus forms "pairs of utterance and response sentences" between people watching the video together.

[Language feature vector generation unit 102 (for utterances)]
The language feature vector generation unit 102 takes the user's utterance as input, splits it into morphemes by morphological analysis, and generates a language feature vector for each morpheme. Its function is the same as in FIG. 5. The language feature vectors are output to the response sentence estimation engine 13.
According to FIG. 8, the following utterance is input to the language feature vector generation unit 102.
User Y's utterance: "I'm not good at this, so I'm sure I'll cut my hand." (「得意でないので、絶対、手を切りそう」)

[Response sentence estimation engine 13]
At training time, the response sentence estimation engine 13 is trained, for each item of a dialogue corpus (pairs of utterance and response sentences) serving as training data, to take the utterance as input on the encoder side and to output the response sentence from the decoder side. Here, it learns the general features of the relationship between utterances and response sentences, irrespective of causal relationships.

FIG. 9 is an explanatory diagram of the response sentence estimation engine.

The response sentence estimation engine 13 may be a Seq2Seq model capable of extracting general-purpose features between utterances and response sentences, or an improved model such as Seq2Seq with attention or the Transformer.
Seq2Seq is a neural network that takes a morpheme sequence as input and learns replacement rules for outputting another morpheme sequence. In this way, it learns multiple response sentences for an utterance. Of course, it may instead be, for example, an LSTM (Long Short-Term Memory), a type of RNN (Recurrent Neural Network) that can learn dependencies within a sequence.

As a result, at estimation time, when the user's utterance is input on the encoder side, the response sentence estimation engine 13 outputs multiple candidate response sentences from the decoder side. These candidates are output to the response sentence reranking unit 14.

According to FIG. 8, assume that the response sentence estimation engine 13 receives the following utterance from the language feature vector generation unit 102.
User Y's utterance: "I'm not good at this, so I'm sure I'll cut my hand." (「得意でないので、絶対、手を切りそう」)
In response, the response sentence estimation engine 13 outputs the following candidate response sentences.
Candidate response sentence 1: "It's okay." (「大丈夫ですよ」)
Candidate response sentence 2: "Peel off the white skin too." (「白い皮も剥いて」)
Candidate response sentence 3: "You'll bleed." (「血が出るよ」)
Candidate response sentence 4: "Be careful." (「気をつけて」)
Candidate response sentence 5: "You're not good at it, are you?" (「得意じゃないですね」)

[Response sentence reranking unit 14]
The response sentence reranking unit 14 selects, from among the multiple candidate response sentences output by the response sentence estimation engine 13, a response sentence that contains or is similar to the result word output by the image learning engine 12. Because words are represented as feature vectors, their similarity can also be compared.
The selected response sentence is output to the language conversion unit 15.

According to FIG. 8, the response sentence reranking unit 14 receives the causal relation words (cut one's hand -> bleed) from the image learning engine 12 and selects the following response sentence from among the candidates.
Candidate response sentence 3: "You'll bleed." (「血が出るよ」)
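The "contains or is similar to" selection rule can be sketched as follows. The sketch first checks whether a candidate literally contains the result word and otherwise falls back to cosine similarity. Character-bigram counts stand in for the feature vectors mentioned in the text; that substitution, and the names `char_bigrams`, `cosine`, and `rerank`, are assumptions made only to keep the example self-contained.

```python
import math
from collections import Counter


def char_bigrams(text):
    """Character-bigram counts, a stand-in for real feature vectors."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def rerank(candidates, result_word):
    """Pick the candidate that contains, or is most similar to, the result word."""
    for candidate in candidates:
        if result_word in candidate:  # "contains" case
            return candidate
    target = char_bigrams(result_word)  # "similar" case
    return max(candidates, key=lambda c: cosine(char_bigrams(c), target))


candidates = [
    "大丈夫ですよ",
    "白い皮も剥いて",
    "血が出るよ",
    "気をつけて",
    "得意じゃないですね",
]
chosen = rerank(candidates, "血が出る")
```

With the candidates from FIG. 8 and the result word 「血が出る」, the containment check selects candidate 3. A result word that appears in no candidate, such as 「血が出た」, reaches the same candidate through the similarity fallback.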

[Language conversion unit 15]
The language conversion unit 15 performs the inverse of the function of the language feature vector generation unit 102 described above: it converts the feature vectors of the response sentence output from the response sentence reranking unit 14 into the text of the response sentence. The converted response sentence is transmitted to the terminal 2 via the communication interface.

As described in detail above, the program, device, and method of the present invention make it possible to estimate causal relation words (a cause word and a result word) from an image. By estimating causal relation words from the multimodal information around the user and replying to the user's utterance with a response sentence that reflects that causal relationship, the dialogue with the user can be made as natural as possible.
In particular, according to the present invention, by using a response sentence estimation engine that has learned the contextual relationship between utterances and response sentences in a general-purpose and comprehensive manner, it is possible to reply, from among multiple candidate response sentences, with a response sentence that has a causal relationship wherever possible.

With respect to the various embodiments of the present invention described above, those skilled in the art can readily make various changes, modifications, and omissions within the scope of the technical idea and viewpoint of the present invention. The foregoing description is merely an example and is not intended to be limiting in any way. The present invention is limited only by the claims and their equivalents.

1 Estimation device
101 Image feature vector generation unit
102 Language feature vector generation unit
103 Multimodal information extraction unit
11 Causal relationship learning engine
111 Subtitle sentence selection unit
112 Causal relation word estimation unit
12 Image learning engine
121 Generator
122 Discriminator
13 Response sentence estimation engine
14 Response sentence reranking unit
15 Language conversion unit
2 Terminal

Claims

1. A program that causes a computer to function so as to estimate, from an image, causal relation words consisting of a cause word and a result word, wherein
training data associates images with the subtitle sentences attached to those images,
in the training stage, the program causes the computer to function as
a causal relationship learning engine that takes feature vectors of subtitle sentences as input and estimates a cause word and a result word from each subtitle sentence estimated to have a causal relationship, and
an image learning engine that is trained to take the feature vector of an image as input and to output the cause word and result word estimated by the causal relationship learning engine, and
in the estimation stage, the program causes the computer to function such that the image learning engine takes an image as target data as input and outputs a cause word and a result word.

2. The program according to claim 1, wherein, in the training stage, the image learning engine is configured as a generative adversarial network, and the program causes the computer to function so as to train
a generator that takes the feature vector of an image as input, and
a discriminator that receives the cause word and result word output from the generator and the cause word and result word output from the causal relationship estimation means.

3. The program according to claim 2, wherein the program causes the computer to function such that the generator is based on the Transformer and the discriminator is based on a classification-type convolutional neural network.

4. The program according to any one of claims 1 to 3, wherein, at training time, the causal relationship learning engine causes the computer to function as
a subtitle sentence selection means in which conjunctive particles that connect the preceding and following sentences in a causal relationship are registered in advance, and which receives subtitle sentences from the training data and selects the subtitle sentences that contain a conjunctive particle, and
a causal relation word estimation means that inputs the selected subtitle sentences to an input layer and is trained as a multi-task deep learning model so that the cause word is output from a first output layer and the result word is output from a second output layer.

5. The program according to claim 4, wherein the program causes the computer to function such that the causal relation word estimation means comprises
an input layer,
an embedding layer,
a first recurrent network layer, a first discrimination layer, and a first output layer branching from the embedding layer, and
a second recurrent network layer, a second discrimination layer, and a second output layer branching from the embedding layer.

6. The program according to claim 1, wherein the program causes the computer to function such that the feature vectors are generated by a distributed representation generation algorithm.

7. The program according to any one of claims 1 to 6, wherein
at training time, the training data consists of a video and a series of dialogue history containing multiple pairs of utterance and response sentences exchanged between people viewing the video, and the program causes the computer to function as
a multimodal information extraction means that extracts an image from the video for each pair of utterance and response sentences in the dialogue history, and
a response sentence estimation engine that is trained, for each pair of utterance and response sentences, to take the utterance as input on the encoder side and to output the response sentence from the decoder side, and
at estimation time, the response sentence estimation engine takes the user's utterance as input and outputs multiple candidate response sentences, and the program causes the computer to function as
a response sentence reranking means that selects, from among the multiple candidate response sentences, a response sentence that contains or is similar to the result word output by the image learning engine.

8. The program according to claim 7, wherein the program causes the computer to function such that the response sentence estimation engine is a Seq2Seq model capable of extracting general-purpose features between utterances and response sentences.

9. An estimation device that estimates, from an image, causal relation words consisting of a cause word and a result word, wherein
training data associates images with the subtitle sentences attached to those images,
in the training stage, the device comprises
a causal relationship learning engine that takes feature vectors of subtitle sentences as input and estimates a cause word and a result word from each subtitle sentence estimated to have a causal relationship, and
an image learning engine that is trained to take the feature vector of an image as input and to output the cause word and result word estimated by the causal relationship learning engine, and
in the estimation stage, the image learning engine takes an image as target data as input and outputs a cause word and a result word.

10. An estimation method for a device that estimates, from an image, causal relation words consisting of a cause word and a result word, wherein
training data associates images with the subtitle sentences attached to those images,
the device,
in the training stage, comprises
a causal relationship learning engine that takes feature vectors of subtitle sentences as input and estimates a cause word and a result word from each subtitle sentence estimated to have a causal relationship, and
an image learning engine that is trained to take the feature vector of an image as input and to output the cause word and result word estimated by the causal relationship learning engine, and
in the estimation stage, the image learning engine takes an image as target data as input and outputs a cause word and a result word.