JP7457332B2

JP7457332B2 - Tree structure estimation device, parameter learning device, tree structure estimation method, parameter learning method, and program

Info

Publication number: JP7457332B2
Application number: JP2021035375A
Authority: JP
Inventors: 努平尾; 昌明永田; 健司福島; 英剛上垣外; 学奥村
Original assignee: Nippon Telegraph and Telephone Corp; Tokyo Institute of Technology NUC
Current assignee: Nippon Telegraph and Telephone Corp; Tokyo Institute of Technology NUC
Priority date: 2021-03-05
Filing date: 2021-03-05
Publication date: 2024-03-28
Anticipated expiration: 2041-03-05
Also published as: JP2022135518A

Description

The present invention relates to the field of computer vision, in which video is processed automatically by computer, and to the field of natural language processing, in which text is processed automatically. In particular, it relates to a technique that divides a video into events (scenes), assigns captions to them, and represents the relationships among them as a tree structure.

In the field of natural language processing, discourse structure analysis techniques that represent an entire document as a tree structure have been developed. In particular, techniques have been developed that represent a document as a tree based on Rhetorical Structure Theory.

A rhetorical structure represents the topic structure of a text as a tree, and such structure exists not only in text but also in video.

In other words, a video can be represented as a tree in which leaves represent event segments (scenes) and their captions, nodes represent the nuclearity roles of spans (scene sequences), and edges represent the rhetorical relations between spans. Unlike the case of text, however, the leaves of the tree are video segments and caption sentences, so the structure within a sentence need not be considered.

To obtain such a structure, the technique disclosed in Non-Patent Document 1 generates caption sentences for a video and applies a conventional rhetorical structure analysis technique to the obtained captions to obtain a tree structure. The caption sentences and the video frames are then associated with each other using an LSTM.

Non-Patent Document 1: Arjun R. Akula, Song-Chun Zhu: Visual Discourse Parsing, CVPR 2019 Workshop on Language and Vision (2019)

In the prior art, a rhetorical structure tree is constructed using text information only. When a video is represented as a rhetorical structure tree, however, the text does not necessarily describe everything in the video segment corresponding to a scene, so using text alone is not necessarily sufficient to determine the structure and relationships between scenes. In particular, the similarity between scenes is an important factor in determining the tree structure, yet text alone often fails to capture that similarity well.

The present invention has been made in view of the above points, and an object thereof is to provide a technique for generating, from a video, a rhetorical structure tree that appropriately reflects the similarity between scenes.

According to the disclosed technology, there is provided a tree structure estimation device comprising:
a feature extraction unit that extracts, from an input video, a video vector that is a feature of the video and a text vector that is a feature of text attached to the video;
a vector synthesis unit that generates a composite vector by combining the video vector and the text vector; and
a tree structure estimation unit that estimates, from the composite vector, a tree structure and labels representing a rhetorical structure tree of the video.

According to the disclosed technology, a technique is provided for generating, from a video, a rhetorical structure tree that appropriately reflects the similarity between scenes.

FIG. 1 is a diagram showing an example of a rhetorical structure tree.
FIG. 2 is a diagram showing an example of a rhetorical structure tree applied to a video.
FIG. 3 is a configuration diagram of the video discourse structure analysis device.
FIG. 4 is a configuration diagram of the data input unit.
FIG. 5 is a configuration diagram of the parameter learning unit.
FIG. 6 is a diagram showing an outline of the operation of the parameter learning unit.
FIG. 7 is a configuration diagram of the tree structure estimation unit.
FIG. 8 is a diagram showing an example of the hardware configuration of the device.

An embodiment of the present invention (the present embodiment) will be described below with reference to the drawings. The embodiment described below is merely an example, and embodiments to which the present invention is applied are not limited to the following embodiment.

(About rhetorical structure trees)
First, an example of a conventional rhetorical structure tree will be described. In Rhetorical Structure Theory, a document is represented as a binary tree (a rhetorical structure tree). A rhetorical structure tree is obtained by recursively repeating the operation of joining sequences of EDUs, the elementary discourse units that are its smallest constituents (such a sequence is hereinafter called a span), through rhetorical relations to form larger spans.

The leaves of the tree are EDUs (corresponding to clauses), and each node of the tree is given the nuclearity label of the span it dominates. Of the two spans being joined (sibling spans), one is the nucleus, which carries the important information, and the other is the satellite, which supplements it. Exceptionally, both may be nuclei. Each branch of the tree is given a relation label representing the rhetorical relation between spans. Eighteen relation labels are defined.

FIG. 1 shows an example of a rhetorical structure tree. In the figure, e_1 to e_7 are EDUs, S/N are the nuclearity labels of spans (N for nucleus, S for satellite), and Condition, Elaboration, and so on are the relation labels between sibling spans. When the nuclearity of the sibling spans is a combination of S and N, the relation label is given to the span on the S side; when it is N and N, it is given to both spans. Condition and Elaboration are given to combinations of S and N, while List and Same-Unit are given to combinations of N and N.

(Overview of the embodiment)
As mentioned above, rhetorical structure exists not only in text but also in video, and a video can be represented as a tree in which leaves represent events (scenes) and their captions, nodes represent the nuclearity roles of spans (scene sequences), and edges represent the rhetorical relations between spans. For example, the rhetorical structure of a video and the relations between its spans can be represented by a rhetorical structure tree in which each leaf (e) of the tree shown in FIG. 1 is replaced with a tuple of a scene and a caption.

FIG. 2 shows an example in which a video is represented as a rhetorical structure tree by means of a tree structure and labels. In the example of FIG. 2, the numbers inside "[" and "]" give the start and end times of the video segment that the scene represents, and c denotes the caption sentence corresponding to the scene.
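The tree described above can be held in a small data structure. The following is a minimal sketch, not taken from the patent: the class names `Scene` and `Span`, the field names, and the example captions and times are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Scene:
    """Leaf of the tree: one event segment (scene) and its caption."""
    start: float   # start time of the segment, in seconds
    end: float     # end time of the segment, in seconds
    caption: str


@dataclass
class Span:
    """Internal node joining two sibling spans."""
    left: "Node"
    right: "Node"
    nuclearity: str  # roles of (left, right): "N-S", "S-N" or "N-N"
    relation: str    # rhetorical relation label, e.g. "elaboration"


Node = Union[Scene, Span]


def leaves(node: Node) -> List[Scene]:
    """Return the scenes under a node in temporal (left-to-right) order."""
    if isinstance(node, Scene):
        return [node]
    return leaves(node.left) + leaves(node.right)


# Hypothetical three-scene video, for illustration only.
tree = Span(
    left=Scene(0.0, 10.0, "A man picks up a guitar."),
    right=Span(
        left=Scene(10.0, 60.0, "He plays a song."),
        right=Scene(60.0, 75.0, "The audience claps."),
        nuclearity="N-N",
        relation="temporal",
    ),
    nuclearity="N-S",
    relation="elaboration",
)
```

Because the leaves are whole scenes with captions, no structure below the sentence level is modelled, in line with the remark that the structure within a sentence need not be considered.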

As mentioned above, the conventional technique disclosed in Non-Patent Document 1 uses only text to determine the structure and relationships between scenes, and therefore often fails to capture the similarity between scenes well. In the present embodiment, the rhetorical structure and the relations of scenes are estimated using image information in addition to text information, which makes it possible to generate a rhetorical structure tree that appropriately reflects the similarity between scenes.

(Device configuration example and operation overview)
FIG. 3 shows a configuration example of the video discourse structure analysis device 100 in the present embodiment. As shown in FIG. 3, the video discourse structure analysis device 100 includes a data input unit 110, a tree structure estimation unit 120, a parameter learning unit 130, and a data output unit 140.

The video discourse structure analysis device 100 may be implemented on one computer or on multiple computers. Some or all of its functions may also be implemented on a virtual machine in the cloud. The data input unit 110, the tree structure estimation unit 120, the parameter learning unit 130, and the data output unit 140 may each be implemented as a separate device, in which case they may be called a data input device, a tree structure estimation device, a parameter learning device, and a data output device, respectively. The combination "data input unit 110 + tree structure estimation unit 120 + data output unit 140" may also be called a tree structure estimation device.

FIG. 3 also shows the flow of processing. As shown in FIG. 3, the data input unit 110 receives a video, generates span vectors from the video, and passes the generated span vectors to the tree structure estimation unit 120.

The parameter learning unit 130 receives a video annotated as shown in FIG. 2, that is, a video with annotations for scenes (segment times and captions) and annotations for rhetorical structure (tree structure and labels). Based on the annotated video, it learns the parameters of the neural networks used for rhetorical tree structure estimation, nuclearity label estimation, and relation label estimation, and passes the learned parameters to the tree structure estimation unit 120. The tree structure and labels given as annotations are used as the ground truth during parameter learning.

The tree structure estimation unit 120 receives the span vectors from the data input unit 110 and the parameters from the parameter learning unit 130, uses them to estimate the split points of the scene sequence and the labels, and passes the estimated split points and labels to the data output unit 140.

The data output unit 140 receives the split points and labels from the tree structure estimation unit 120 and outputs the tree, for example as an S-expression.

In the present embodiment, the data input unit 110, the tree structure estimation unit 120, the parameter learning unit 130, and the data output unit 140 are all assumed to be configured by neural networks. The configuration and processing of each unit are described in detail below.

(Data input unit 110)
The data input unit 110 receives as input the video for which a labeled tree is to be generated, and passes span vectors corresponding to scene sequences to the tree structure estimation unit 120. FIG. 4 shows the functional configuration of the data input unit 110. As shown in FIG. 4, the data input unit 110 includes a caption generation unit 111 and a span vector generation unit 112. The processing of each unit is as follows.

<Caption generation unit 111>
The caption generation unit 111 uses video captioning technology to identify each scene in the input video and to generate a caption for each identified scene. In other words, it provides the start and end times and the caption of each scene. Existing technology may be used for video captioning.

<Span vector generation unit 112>
The span vector generation unit 112 has the same video feature extraction unit, text feature extraction unit, and vector synthesis unit as those of the parameter learning unit 130 described later. Details of these units are given in the description of the parameter learning unit 130. The parameters of the video feature extraction unit, the text feature extraction unit, and the vector synthesis unit are those optimized by the parameter learning unit 130.

The start time and end time of each scene generated by the caption generation unit 111, together with the caption sentence for that scene, are input to the video feature extraction unit and to the text feature extraction unit. The feature vectors output by the video feature extraction unit and the text feature extraction unit are passed to the vector synthesis unit, which generates the span vectors.

(Parameter learning unit 130)
FIG. 5 shows the functional configuration of the parameter learning unit 130. As shown in FIG. 5, the parameter learning unit 130 has a feature extraction unit 131, a vector synthesis unit 134, a tree structure estimation processing unit 135, a label estimation unit 136, and a parameter optimization unit 137. The feature extraction unit 131 has a video feature extraction unit 132 and a text feature extraction unit 133.

FIG. 6 shows an outline of the operation of the parameter learning unit 130. An annotated video is given to both the video feature extraction unit 132 and the text feature extraction unit 133. An annotated video is a video to which the following are attached as annotations: data on the scenes in the video (start and end times and captions), the split points of spans (scene sequences), the nuclearity labels of spans, and the rhetorical relation labels between spans.

The video vector output from the video feature extraction unit 132 and the text vector output from the text feature extraction unit 133 are input to the vector synthesis unit 134, which combines them to generate span vectors. Based on the span vectors, the tree structure estimation processing unit 135, the label estimation unit 136, and the parameter optimization unit 137 output the parameters of the neural networks for estimating the tree structure, the nuclearity, and the relations. The processing of each unit is described in detail below.

<Video feature extraction unit 132>
The video feature extraction unit 132 generates, from the annotated video, a video vector corresponding to each scene by using vectors for the frames in the video (for example, per-frame feature vectors obtained by a method such as C3D or I3D) and an LSTM.

For example, if the start time of a scene is 0:10 and its end time is 1:00, the vectors corresponding to all frames covered by that segment are input to a forward LSTM and a backward LSTM.

Suppose a scene runs from 0:10 to 1:00 and the segment contains n frames, and let v_j be the vector corresponding to the j-th frame (a feature vector obtained by C3D or the like). Let the forward LSTM be ^→LSTM_f and the backward LSTM be ^←LSTM_f. For convenience of notation, an LSTM written with an arrow above it is written in the text of this specification as "^→LSTM_f" or "^←LSTM_f". "^→" and "^←" are used in the same way for other symbols.

Using the forward and backward hidden states for the j-th frame, the hidden state h^v_j of the j-th frame is expressed by the following equation, where [^→h^v_j; ^←h^v_j] denotes the concatenation of ^→h^v_j and ^←h^v_j.
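The equation itself is not reproduced in this text. A reconstruction consistent with the surrounding definitions, assuming the standard bidirectional LSTM recurrences (the recurrence arguments are an assumption; only the concatenation is stated in the description), would be:

```latex
\overrightarrow{h}^{v}_{j} = \overrightarrow{\mathrm{LSTM}}_{f}\bigl(\overrightarrow{h}^{v}_{j-1},\, v_{j}\bigr), \qquad
\overleftarrow{h}^{v}_{j} = \overleftarrow{\mathrm{LSTM}}_{f}\bigl(\overleftarrow{h}^{v}_{j+1},\, v_{j}\bigr), \qquad
h^{v}_{j} = \bigl[\overrightarrow{h}^{v}_{j};\, \overleftarrow{h}^{v}_{j}\bigr]
\tag{1}
```

Here the forward pass runs over frames 1 to n and the backward pass over frames n to 1, so each h^v_j carries context from both directions within the scene.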

The vector of the video segment corresponding to the scene is then defined as V = [h^v_1; h^v_n], where [h^v_1; h^v_n] denotes the concatenation of h^v_1 and h^v_n. The video feature extraction unit 132 outputs a video vector for each scene.

<Text feature extraction unit 133>
The text feature extraction unit 133 generates, from the annotated video, a text vector corresponding to the caption of each scene by using the embedding vectors of the words in the sentence and an LSTM.

The text feature extraction unit 133 obtains the embedding vectors of all words in the caption sentence and inputs them to a forward LSTM and a backward LSTM. As in the video feature extraction unit 132, the hidden state h^w_j of the j-th word is expressed by the following equation using the hidden state of the forward LSTM and the hidden state of the backward LSTM.

The vector representation of the whole sentence of k words is then defined as S = [h^w_1; h^w_k], where [h^w_1; h^w_k] denotes the concatenation of h^w_1 and h^w_k. The text feature extraction unit 133 outputs a text vector for the caption of each scene. Note that w in equation (2) is a word embedding vector.

<Vector synthesis unit 134>
For each scene, the vector synthesis unit 134 first combines the video vector V of the scene with the text vector S of its caption to generate a scene vector. Given the hidden state h^w_j of the j-th word of the caption corresponding to the scene, the video vector V, and the text vector S, a new hidden state h'^w_j of the j-th word is defined by the following equation using a selective gate.
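The selective-gate equation is not reproduced in this text. The sketch below assumes the usual selective-gate form, h'^w_j = h^w_j ⊙ σ(W_S h^w_j + U_S S + U_V V + b_S), which matches the parameters the description names (weight matrices W_S, U_S, U_V, bias vector b_S, sigmoid, Hadamard product) but is not confirmed by the source. The helper names and the toy dimensions are illustrative only.

```python
import math


def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def selective_gate(h_j, S, V, W_S, U_S, U_V, b_S):
    """Assumed form: h'_j = h_j * sigmoid(W_S h_j + U_S S + U_V V + b_S), elementwise."""
    pre = [
        a + b + c + d
        for a, b, c, d in zip(matvec(W_S, h_j), matvec(U_S, S), matvec(U_V, V), b_S)
    ]
    gate = [sigmoid(z) for z in pre]          # each gate value lies in (0, 1)
    return [g * h for g, h in zip(gate, h_j)]  # Hadamard product with h_j


# Toy example: 2-dim word state, 2-dim text vector, 3-dim video vector.
h_j = [1.0, -2.0]
S = [0.3, 0.1]
V = [0.5, 0.2, 0.9]
zeros_2x2 = [[0.0, 0.0], [0.0, 0.0]]
zeros_2x3 = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]

# With all-zero parameters the gate is 0.5 everywhere, so h'_j is half of h_j.
h_new = selective_gate(h_j, S, V, zeros_2x2, zeros_2x2, zeros_2x3, [0.0, 0.0])
```

Under this assumed form, the gate lets both the caption as a whole (S) and the video segment (V) decide how much of each word's hidden state is passed on, which is how image information enters the scene representation.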

W_S, U_S, and U_V are weight matrices, and b_S is a bias vector. σ denotes the sigmoid function, and the symbol consisting of a dot inside a circle (⊙) denotes the Hadamard product. Using equation (3), the scene vector of scene i is defined by the following equation (4).

Next, this is input to a forward LSTM and a backward LSTM to obtain the following hidden states.

In the above equation, f and b denote the hidden states of the forward and backward LSTMs, respectively. Finally, the span vector corresponding to the scene sequence from the m-th scene to the n-th scene is defined by the following equation.

Note that the scene vector may also be obtained without the above procedure, by simply concatenating V and S. In other words, the following equation may be used. The procedure for obtaining the span vector is the same as above, using equations (5) and (6).

<Tree structure estimation processing unit 135>
The tree structure estimation processing unit 135 estimates the tree structure by estimating the split points of spans. For an arbitrary span (the sequence of scenes from the i-th scene to the j-th scene), the score s_split(i, j, k) for splitting the span at the k-th scene is given by the following equation.

Here, W_u is a weight matrix, and v_l (the subscript l is a lowercase L) and v_r are weight vectors for the left and right spans produced by the split, respectively. h_{i:k} and h_{k+1:j} are defined as follows.

h_{i:k} = MLP_left(u_{i:k}), h_{k+1:j} = MLP_right(u_{k+1:j}),
where MLP_* denotes a multilayer perceptron. The span vector u_{i:j} is the vector obtained by the vector synthesis unit 134. As in the following equation, the span is split at the k that maximizes equation (8).
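Equations (8) and (9) are not reproduced in this text. The sketch below assumes a biaffine score, s_split(i, j, k) = h_{i:k}^T W_u h_{k+1:j} + v_l · h_{i:k} + v_r · h_{k+1:j}, followed by an argmax over k from i to j-1. This form is consistent with the symbols the description defines (one weight matrix and one weight vector per side) but is an assumption, and the function names and the toy features are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matvec(M, x):
    return [dot(row, x) for row in M]


def split_score(h_left, h_right, W_u, v_l, v_r):
    """Assumed biaffine form: h_left^T W_u h_right + v_l . h_left + v_r . h_right."""
    return dot(h_left, matvec(W_u, h_right)) + dot(v_l, h_left) + dot(v_r, h_right)


def best_split(i, j, h_left_of, h_right_of, W_u, v_l, v_r):
    """Score every split point k in i..j-1 and return the best k with all scores.

    h_left_of(a, b) and h_right_of(a, b) stand in for MLP_left(u_{a:b}) and
    MLP_right(u_{a:b}); here they are plain callables supplied by the caller.
    """
    scores = {
        k: split_score(h_left_of(i, k), h_right_of(k + 1, j), W_u, v_l, v_r)
        for k in range(i, j)
    }
    return max(scores, key=scores.get), scores


def length_feature(a, b):
    """Toy 1-dim span feature: the number of scenes in the span a..b."""
    return [float(b - a + 1)]


# Six scenes; with this toy feature the score is len(left) * len(right).
k_hat, scores = best_split(
    1, 6, length_feature, length_feature, W_u=[[1.0]], v_l=[0.0], v_r=[0.0]
)
```

With the toy feature the balanced split at k = 3 wins, giving the spans 1 to 3 and 4 to 6. A trained model would instead use the span vectors from the vector synthesis unit 134, so the chosen split would depend on the video and caption content.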

<Label estimation unit 136>
For the split point k of a span determined by the tree structure estimation processing unit 135, the label estimation unit 136 predicts the nuclearity label and the rhetorical relation label for the two resulting spans. The prediction score is given by the following equation.

In the above equation, W_l (the subscript l is a lowercase L) is a weight matrix, and u_{1:i} and u_{j:n} are the span vector to the left of the i-th scene and the span vector to the right of the j-th scene, respectively. Finally, the label that maximizes equation (10) is assigned by the following equation.

L is the label set. When nuclearity labels are assigned, it is the set of three labels {N-S, S-N, N-N}; when rhetorical relation labels are assigned, it is the set of the following 18 labels: elaboration, joint, attribution, same-unit, contrast, temporal, background, explanation, cause, evaluation, condition, enablement, topic-comment, comparison, summary, manner-means, topic-change, textual-organization.

Note that W_l and the MLP are optimized independently for assigning nuclearity labels and for assigning rhetorical relation labels.
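Equations (10) and (11) are not reproduced in this text. The sketch below assumes that the label scores are obtained by applying W_l to an MLP over the concatenation of the two sibling span vectors and the two outer context vectors u_{1:i} and u_{j:n}, and that the highest-scoring label is chosen. The exact inputs to the MLP are an assumption; the function name, the toy MLP, and the toy weights are hypothetical.

```python
NUCLEARITY_LABELS = ["N-S", "S-N", "N-N"]


def matvec(M, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def predict_label(u_left, u_right, u_outer_left, u_outer_right, mlp, W_l, labels):
    """Score every label and return the best one with the score vector.

    Assumed form: scores = W_l * MLP([u_left; u_right; u_outer_left; u_outer_right]).
    """
    x = u_left + u_right + u_outer_left + u_outer_right  # vector concatenation
    scores = matvec(W_l, mlp(x))                         # one score per label
    best = max(range(len(labels)), key=scores.__getitem__)
    return labels[best], scores


def toy_mlp(x):
    """Toy stand-in for the MLP: a single ReLU with no weights."""
    return [max(0.0, v) for v in x]


# One row of W_l per nuclearity label; the values are arbitrary.
W_l = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
label, scores = predict_label([1.0], [0.5], [0.0], [2.0], toy_mlp, W_l, NUCLEARITY_LABELS)
```

Since W_l and the MLP are optimized independently for the two label types, the same function would be called twice per split: once with the three nuclearity labels and once with the 18 relation labels, each with its own parameters.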

<Parameter optimization unit 137>
The parameter optimization unit 137 obtains all parameters to be learned, namely W_S, U_S, U_V, W_u, W_l, v_r, v_l, and the parameters of the LSTMs and MLPs, by minimizing the sum of the two loss functions defined below. Here k^* and l^* (l is a lowercase L) are the correct split position and the correct label, respectively. The correct split position and label are obtained from the annotations of the input annotated video. The loss function can be minimized using an existing method such as backpropagation.
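The two loss functions are not reproduced in this text. A minimal sketch, assuming each is the negative log-likelihood of the correct answer under a softmax over the corresponding scores (a common choice, not confirmed by the source), could look as follows. The function names are hypothetical.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax over a list of scores."""
    m = max(scores)
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_sum for s in scores]


def parse_loss(split_scores, gold_split_index, label_scores, gold_label_index):
    """Sum of the split loss and the label loss for one span.

    split_scores: scores for the candidate split points of the span.
    gold_split_index: position of the correct split k* within split_scores.
    label_scores: scores for the candidate labels.
    gold_label_index: position of the correct label l* within label_scores.
    """
    loss_split = -log_softmax(split_scores)[gold_split_index]
    loss_label = -log_softmax(label_scores)[gold_label_index]
    return loss_split + loss_label


# With uniform scores over 2 splits and 3 labels the loss is log 2 + log 3.
uniform_loss = parse_loss([0.0, 0.0], 0, [0.0, 0.0, 0.0], 1)

# Raising the score of the correct split lowers the loss.
better_loss = parse_loss([2.0, 0.0], 0, [0.0, 0.0, 0.0], 1)
```

In training, such a loss would be accumulated over every span of every annotated tree and minimized with respect to all the listed parameters by gradient descent with backpropagation.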

(Tree structure estimation unit 120)
The tree structure estimation unit 120 estimates the tree structure using the parameters output by the parameter learning unit 130 and the span vectors output by the data input unit 110. FIG. 7 shows the functional configuration of the tree structure estimation unit 120. As shown in FIG. 7, the tree structure estimation unit 120 includes a tree structure estimation processing unit 121 and a label estimation unit 122. Each unit is described below.

<Tree structure estimation processing unit 121>
The tree structure estimation processing unit 121 is the same as the tree structure estimation processing unit 135 of the parameter learning unit 130. As its initial state, it receives the span vector corresponding to the whole video as input, and obtains the tree structure by recursively splitting it in two. When the number of scenes is m, the split point is determined by equation (9), with i = 1 and j = m in equation (8) using the parameters determined by the parameter learning unit 130. This is repeated recursively.

For example, if the target video is assumed to have the tree structure shown in FIG. 2, it is first split into the span c_1 to c_3 and the span c_4 to c_6, and each resulting span is then split further as illustrated.
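The recursive top-down procedure can be sketched as follows. This is an illustrative outline only: `choose_split` and `choose_labels` are hypothetical callables standing in for the trained split scorer and label classifiers, and the midpoint rule and fixed labels in the example are placeholders, not learned behavior.

```python
def build_tree(i, j, choose_split, choose_labels):
    """Recursively split the span of scenes i..j (1-indexed, inclusive).

    choose_split(i, j) must return a split point k with i <= k < j.
    choose_labels(i, j, k) must return (nuclearity, relation) for the
    two resulting spans i..k and k+1..j.
    A single scene is returned as its index; an internal node is the tuple
    (nuclearity, relation, left_subtree, right_subtree).
    """
    if i == j:
        return i
    k = choose_split(i, j)
    nuclearity, relation = choose_labels(i, j, k)
    left = build_tree(i, k, choose_split, choose_labels)
    right = build_tree(k + 1, j, choose_split, choose_labels)
    return (nuclearity, relation, left, right)


def leaf_ids(node):
    """Return the scene indices under a node, left to right."""
    if isinstance(node, int):
        return [node]
    return leaf_ids(node[2]) + leaf_ids(node[3])


# Placeholder decisions: always split in the middle and always use one label pair.
tree = build_tree(
    1,
    6,
    choose_split=lambda i, j: (i + j) // 2,
    choose_labels=lambda i, j, k: ("N-N", "joint"),
)
```

With six scenes and the midpoint placeholder, the first split separates scenes 1 to 3 from scenes 4 to 6, mirroring the example above. Each span of two or more scenes is split exactly once, so a video with m scenes needs m - 1 split decisions.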

<Label estimation unit 122>
The label estimation unit 122 is likewise the same as the label estimation unit 136 of the parameter learning unit 130. It receives the two span vectors produced by the split in the tree structure estimation processing unit 121 and estimates the nuclearity label and the relation label. When estimating the nuclearity label, it classifies into one of N-S, S-N, and N-N; when estimating the rhetorical relation label, it classifies into one of the 18 labels.

(Data output unit 140)
The data output unit 140 combines the span split points estimated by the tree structure estimation processing unit 121 with the span label information output by the label estimation unit 122, and outputs the result as a labeled tree, for example as an S-expression.
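The description does not specify the S-expression layout. The sketch below shows one possible serialization, under the assumption that a leaf is written as `(scene START END "CAPTION")` and an internal node as `(NUCLEARITY:RELATION LEFT RIGHT)`. This layout, the function name, and the example captions are all hypothetical.

```python
def to_sexpr(node):
    """Serialize a labeled tree to an S-expression string.

    A leaf is a dict with keys "start", "end" and "caption".
    An internal node is a tuple (nuclearity, relation, left, right).
    """
    if isinstance(node, dict):
        # Escape backslashes and double quotes so the caption stays one token.
        caption = node["caption"].replace("\\", "\\\\").replace('"', '\\"')
        return '(scene {:g} {:g} "{}")'.format(node["start"], node["end"], caption)
    nuclearity, relation, left, right = node
    return "({}:{} {} {})".format(nuclearity, relation, to_sexpr(left), to_sexpr(right))


# Hypothetical two-scene tree, for illustration only.
tree = (
    "N-S",
    "elaboration",
    {"start": 0, "end": 10, "caption": "A man picks up a guitar."},
    {"start": 10, "end": 60, "caption": "He plays a song."},
)
text = to_sexpr(tree)
# text == '(N-S:elaboration (scene 0 10 "A man picks up a guitar.") (scene 10 60 "He plays a song."))'
```

An S-expression keeps the nesting of the binary tree explicit through parentheses, so the split points need not be stored separately: they are implied by where each left subtree ends.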

(Example of device hardware configuration)
The video discourse structure analysis device 100, the data input unit 110, the tree structure estimation unit 120, the parameter learning unit 130, the data output unit 140, the data input device, the tree structure estimation device, the parameter learning device, and the data output device (collectively referred to as the "device") can each be realized, for example, by causing a computer to execute a program that describes the processing explained in the present embodiment.

The above program can be recorded on a computer-readable recording medium (such as a portable memory) for storage or distribution. The program can also be provided over a network, such as the Internet or e-mail.

FIG. 8 is a diagram showing an example of the hardware configuration of the computer. The computer in FIG. 8 has a drive device 1000, an auxiliary storage device 1002, a memory device 1003, a CPU 1004, an interface device 1005, a display device 1006, an input device 1007, an output device 1008, and the like, which are connected to one another by a bus BS.

A program that implements the processing on the computer is provided, for example, on a recording medium 1001 such as a CD-ROM or a memory card. When the recording medium 1001 storing the program is set in the drive device 1000, the program is installed from the recording medium 1001 into the auxiliary storage device 1002 via the drive device 1000. However, the program does not necessarily have to be installed from the recording medium 1001 and may instead be downloaded from another computer via a network. The auxiliary storage device 1002 stores the installed program as well as necessary files, data, and the like.

When an instruction to start the program is given, the memory device 1003 reads the program from the auxiliary storage device 1002 and stores it. The CPU 1004 implements the functions of the device according to the program stored in the memory device 1003. The interface device 1005 is used as an interface for connecting to a network. The display device 1006 displays a GUI (Graphical User Interface) or the like provided by the program. The input device 1007 consists of a keyboard and mouse, buttons, a touch panel, or the like, and is used to input various operation instructions. The output device 1008 outputs the computation results.

(Effects of the embodiment)
As described above, this embodiment takes a video as input and can output a rhetorical structure tree in which scenes are the leaves and the relationships between scenes are expressed by nuclearity and relation labels. In particular, because the rhetorical structure and relations of the scenes are estimated from image information as well as text information, a rhetorical structure tree that appropriately reflects the similarity between scenes can be generated from a video.

(Summary of embodiments)
This specification discloses at least the tree structure estimation device, parameter learning device, tree structure estimation method, parameter learning method, and program set out in the following items.
(Item 1)
A tree structure estimation device comprising:
a feature extraction unit that extracts, from an input video, a video vector that is a feature of the video and a text vector that is a feature of text attached to the video;
a vector synthesis unit that synthesizes the video vector and the text vector to generate a composite vector; and
a tree structure estimation unit that estimates, from the composite vector, a tree structure representing a rhetorical structure tree of the video and a label.
(Item 2)
The tree structure estimation device according to item 1, wherein the leaves of the rhetorical structure tree correspond to scenes of the video and the captions for those scenes, the node labels correspond to the nuclearity of scene sequences, and the edge labels correspond to rhetorical relations between scene sequences.
(Item 3)
The tree structure estimation device according to item 1 or 2, wherein the feature extraction unit creates the video vector corresponding to a scene by inputting the feature vector of each frame constituting the scene into an LSTM.
(Item 4)
A parameter learning device comprising:
a feature extraction unit that extracts, from an input video, a video vector that is a feature of the video and a text vector that is a feature of text attached to the video;
a vector synthesis unit that synthesizes the video vector and the text vector to generate a composite vector;
a tree structure estimation unit that estimates, based on the composite vector, a tree structure representing a rhetorical structure tree of the video and a label; and
a parameter optimization unit that optimizes parameters of a neural network corresponding to the tree structure estimation unit, using the tree structure and the label estimated by the tree structure estimation unit and correct answer data representing the rhetorical structure tree of the video.
(Item 5)
A tree structure estimation method executed by a tree structure estimation device, comprising:
a feature extraction step of extracting, from an input video, a video vector that is a feature of the video and a text vector that is a feature of text attached to the video;
a vector synthesis step of synthesizing the video vector and the text vector to generate a composite vector; and
a tree structure estimation step of estimating, from the composite vector, a tree structure representing a rhetorical structure tree of the video and a label.
(Item 6)
A parameter learning method executed by a parameter learning device, comprising:
a feature extraction step of extracting, from an input video, a video vector that is a feature of the video and a text vector that is a feature of text attached to the video;
a vector synthesis step of synthesizing the video vector and the text vector to generate a composite vector;
a tree structure estimation step of estimating, based on the composite vector, a tree structure representing a rhetorical structure tree of the video and a label; and
a parameter optimization step of optimizing parameters of a neural network that executes the tree structure estimation step, using the tree structure and the label estimated in the tree structure estimation step and correct answer data representing the rhetorical structure tree of the video.
(Item 7)
A program for causing a computer to function as each unit of the tree structure estimation device according to any one of items 1 to 3.
(Item 8)
A program for causing a computer to function as each unit of the parameter learning device according to item 4.
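The estimation flow summarized in the items above (compose a video vector and a text vector per scene, then derive a binary tree over the scenes) can be sketched as follows. This is only an illustration under stated assumptions: "synthesis" is taken to be plain concatenation, a span is represented by the mean of its scene vectors, and `split_score` is a hand-written stand-in for the trained neural network, which this sketch does not reproduce. All function names are hypothetical.

```python
from typing import List, Sequence, Tuple, Union

Vector = List[float]
# A tree is either a scene index (leaf) or a (left, right) pair of subtrees.
Tree = Union[int, Tuple["Tree", "Tree"]]


def compose(video_vec: Sequence[float], text_vec: Sequence[float]) -> Vector:
    """Assumed synthesis: concatenate the video vector and the text vector."""
    return list(video_vec) + list(text_vec)


def span_mean(vectors: Sequence[Vector], start: int, end: int) -> Vector:
    """Assumed span vector: the mean of the composite vectors of scenes start..end-1."""
    n = end - start
    dim = len(vectors[start])
    return [sum(vectors[i][d] for i in range(start, end)) / n for d in range(dim)]


def split_score(vectors: Sequence[Vector], start: int, k: int, end: int) -> float:
    """Stand-in scorer: squared distance between the means of the two sub-spans."""
    left = span_mean(vectors, start, k)
    right = span_mean(vectors, k, end)
    return sum((a - b) ** 2 for a, b in zip(left, right))


def build_tree(vectors: Sequence[Vector], start: int, end: int) -> Tree:
    """Greedy top-down splitting of the span [start, end) into a binary tree."""
    if end - start == 1:
        return start  # a single scene is a leaf
    # Pick the split point with the highest score, then recurse on both halves.
    k = max(range(start + 1, end), key=lambda s: split_score(vectors, start, s, end))
    return (build_tree(vectors, start, k), build_tree(vectors, k, end))
```

With three scenes whose composite vectors are `[0.0, 0.0]`, `[0.1, 0.0]`, and `[5.0, 5.0]`, the sketch groups the two similar scenes first and returns `((0, 1), 2)`. Label estimation for each split and the parameter optimization of items 4 and 6 are outside the scope of this sketch.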

Although the present embodiment has been described above, the present invention is not limited to this specific embodiment, and various modifications and changes are possible within the scope of the gist of the present invention as set forth in the claims.

100 Video discourse structure analysis device
110 Data input unit
111 Caption generation unit
112 Span vector generation unit
120 Tree structure estimation unit
121 Tree structure estimation processing unit
122 Label estimation unit
130 Parameter learning unit
131 Feature extraction unit
132 Video feature extraction unit
133 Text feature extraction unit
134 Vector synthesis unit
135 Tree structure estimation processing unit
136 Label estimation unit
137 Parameter optimization unit
140 Data output unit
1000 Drive device
1001 Recording medium
1002 Auxiliary storage device
1003 Memory device
1004 CPU
1005 Interface device
1006 Display device
1007 Input device
1008 Output device

Claims

1. A tree structure estimation device comprising:
a feature extraction unit that extracts, from an input video, a video vector that is a feature of the video and a text vector that is a feature of text attached to the video;
a vector synthesis unit that synthesizes the video vector and the text vector to generate a composite vector; and
a tree structure estimation unit that estimates, from the composite vector, a tree structure representing a rhetorical structure tree of the video and a label.

2. The tree structure estimation device according to claim 1, wherein a leaf of the rhetorical structure tree corresponds to a scene of the video and a caption for the scene, a node label corresponds to the nuclearity of a scene sequence, and an edge label corresponds to a rhetorical relation between scene sequences.

3. The tree structure estimation device according to claim 1 or 2, wherein the feature extraction unit creates the video vector corresponding to a scene by inputting the feature vector of each frame constituting the scene into an LSTM.

4. A parameter learning device comprising:
a feature extraction unit that extracts, from an input video, a video vector that is a feature of the video and a text vector that is a feature of text attached to the video;
a vector synthesis unit that synthesizes the video vector and the text vector to generate a composite vector;
a tree structure estimation unit that estimates, based on the composite vector, a tree structure representing a rhetorical structure tree of the video and a label; and
a parameter optimization unit that optimizes parameters of a neural network corresponding to the tree structure estimation unit, using the tree structure and the label estimated by the tree structure estimation unit and correct answer data representing the rhetorical structure tree of the video.

5. A tree structure estimation method executed by a tree structure estimation device, comprising:
a feature extraction step of extracting, from an input video, a video vector that is a feature of the video and a text vector that is a feature of text attached to the video;
a vector synthesis step of synthesizing the video vector and the text vector to generate a composite vector; and
a tree structure estimation step of estimating, from the composite vector, a tree structure representing a rhetorical structure tree of the video and a label.

6. A parameter learning method executed by a parameter learning device, comprising:
a feature extraction step of extracting, from an input video, a video vector that is a feature of the video and a text vector that is a feature of text attached to the video;
a vector synthesis step of synthesizing the video vector and the text vector to generate a composite vector;
a tree structure estimation step of estimating, based on the composite vector, a tree structure representing a rhetorical structure tree of the video and a label; and
a parameter optimization step of optimizing parameters of a neural network that executes the tree structure estimation step, using the tree structure and the label estimated in the tree structure estimation step and correct answer data representing the rhetorical structure tree of the video.

7. A program for causing a computer to function as each unit of the tree structure estimation device according to any one of claims 1 to 3.

8. A program for causing a computer to function as each unit of the parameter learning device according to claim 4.