JP2023103966A

JP2023103966A - Method and device for constructing transformer model for video story question answering

Info

Publication number: JP2023103966A
Application number: JP2022199912A
Authority: JP
Inventors: Byoung-Tak Zhang; Seongho Choi
Original assignee: Seoul National University Industry Foundation; Seoul National University R&DB Foundation
Current assignee: Seoul National University Industry Foundation; SNU R&DB Foundation
Priority date: 2022-01-14
Filing date: 2022-12-15
Publication date: 2023-07-27
Also published as: WO2023136417A1; KR20230109931A

Abstract

To provide a device, a method and a program for constructing a transformer model for video story question answering, which learn a video story by considering the context of prior and subsequent video clips included in video data. SOLUTION: The device comprises: an input-output unit for receiving video data, which includes a plurality of consecutive video clips, and question data for video question answering, and outputting the result from video story question answering; a storage unit in which a program and data for performing the video story question answering are stored; and a control unit that includes at least one processor and constructs a transformer model for the video story question answering by executing the program. The control unit learns a video story from the video data including the plurality of consecutive video clips by considering the context of neighboring prior and subsequent video clips according to chronological order. SELECTED DRAWING: Figure 3

Description

Application of Article 30, Paragraph 2 of the Patent Act has been requested. December 20, 2021, Korea Software Congress 2021 (Proceedings of the Korea Software Congress 2021, Korean Institute of Information Scientists and Engineers), "270. Multimodal Contextual Transformer for Video Question Answering" (pages 801-803).

The embodiments disclosed herein relate to an apparatus and method for constructing a transformer model for video story question answering, and more particularly to an apparatus and method for constructing a transformer model for video story question answering that learns a video story by considering the context before and after the video clips included in video data.

This research was carried out as a result of two projects of the Ministry of Science and ICT and the Institute of Information & Communications Technology Planning & Evaluation (IITP): "Development of Question Answering Technology Based on Video Story Understanding at a Level That Passes the Video Turing Test (IITP-2017-0-01772-005)" under the Innovation Growth Engine Project, and "(SW Star Lab) Development of Cognitive Agent SW Based on Everyday Life Learning (IITP-2015-0-00310-007)" under the SW Computing Industry Source Technology Development Program.

Recently, thanks to advances in deep learning for vision and natural language processing, interest in multimodal data has grown, and many kinds of tasks that measure understanding of video data have attracted attention.

Among these tasks, video question answering (Video Question Answering) measures video comprehension by the accuracy achieved on five-choice multiple-choice questions posed in natural language. In particular, solving video question answering requires learning the complex correlations among the various data appearing in a multimodal video and finding the key information for a given question and answer.

Recently, to address this, models that perform large-scale learning based on the transformer (transformer) have been introduced. Large-scale pre-training for video has shown considerable performance on a variety of tasks for evaluating video comprehension, and it has been built on models that performed well in natural language processing. However, according to the prior art, in a transformer that performs video story question answering, the computational cost grows geometrically as the video becomes longer, so only short videos could be processed.

Meanwhile, the background art described above is technical information that the inventors possessed for, or acquired in the course of, deriving the present invention, and it is not necessarily known art disclosed to the general public before the filing of the present invention.

Korean Patent Publication No. 10-2020-0144417

An object of the embodiments disclosed herein is to provide an apparatus and method for constructing a transformer model for video story question answering that learns a video story by considering the context before and after the video clips included in video data.

Other objects and advantages of the present invention can be understood from the following description and will become clearer from the embodiments. It will also be readily understood that the objects and advantages of the present invention can be realized by the means set forth in the claims and combinations thereof.

As a technical means for achieving the above technical object, an apparatus for constructing a transformer model for video story question answering includes: an input/output unit for receiving video data including a plurality of consecutive video clips and question data for video question answering, and for outputting a result of the video story question answering; a storage unit for storing a program and data for performing the video story question answering; and a control unit that includes at least one processor and constructs the transformer model for video story question answering by executing the program, wherein the control unit causes a video story to be learned from the video data including the plurality of consecutive video clips by considering the context of the preceding and succeeding video clips that are adjacent to each other in temporal order.

According to another embodiment, a method of constructing a transformer model for video story question answering, performed by an apparatus for constructing a transformer model for video story question answering, includes: receiving video data including a plurality of consecutive video clips and question data for video question answering; and causing a video story to be learned from the video data including the plurality of consecutive video clips by considering the context of the preceding and succeeding video clips that are adjacent to each other in temporal order.

According to yet another embodiment, a recording medium is a computer-readable recording medium on which a program for executing a method of constructing a transformer model for video story question answering is recorded. The method, performed by an apparatus for constructing a transformer model for video story question answering, includes: receiving video data including a plurality of consecutive video clips and question data for video question answering; and causing a video story to be learned from the video data including the plurality of consecutive video clips by considering the context of the preceding and succeeding video clips that are adjacent to each other in temporal order.

According to yet another embodiment, a computer program is executed by an apparatus for constructing a transformer model for video story question answering and is recorded on a recording medium in order to perform a method of constructing a transformer model for video story question answering. The method, performed by the apparatus, includes: receiving video data including a plurality of consecutive video clips and question data for video question answering; and causing a video story to be learned from the video data including the plurality of consecutive video clips by considering the context of the preceding and succeeding video clips that are adjacent to each other in temporal order.

According to any one of the means described above, a transformer that considers the context before and after the video clips included in video data is constructed to perform video story question answering, so that long videos can be processed effectively without incurring a large computational cost.

According to another of the means described above, constructing a transformer that considers the context before and after the video clips included in video data makes it usable not only for video story question answering but also in various other fields, such as predicting subsequent scenes of a video and inferring causal relationships.

The effects obtainable from the disclosed embodiments are not limited to those mentioned above, and other effects not mentioned will be clearly understood from the following description by a person of ordinary skill in the technical field to which the disclosed embodiments belong.

The accompanying drawings illustrate preferred embodiments disclosed in this specification and, together with the detailed description, serve to further the understanding of the technical ideas disclosed herein. The contents disclosed in this specification should therefore not be construed as limited to the matters shown in the drawings.

FIG. 1 is a diagram for explaining a transformer model according to the prior art.
FIG. 2 is a diagram for explaining a transformer model according to one embodiment.
FIG. 3 is a functional block diagram of an apparatus for constructing a transformer model for video story question answering according to one embodiment.
FIG. 4 is a flowchart for explaining a method of constructing a transformer model for video story question answering according to one embodiment.

Various embodiments are described in detail below with reference to the accompanying drawings. The embodiments described below may be modified and implemented in various different forms. To describe the features of the embodiments more clearly, detailed descriptions of matters widely known to a person of ordinary skill in the technical field to which the embodiments belong are omitted. In the drawings, parts unrelated to the description of the embodiments are omitted, and similar parts are given similar reference numerals throughout the specification.

Throughout the specification, when one component is said to be connected to another, this includes not only the case where they are directly connected but also the case where they are connected with another component in between. Likewise, when one component is said to include another, this means that, unless specifically stated otherwise, it may further include other components rather than excluding them.

Embodiments are described in detail below with reference to the accompanying drawings.

FIG. 1 is a diagram for explaining a transformer model according to the prior art.

FIG. 1 shows a transformer model according to the prior art, namely the structure of a transformer model used for video representation (representation) learning. Here, the transformer model may be a vanilla transformer (Vanilla Transformer). The transformer shown in FIG. 1 may be configured such that the encoders 100 are separated at each layer (layer) over all video frames. In the transformer shown in FIG. 1, the encoders 100 separated for the respective sections S_1, S_2, and S_3 may be temporal transformers (Temporal Transformer). Video story question answering can be performed using the transformer model shown in FIG. 1. However, because the encoders 100 separated for the sections S_1, S_2, and S_3 do not consider the context before and after the video clips included in the input video data, the computational cost grows geometrically as the video becomes longer, and the model has therefore been used only for video story question answering on short videos. A transformer that can process long videos more effectively was thus needed, and a transformer that considers the context before and after the video clips included in the video data was constructed. This transformer according to one embodiment is described in more detail below with reference to FIGS. 2 and 3.
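The text does not state a cost model. As a rough illustration only, assuming the usual self-attention cost in which every query token is scored against every token it can see, the sketch below compares one encoder attending over the whole video with per-section encoders that see only their own section and its immediate neighbours. The function names and the figure of 64 tokens per section are hypothetical.

```python
def full_attention_pairs(num_sections: int, tokens_per_section: int) -> int:
    """Token pairs scored when a single encoder attends over the whole video."""
    total_tokens = num_sections * tokens_per_section
    return total_tokens * total_tokens


def neighbour_attention_pairs(num_sections: int, tokens_per_section: int) -> int:
    """Token pairs scored when each section attends to itself and its
    immediately preceding and following sections only (an assumption)."""
    pairs = 0
    for i in range(num_sections):
        first = max(0, i - 1)
        last = min(num_sections - 1, i + 1)
        visible_sections = last - first + 1
        pairs += tokens_per_section * (visible_sections * tokens_per_section)
    return pairs


for sections in (3, 30, 300):
    print(sections,
          full_attention_pairs(sections, 64),
          neighbour_attention_pairs(sections, 64))
```

Under these assumptions the first count grows with the square of the number of sections, while the second grows only linearly.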

FIG. 2 is a diagram for explaining a transformer model according to one embodiment.

FIG. 2 shows a transformer model according to one embodiment, namely the structure of a transformer model used for video representation (representation) learning. Here, the transformer model may be a contextual transformer (Contextual Transformer). The transformer shown in FIG. 2 may be configured such that the encoders 200 are separated at each layer (layer) over all video frames. In the transformer shown in FIG. 2, the encoders 200 separated for the respective sections S_1, S_2, and S_3 learn the video story by considering the context before and after the video clips included in the input video data, so the number of preceding and succeeding video clips that can be considered changes as the layer becomes higher. Here, a video clip may mean a short recorded moving picture. For example, in the second layer, two sections can be considered for each of sections S_1 and S_3, and three sections can be considered for section S_2. The video data may include a plurality of consecutive video clips, and each video clip may include a plurality of visual tokens (visual token) and text tokens (text token). In the contextual transformer shown in FIG. 2, the encoders 200 separated for the respective sections S_1, S_2, and S_3 may be cross-modal transformers (Cross-modal Transformer), each of which can receive as input the visual tokens and text tokens corresponding to its section. Because the encoders 200 separated for the sections S_1, S_2, and S_3 learn the video story by considering the context before and after the video clips included in the input video data, the transformer shown in FIG. 2 can process long videos effectively without incurring a large computational cost. In addition, constructing a transformer that considers the context before and after the video clips included in the video data makes it usable not only for video story question answering but also in various other fields, such as predicting subsequent scenes of a video and inferring causal relationships.
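The second-layer example (two sections for S_1 and S_3, three for S_2) can be reproduced with a small counting sketch. It assumes that the first layer sees only its own section and that each further layer adds the immediately preceding and following sections; the function name `visible_sections` is hypothetical.

```python
def visible_sections(num_sections: int, layer: int) -> list[int]:
    """Count the sections each section can draw on at a given layer (1-indexed).

    Assumes layer 1 sees only its own section and every further layer
    reaches one more neighbour in each direction.
    """
    reach = layer - 1  # how many sections away information can have travelled
    counts = []
    for i in range(num_sections):
        first = max(0, i - reach)
        last = min(num_sections - 1, i + reach)
        counts.append(last - first + 1)
    return counts


print(visible_sections(3, 1))  # [1, 1, 1]
print(visible_sections(3, 2))  # [2, 3, 2]
```

With three sections, the second layer gives 2, 3, and 2 for S_1, S_2, and S_3, matching the example in the text.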

Meanwhile, the transformer of FIG. 2 described above can be constructed by the apparatus for constructing a transformer model for video story question answering shown in FIG. 3.

FIG. 3 is a functional block diagram of an apparatus for constructing a transformer model for video story question answering according to one embodiment.

Referring to FIG. 3, an apparatus 300 for constructing a transformer model for video story question answering according to one embodiment includes an input/output unit 310, a storage unit 320, and a control unit 330.

The input/output unit 310 may include an input unit for receiving input from a user and an output unit for displaying information such as the result of a task or the state of the apparatus 300 for constructing a transformer model for video story question answering. That is, the input/output unit 310 is a component for receiving video data including a plurality of consecutive video clips and question data for video question answering, and for outputting the result of the video story question answering. Here, a video clip may include a plurality of visual tokens (visual token) and text tokens (text token).

The storage unit 320 is a component capable of storing files and programs and may be made up of various types of memory. In particular, the storage unit 320 can store the data and programs that enable the control unit 330, described later, to construct a transformer model for video story question answering according to the algorithm presented below.

The control unit 330 is a component that includes at least one processor, such as a CPU, a GPU, or an Arduino, and can control the overall operation of the apparatus 300 for constructing a transformer model for video story question answering. That is, the control unit 330 can control the other components included in the apparatus 300 so that video story question answering is performed. By executing the program stored in the storage unit 320, the control unit 330 can perform the operations for constructing a transformer model for video story question answering according to the algorithm presented below. How the control unit 330 performs these operations is described later.

The following describes in detail the process by which the control unit 330 performs the method of constructing a transformer model for video story question answering according to one embodiment by executing the program stored in the storage unit 320.

According to one embodiment, the control unit 330 can construct a transformer model for video story question answering using Equation 2 described above.

The control unit 330 can cause the video story to be learned by receiving, through the separated encoders, the visual tokens and text tokens included in each video clip, for each video clip corresponding to a preset section; calculating hidden representations (hidden representation) of the lower layer (lower layer) for the preceding and succeeding video clips adjacent to each other; and using the calculated hidden representations to calculate a representation of the video data that takes the surrounding context into account. Here, the control unit 330 can learn the temporal order (temporal order) using a masked modality model (Masked Modality Model, hereinafter MMM) for each video clip. The MMM may extend the token-level masking technique proposed in the existing masked language model (Masked Language Model) to masking all of the tokens in a predetermined section. The MMM allows one modality to be generated from another modality, prevents the encoder from generating masked tokens (masked token) too easily from the surrounding tokens, and allows the alignment (alignment) between modalities to be learned. Here, the modalities may be video, text, and the like. Therefore, when this learning is performed using the contextual transformer according to one embodiment, the content of a segment (for example, video data separated by section) can be predicted from the preceding and following context, so that the natural flow of the story can be learned.
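The section-level masking described for the masked modality model can be sketched as follows. This is a minimal illustration, not the patent's implementation: the data layout (a list of dictionaries with "visual" and "text" token lists), the `[MASK]` placeholder, and the function name `mask_modality` are all assumptions.

```python
MASK = "[MASK]"


def mask_modality(segments, segment_index, modality):
    """Return a copy of `segments` in which every token of one modality
    in one segment is replaced by the mask placeholder."""
    # Copy each token list so the caller's data is left untouched.
    masked = [{name: list(tokens) for name, tokens in seg.items()} for seg in segments]
    target = masked[segment_index]
    target[modality] = [MASK] * len(target[modality])
    return masked


segments = [
    {"visual": ["v1", "v2"], "text": ["hello", "there"]},
    {"visual": ["v3", "v4"], "text": ["how", "are", "you"]},
    {"visual": ["v5"], "text": ["fine"]},
]

# Hide the whole text modality of the middle segment; its visual tokens and
# the neighbouring segments remain available as context for prediction.
masked = mask_modality(segments, 1, "text")
print(masked[1])
```

Because the whole span is hidden rather than single tokens, the masked content cannot be recovered from adjacent tokens of the same modality in the same segment.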

Meanwhile, the masked modality model (Masked Modality Model) can be trained by negative contrastive learning (Negative Contrastive Learning). Here, the masked modality model can be expressed as Equation 3 below.
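Equation 3 itself is not reproduced in this text. As a loose illustration of contrastive training in general, and not of the patent's Equation 3, the sketch below computes an InfoNCE-style loss in which a predicted vector for a masked segment is scored against the true segment (positive) and other segments (negatives). The dot-product scoring and all names are assumptions.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def contrastive_loss(predicted, positive, negatives):
    """Negative log-softmax of the positive's score among all candidates."""
    scores = [dot(predicted, positive)] + [dot(predicted, n) for n in negatives]
    peak = max(scores)  # subtract the maximum for numerical stability
    log_denominator = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return log_denominator - scores[0]


good = contrastive_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
bad = contrastive_loss([0.0, 1.0], [1.0, 0.0], [[0.0, 1.0]])
print(good < bad)
```

The loss is lower when the prediction is closer to the true segment than to the negatives, which is the property a contrastive objective relies on.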

FIG. 4 is a flowchart for explaining a method of constructing a transformer model for video story question answering according to one embodiment.

The method of constructing a transformer model for video story question answering according to the embodiment shown in FIG. 4 includes steps processed in time series by the apparatus 300 for constructing a transformer model for video story question answering shown in FIGS. 2 and 3. Therefore, even where omitted below, what has been described above about the apparatus 300 shown in FIGS. 2 and 3 also applies to the method according to the embodiment shown in FIG. 4.

Referring to FIG. 4, the apparatus 300 for constructing a transformer model for video story question answering can receive video data including a plurality of consecutive video clips and question data for video question answering (S410). Here, a video clip may include a plurality of visual tokens (visual token) and text tokens (text token).

The apparatus 300 for constructing a transformer model for video story question answering can cause a video story to be learned from the video data received in step S410, which includes a plurality of consecutive video clips, by considering the context of the preceding and succeeding video clips that are adjacent to each other in temporal order (S420). The apparatus 300 can cause the video story to be learned by receiving as input, through the separated encoders, the visual tokens and text tokens included in each video clip, for each video clip corresponding to a preset section; calculating hidden representations (hidden representation) of the lower layer (lower layer) for the preceding and succeeding video clips adjacent to each other; and using the calculated hidden representations to calculate a representation of the video data that takes the surrounding context into account. Here, the apparatus 300 can learn the temporal order (temporal order) using a masked modality model (Masked Modality Model, MMM) for each video clip. The MMM may extend the token-level masking technique proposed in the existing masked language model (Masked Language Model) to masking all of the tokens in a predetermined section. The MMM allows one modality to be generated from another modality, prevents the encoder from generating masked tokens (masked token) too easily from the surrounding tokens, and allows the alignment (alignment) between modalities to be learned. Meanwhile, the masked modality model can be trained by negative contrastive learning (Negative Contrastive Learning). Here, the masked modality model can be expressed as Equation 3 described above.
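The idea in step S420 of computing a context-aware representation from the lower-layer hidden representations of adjacent clips can be illustrated with a toy layer. Simple averaging stands in for the cross-modal encoder here, which is an assumption made purely for illustration; the function name `contextual_layer` and the example vectors are hypothetical.

```python
def contextual_layer(hidden):
    """Compute one layer of section representations.

    Each section's new vector is the average of its own lower-layer vector
    and those of the sections immediately before and after it (where they
    exist). Averaging is a placeholder for a learned encoder.
    """
    updated = []
    for i in range(len(hidden)):
        neighbours = hidden[max(0, i - 1): i + 2]  # previous, current, next
        dim = len(hidden[i])
        updated.append([sum(v[d] for v in neighbours) / len(neighbours)
                        for d in range(dim)])
    return updated


lower = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]  # one vector per section
upper = contextual_layer(lower)
print(upper)
```

Stacking the layer a second time lets information from the first section reach the third, which is how context widens with depth without any single layer attending over the whole video.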

The term "unit" used in the above embodiments means software or a hardware component such as an FPGA (field programmable gate array) or an ASIC, and a "unit" performs certain roles. However, a "unit" is not limited to software or hardware. A "unit" may be configured to reside on an addressable storage medium and may be configured to execute on one or more processors. Thus, as an example, a "unit" includes components such as software components, object-oriented software components, class components, and task components, as well as processes, functions, attributes, procedures, subroutines, segments of program code, drivers, firmware, microcode, circuits, data, databases, data structures, tables, arrays, and variables.

The functions provided within the components and "units" may be combined into a smaller number of components and "units" or further separated into additional components and "units".

In addition, the components and "units" may be implemented to execute on one or more CPUs in a device or a secure multimedia card.

Meanwhile, the method of constructing a transformer model for video story question answering according to one embodiment described in this specification may also be embodied in the form of a computer-readable medium that stores computer-executable instructions and data. The instructions and data may be stored in the form of program code and, when executed by a processor, may generate a predetermined program module to perform a predetermined operation. The computer-readable medium may be any available medium accessible by a computer and includes volatile and non-volatile media and removable and non-removable media. The computer-readable medium may also be a computer recording medium, which may include volatile and non-volatile, removable and non-removable media implemented by any method or technology for storing information such as computer-readable instructions, data structures, program modules, or other data. For example, the computer recording medium may be a magnetic storage medium such as an HDD or SSD, an optical recording medium such as a CD, DVD, or Blu-ray disc, or memory included in a server accessible via a network.

The method of constructing a transformer model for video story question answering according to an embodiment described in this specification may also be implemented as a computer program (or computer program product) including computer-executable instructions. The computer program includes programmable machine instructions processed by a processor, and may be implemented in a high-level programming language, an object-oriented programming language, an assembly language, a machine language, or the like. The computer program may also be recorded on a tangible computer-readable recording medium (for example, a memory, a hard disk, a magnetic/optical medium, or an SSD (Solid-State Drive)).

Therefore, the method of constructing a transformer model for video story question answering according to an embodiment described in this specification may be implemented by executing the computer program described above on a computing device. The computing device may include at least some of a processor, a memory, a storage device, a high-speed interface connected to the memory and a high-speed expansion port, and a low-speed interface connected to a low-speed bus and the storage device. These components are interconnected using various buses and may be mounted on a common motherboard or attached in any other suitable manner.

Here, the processor can process instructions within the computing device. Such instructions include, for example, instructions stored in the memory or the storage device for displaying graphic information that provides a GUI (Graphical User Interface) on an external input/output device, such as a display connected to the high-speed interface. In other embodiments, multiple processors and/or multiple buses may be used, as appropriate, together with multiple memories and types of memory. The processor may also be implemented as a chipset made up of chips that include multiple independent analog and/or digital processors.

The memory stores information within the computing device. As one example, the memory may consist of a volatile memory unit or a set of such units. As another example, the memory may consist of a non-volatile memory unit or a set of such units. The memory may also be another form of computer-readable medium, such as a magnetic or optical disk.

The storage device can provide a large amount of storage space for the computing device. The storage device may be a computer-readable medium or a configuration that includes such a medium. For example, it may include devices in a SAN (Storage Area Network) or other configurations, and may be a floppy disk device, a hard disk device, an optical disk device, a tape device, a flash memory, or another similar semiconductor memory device or device array.

The embodiments described above are illustrative, and a person of ordinary skill in the art to which they belong will understand that they can easily be modified into other specific forms without changing their technical idea or essential features. The embodiments described above must therefore be understood as illustrative in all respects and not restrictive. For example, each component described as a single unit may be implemented in a distributed manner, and likewise, components described as distributed may be implemented in a combined form.

The scope of protection sought by this specification is defined by the claims below rather than by the detailed description above, and must be construed to include all changes and modifications derived from the meaning and scope of the claims and their equivalents.

300 Apparatus for constructing a transformer model for video story question answering
310 Input/output unit
320 Storage unit
330 Control unit

Claims

1. An apparatus for building a transformer model for video story question answering, the apparatus comprising:
an input/output unit for receiving video data including a plurality of consecutive video clips and question data for video question answering, and for outputting a video story question answering result;
a storage unit for storing a program and data for performing video story question answering; and
a control unit comprising at least one processor and building a transformer model for video story question answering by executing the program,
wherein the control unit learns a video story from the video data including the plurality of consecutive video clips by considering the context of the adjacent preceding and following video clips according to temporal order.

2. The apparatus of claim 1, wherein each video clip includes a plurality of visual tokens and text tokens, and
wherein the control unit receives, as input through respective separate encoders, the visual tokens and text tokens included in each video clip corresponding to a preset section, calculates a hidden representation of a lower layer of the adjacent video clips, and learns the video story by using the calculated hidden representation to calculate a representation of the video data that considers the preceding and following context.

3. The apparatus of claim 1, wherein the control unit learns temporal order using a masked modality model for each video clip.

4. The apparatus of claim 3, wherein the Masked Modality Model is learned by Negative Contrastive Learning.

5. A method for building a transformer model for video story question answering, performed by an apparatus for building a transformer model for video story question answering, the method comprising:
receiving video data including a plurality of consecutive video clips and question data for video question answering; and
learning a video story from the video data including the plurality of consecutive video clips by considering the context of the adjacent preceding and following video clips according to temporal order.

6. The method of claim 5, wherein each video clip includes a plurality of visual tokens and text tokens, and
wherein the step of learning the video story comprises receiving, as input through respective separate encoders, the visual tokens and text tokens included in each video clip corresponding to a preset section, calculating a hidden representation of a lower layer of the adjacent video clips, and learning the video story by using the calculated hidden representation to calculate a representation of the video data that considers the preceding and following context.

7. The method of claim 5, wherein the step of learning the video story further comprises learning temporal order using a masked modality model for each video clip.

8. The method of building a transformer model for video story question answering of claim 7, wherein the Masked Modality Model is learned by Negative Contrastive Learning.

9. A computer-readable recording medium recording a program for performing the method of claim 5.

10. A computer program recorded on a recording medium, to be executed by an apparatus for building a transformer model for video story question answering, for performing the method of claim 5.
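
As a non-limiting illustration of the learning steps recited in the claims, the following minimal Python sketch shows one way the two ideas could fit together: each clip's tokens pass through a per-modality encoder, each clip's vector is mixed with those of its preceding and following clips, and a masked modality is recovered with a contrastive loss against negatives from the other clips. Everything in it is an assumption made for illustration and is not taken from the specification: mean pooling stands in for the separate encoders, the attention window of one clip on each side stands in for the context computation, an InfoNCE-style loss stands in for the negative contrastive learning, and all function and variable names are hypothetical.

```python
import math

def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))

def mean_pool(tokens):
    """Stand-in encoder (assumption): average a clip's token vectors."""
    return [sum(col) / len(tokens) for col in zip(*tokens)]

def contextualize(hidden):
    """Mix each clip's lower-layer vector with its previous and next clip.

    Clip t attends over the window {t-1, t, t+1} (shortened at the ends),
    using scaled dot-product scores with clip t as the query.
    """
    scale = math.sqrt(len(hidden[0]))
    out = []
    for t, query in enumerate(hidden):
        window = hidden[max(0, t - 1): t + 2]
        scores = [dot(query, h) / scale for h in window]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append([dot(weights, col) for col in zip(*window)])
    return out

def masked_text_loss(visual, text, masked, temperature=0.1):
    """Contrastive loss for recovering the masked text of clip `masked`.

    Hypothetical predictor: the context-aware visual vector of the masked
    clip. Positive: that clip's own text vector. Negatives: the text
    vectors of every other clip in the sequence.
    """
    prediction = contextualize(visual)[masked]
    candidates = [text[masked]] + [v for i, v in enumerate(text) if i != masked]
    logits = [dot(prediction, c) / temperature for c in candidates]
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_norm - logits[0]  # negative log-probability of the positive

# Toy data (made up): three consecutive clips with 2-d tokens per modality.
visual_tokens = [[[1.0, 0.0], [0.8, 0.2]], [[0.0, 1.0]], [[-1.0, 0.0], [-0.8, 0.2]]]
text_tokens = [[[1.0, 0.0]], [[0.0, 1.0]], [[-1.0, 0.0]]]
visual = [mean_pool(v) for v in visual_tokens]  # visual encoder
text = [mean_pool(t) for t in text_tokens]      # separate text encoder

aligned = masked_text_loss(visual, text, masked=0)
misordered = masked_text_loss(visual, text[::-1], masked=0)
print(aligned < misordered)
```

In this toy setting the loss is lower when the clip texts appear in their original order than when they are reversed, which is the signal a model trained this way could use to pick up temporal order. A practical implementation would replace the pooling and the fixed predictor with trained transformer layers.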