JP2020149685A

JP2020149685A - Visual question answering model, electronic device, and storage medium

Info

Publication number: JP2020149685A
Application number: JP2020041593A
Authority: JP
Inventors: Jianhui Huang; Min Qiao; Pingping Huang; Yong Zhu; Yajuan Lyu; Ying Li
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2019-03-12
Filing date: 2020-03-11
Publication date: 2020-09-17
Also published as: KR20200110154A; EP3709207A1; KR102403108B1; US20200293921A1; CN109902166A

Abstract

To provide a visual question answering model, an electronic device, and a storage medium that enable visual question answering by combining image information and text question information. SOLUTION: A visual question answering model is provided, including: a text encoder configured to perform pooling on a word vector sequence of an input question text to extract a semantic representation vector of the question text; and an image encoder configured to extract an image feature of a given image in combination with the semantic representation vector. SELECTED DRAWING: Figure 1

Description

Embodiments of the present invention relate to the field of artificial intelligence technology, and more specifically to a visual question answering model, an electronic device, and a storage medium.

Visual question answering (hereinafter abbreviated as VQA) is a typical application of multi-modality fusion. For example, given an image showing a batter dressed in red and the related question "what color shirt is the batter wearing", a VQA system must combine the image information with the text question information to predict the answer "red". This process mainly involves extracting semantic features from the image and the text and fusing the extracted features of the two modalities. The encoding part of a VQA model therefore consists mainly of a text encoder and an image encoder.

However, because an image encoder and a text encoder must both be used, a VQA model often contains a large number of parameters that need to be trained, which makes the training time of the model very long. How to simplify the model from an engineering standpoint and improve its training efficiency without significantly reducing its accuracy is therefore a technical problem that currently needs to be solved.

Embodiments of the present invention provide a visual question answering model, an electronic device, and a storage medium that simplify the model from an engineering standpoint and improve its training efficiency without significantly reducing the accuracy of the visual question answering model.

In a first aspect, an embodiment of the present invention provides a visual question answering model including: a text encoder for performing pooling on the word vector sequence of an input question text to extract a semantic representation vector of the question text; and an image encoder for extracting image features of a given image in combination with the semantic representation vector.

In a second aspect, an embodiment of the present invention further provides an electronic device including one or more processors and a memory for storing one or more programs, wherein, when the one or more programs are executed by the one or more processors, the one or more processors run the visual question answering model according to any embodiment of the present invention.

In a third aspect, an embodiment of the present invention provides a computer-readable storage medium storing a computer program, wherein, when the program is executed by a processor, the visual question answering model according to any embodiment of the present invention is run.

Embodiments of the present invention provide a visual question answering model, an electronic device, and a storage medium. By encoding the text vectors with pooling, the visual question answering model is simplified. Because pooling is a simple encoding method, it reduces the number of parameters that need to be trained in the model and effectively improves its training efficiency, which is beneficial for engineering use.

FIG. 1 is a schematic configuration diagram of the visual question answering model according to Example 1 of the present invention.
FIG. 2 is a schematic configuration diagram of another visual question answering model according to Example 2 of the present invention.
FIG. 3 is a schematic configuration diagram of the electronic device according to Example 3 of the present invention.

Hereinafter, the present invention is described in more detail with reference to the drawings and examples. It should be understood that the specific examples described herein serve only to explain the present invention and do not limit it. For brevity, the drawings show only some, not all, of the configurations related to the present invention.

Example 1
FIG. 1 shows a visual question answering model according to Example 1 of the present invention. This example improves the training efficiency of the visual question answering model by simplifying it. The model can be run on an electronic device such as a computer terminal or a server.

As shown in FIG. 1, the visual question answering model according to this example may include a text encoder for performing pooling on the word vector sequence of an input question text to extract a semantic representation vector of the question text.

Before the question text is encoded, it needs to be preprocessed. For example, the question text is processed by a word2vec model or a GloVe model to obtain the word vector sequence corresponding to the question text. To encode the question text, this word vector sequence is input to the text encoder, which performs pooling on it to extract the semantic representation vector of the question text. In the prior art, an LSTM (Long Short-Term Memory) model or a Bi-LSTM (Bi-directional Long Short-Term Memory) model is used as the text encoder. In the present application, pooling is used as the text encoder in place of the LSTM or Bi-LSTM model, which simplifies the visual question answering model.
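To make the simplification concrete, the following sketch counts the trainable parameters of a single-layer LSTM text encoder with the standard formula 4(dh + h^2 + h) and compares the result with pooling, which has no trainable parameters. The sizes used (300-dimensional word vectors, 512 hidden units) are illustrative assumptions and do not come from this publication; some libraries also keep two bias vectors per gate, which would raise the count slightly.

```python
def lstm_param_count(input_dim: int, hidden_dim: int) -> int:
    """Trainable parameters of one standard LSTM layer (one bias per gate).

    Each of the 4 gates has an input weight matrix (hidden_dim x input_dim),
    a recurrent weight matrix (hidden_dim x hidden_dim) and a bias vector.
    """
    per_gate = hidden_dim * input_dim + hidden_dim * hidden_dim + hidden_dim
    return 4 * per_gate


def pooling_param_count() -> int:
    """Max or average pooling over word vectors has no trainable parameters."""
    return 0


# Hypothetical sizes: 300-dimensional word vectors, 512 hidden units.
lstm = lstm_param_count(300, 512)
bi_lstm = 2 * lstm  # a Bi-LSTM runs one LSTM per direction
print(lstm, bi_lstm, pooling_param_count())  # 1665024 3330048 0
```

Under these assumed sizes the text encoder alone accounts for well over a million parameters when an LSTM is used, and none when pooling is used.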

In this example, the pooling is max pooling (maxPooling), which is expressed by the following equation.
f(w1, w2, ..., wk) = max([w1, w2, ..., wk], dim = 1)

Here, f denotes the max pooling function, k is the number of word vectors contained in the question text, wi is the i-th word vector obtained by processing the question text with a pre-trained word vector model, and i is a natural number in [1, k]. max([w1, w2, ..., wk], dim = 1) takes, for each dimension, the maximum value across the word vectors w1, w2, ..., wk. dim = 1 specifies the axis: for a given two-dimensional matrix, the maximum is taken row by row.

As an example, suppose the word vector sequence of one question text is the matrix [w1, w2, ..., wk] (the concrete values are not reproduced in this text). Applying max pooling to this sequence by the above equation yields a single vector, and that vector is the semantic representation vector of the question text. Because max pooling has no trainable parameters of its own, it reduces the number of parameters that need to be trained in the visual question answering model and improves the model's training efficiency.
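The max pooling equation can be illustrated with a short sketch in plain Python. The function name `max_pool` and the three 4-dimensional word vectors are made up for illustration, since the concrete example values are not available in this text.

```python
from typing import List, Sequence


def max_pool(word_vectors: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise maximum over k word vectors of equal dimension."""
    if not word_vectors:
        raise ValueError("need at least one word vector")
    dim = len(word_vectors[0])
    if any(len(w) != dim for w in word_vectors):
        raise ValueError("all word vectors must have the same dimension")
    # zip(*...) groups the values of each dimension across all word vectors.
    return [max(values) for values in zip(*word_vectors)]


# Made-up 4-dimensional vectors for a 3-word question (not from the source).
w1 = [0.1, 0.2, 0.3, 0.4]
w2 = [0.5, 0.1, 0.0, 0.2]
w3 = [0.3, 0.9, 0.2, 0.1]
print(max_pool([w1, w2, w3]))  # [0.5, 0.9, 0.3, 0.4]
```

The output has the same dimension as a single word vector regardless of the question length k, which is what lets it serve as a fixed-size semantic representation vector.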

The image encoder of the visual question answering model in this embodiment is used to extract image features of a given image in combination with the semantic representation vector.

Because an image contains background and rich content, a visual attention mechanism (Attention in FIG. 1) can be adopted so that the machine attends to the image content related to the question, which improves the accuracy of the answer. Through the attention mechanism, the image encoder uses the semantic representation vector of the question text obtained by the text encoder to narrow down the image content most relevant to that vector, and extracts the image features of that content to obtain an image feature vector. For the image encoder, a convolutional neural network model such as a Faster R-CNN model can be adopted.
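The publication does not specify the form of the attention mechanism. As a minimal sketch, assuming dot-product scoring between the question vector and a set of region feature vectors of the same dimension (both the scoring rule and the helper names `softmax` and `attend` are assumptions made here), question-guided attention could look like this:

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(question_vec: Sequence[float],
           region_feats: Sequence[Sequence[float]]) -> Tuple[List[float], List[float]]:
    """Weight image regions by their relevance to the question.

    Returns the attention weights and the weighted sum of region features.
    """
    # Relevance score of each region: dot product with the question vector.
    scores = [sum(q * r for q, r in zip(question_vec, region))
              for region in region_feats]
    weights = softmax(scores)
    dim = len(region_feats[0])
    attended = [sum(w * region[d] for w, region in zip(weights, region_feats))
                for d in range(dim)]
    return weights, attended


# Hypothetical 2-dimensional question vector and two region features.
weights, attended = attend([1.0, 0.0], [[5.0, 0.0], [0.0, 5.0]])
print(weights, attended)
```

In this toy input the first region aligns with the question vector, so it receives almost all of the attention weight and dominates the resulting image feature vector.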

Further, as shown in FIG. 1, the visual question answering model also includes a feature fuser (fusion) for fusing features of different modalities. In this example, the feature fuser fuses the image feature vector output by the image encoder with the semantic representation vector output by the text encoder. For example, the two vectors can be fused by a dot product.
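The text names a dot product as one fusion option. A strict dot product would reduce the two vectors to a single scalar, so the sketch below assumes the element-wise product, which keeps a vector that a classifier can consume. This reading, the function name `fuse`, and the sample values are assumptions, not details confirmed by the publication.

```python
from typing import List, Sequence


def fuse(image_vec: Sequence[float], text_vec: Sequence[float]) -> List[float]:
    """Fuse two modality vectors by element-wise multiplication."""
    if len(image_vec) != len(text_vec):
        raise ValueError("both vectors must have the same dimension")
    return [a * b for a, b in zip(image_vec, text_vec)]


# Hypothetical 3-dimensional image feature and semantic representation vectors.
print(fuse([1.0, 2.0, 3.0], [2.0, 0.5, 1.0]))  # [2.0, 1.0, 3.0]
```

Like pooling, this fusion step adds no trainable parameters, which fits the stated goal of keeping the model simple.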

The visual question answering model further includes a classifier. The classifier processes the vector output by the feature fuser with a softmax function (also called the normalized exponential function) to obtain the relative probabilities of the different answers, and outputs the answer with the highest relative probability.
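The classifier step can be sketched as a softmax followed by an arg-max. The sketch assumes the classifier input has already been mapped to one score per candidate answer; the candidate answers and the scores are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Normalized exponential function, shifted by the max for stability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(logits: Sequence[float], answers: Sequence[str]) -> Tuple[str, float]:
    """Return the answer with the highest relative probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return answers[best], probs[best]


# Hypothetical scores for three candidate answers.
answer, prob = predict([2.0, 0.5, -1.0], ["red", "blue", "white"])
print(answer, round(prob, 3))
```

Because softmax is monotonic, the predicted answer is the one with the largest score; the probabilities matter mainly for training and for reporting confidence.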

In one specific implementation, the Visual Genome dataset released by the Stanford Artificial Intelligence Laboratory is used as training sample data and validation data for the above model. The data are randomly split into training and validation sets at a ratio of 2:1, and the model is trained and validated on them. Table 1 shows the statistics of the dataset. Each image has a certain number of questions, and the reference answers are labeled manually.
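A random 2:1 split of this kind can be sketched as follows. The function name, the fixed seed, and the integer stand-ins for samples are assumptions made for illustration; the publication does not describe how the split was implemented.

```python
import random
from typing import Iterable, List, Tuple, TypeVar

T = TypeVar("T")


def split_two_to_one(samples: Iterable[T], seed: int = 0) -> Tuple[List[T], List[T]]:
    """Shuffle the samples and split them 2:1 into training and validation sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    cut = (2 * len(shuffled)) // 3
    return shuffled[:cut], shuffled[cut:]


# Nine placeholder samples stand in for (image, question, answer) triples.
train, val = split_two_to_one(range(9))
print(len(train), len(val))  # 6 3
```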

The above data are used to train and validate the visual question answering model of this example. Specifically, the model can be run on a P40 cluster; Table 2 shows the environment configuration of the P40 cluster and the basic parameters of the model. For comparison, prior-art visual question answering models using LSTM or Bi-LSTM as the text encoder are trained and validated at the same time. Table 3 shows the results.

The validation results in Table 3 show that, compared with conventional visual question answering models that use LSTM or Bi-LSTM as the text encoder, the model of this embodiment, which uses max pooling as the text encoder, loses only about 0.5% in prediction accuracy while its run time is shortened by up to 3 hours, so training efficiency improves substantially.

In this embodiment, the visual question answering model encodes the text vectors with pooling, which simplifies the model. With this simple encoding method, the training efficiency of the model is effectively improved without a significant loss in prediction accuracy, which is beneficial for engineering use.

Example 2
FIG. 2 is a schematic configuration diagram of another visual question answering model according to this example. As shown in FIG. 2, the visual question answering model includes a text encoder for performing pooling on the word vector sequence of an input question text to extract a semantic representation vector of the question text.

Here, the pooling is average pooling (avgPooling), which can be expressed by the following equation.

$$p(w_1, w_2, \ldots, w_k) = \frac{1}{k}\sum_{i=1}^{k} w_i$$

Here, p denotes the average pooling function, k is the number of word vectors contained in the question text, $w_i$ is the i-th word vector obtained by processing the question text with a pre-trained word vector model, i is a natural number in [1, k], and $\sum_{i=1}^{k} w_i$ denotes the dimension-wise sum of the word vectors $w_1, w_2, \ldots, w_k$.

As an example, suppose the word vector sequence of one question text is the matrix $[w_1, w_2, \ldots, w_k]$ (the concrete values are not reproduced in this text). Applying average pooling to this sequence by the above equation yields a single vector, and that vector is the semantic representation vector of the question text. Because average pooling has no trainable parameters of its own, it reduces the number of parameters that need to be trained in the visual question answering model and improves the model's training efficiency.
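Average pooling as described here, the dimension-wise sum of the k word vectors divided by k, can be sketched in plain Python. The function name `avg_pool` and the sample vectors are made up for illustration.

```python
from typing import List, Sequence


def avg_pool(word_vectors: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean over k word vectors of equal dimension."""
    if not word_vectors:
        raise ValueError("need at least one word vector")
    dim = len(word_vectors[0])
    if any(len(w) != dim for w in word_vectors):
        raise ValueError("all word vectors must have the same dimension")
    k = len(word_vectors)
    # Sum each dimension across the word vectors, then divide by k.
    return [sum(values) / k for values in zip(*word_vectors)]


# Made-up 2-dimensional vectors for a 2-word question (not from the source).
print(avg_pool([[1.0, 2.0], [3.0, 6.0]]))  # [2.0, 4.0]
```

Unlike max pooling, every word vector contributes to every output dimension, so the result reflects the question as a whole rather than its most extreme feature values.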

The visual question answering model further includes a feature fuser and a classifier. For their details, see the preceding example; they are not described again here.

The visual question answering model of this example is trained and validated on the Visual Genome dataset of the preceding example, using the P40 cluster described there. Prior-art visual question answering models using LSTM or Bi-LSTM as the text encoder are trained and validated at the same time. Table 4 shows the results.

Table 4 shows that, compared with conventional visual question answering models that use LSTM or Bi-LSTM as the text encoder, the model of this embodiment, which uses average pooling as the text encoder, loses only about 0.4% in prediction accuracy while its run time is shortened by up to 2.4 hours, so training efficiency improves substantially.

In this embodiment, the visual question answering model encodes the text vectors with average pooling, which simplifies the model. With this simple encoding method, the training efficiency of the model is effectively improved without a significant loss in prediction accuracy, which is beneficial for engineering use.

Example 3
FIG. 3 is a schematic configuration diagram of an electronic device according to Example 3 of the present invention. FIG. 3 shows a block diagram of an exemplary electronic device 12 suitable for implementing embodiments of the present invention. The electronic device 12 shown in FIG. 3 is merely an example and should not impose any limitation on the functions or scope of use of the embodiments of the present application.

As shown in FIG. 3, the electronic device 12 takes the form of a general-purpose computing device. Its components include, but are not limited to, one or more processors or processing units 16, a memory 28, and a bus 18 that connects the different system components (including the memory 28 and the processor 16).

The bus 18 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. For example, these architectures include, but are not limited to, an Industry Standard Architecture (ISA) bus, a Micro Channel Architecture (MCA) bus, an Enhanced ISA bus, a Video Electronics Standards Association (VESA) local bus, and a Peripheral Component Interconnect (PCI) bus.

The electronic device 12 typically includes a variety of computer-system-readable media. These media may be any available media that can be accessed by the electronic device 12, including volatile and non-volatile media and removable and non-removable media.

The memory 28 may include computer-system-readable media in the form of volatile memory, such as a random access memory (RAM) 30 and/or a cache memory 32. The electronic device 12 may further include other removable/non-removable, volatile/non-volatile computer system storage media. By way of example only, a storage system 34 can be used to read from and write to a non-removable, non-volatile magnetic medium (not shown in FIG. 3, commonly called a "hard disk drive"). Although not shown in FIG. 3, a magnetic disk drive for reading from and writing to a removable non-volatile magnetic disk (for example, a "floppy disk") and an optical disk drive for reading from and writing to a removable non-volatile optical disk (for example, a CD-ROM, a DVD-ROM, or other optical media) can also be provided. In these cases, each drive can be connected to the bus 18 via one or more data media interfaces. The memory 28 may include at least one program product having a set of (for example, at least one) program modules configured to perform the functions of the embodiments of the present disclosure.

A program/utility 40 having a set of (at least one) program modules 42 may be stored, for example, in the memory 28. Such program modules 42 include, but are not limited to, an operating system, one or more application programs, other program modules, and program data. Each of these examples, or some combination of them, may include an implementation of a networking environment. The program modules 42 generally perform the functions and/or methods of the embodiments described in the present disclosure.

The electronic device 12 can communicate with one or more external devices 200 (for example, a keyboard, a pointing device, a display 24), with one or more devices that enable a user to interact with the electronic device 12, and/or with any device (for example, a network card or a modem) that enables the electronic device 12 to communicate with one or more other computing devices. Such communication can take place via an input/output (I/O) interface 22. The electronic device 12 can also communicate with one or more networks (for example, a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) via a network adapter 20. As shown in the figure, the network adapter 20 communicates with the other modules of the electronic device 12 via the bus 18. Although not shown in the figure, other hardware and/or software modules can be used in combination with the electronic device 12, including, but not limited to, microcode, device drivers, redundant processors, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.

The processor 16 executes the programs stored in the memory 28 to perform various functional applications and data processing, for example to implement the visual question answering model of the preceding examples. The visual question answering model includes: a text encoder for performing pooling on the word vector sequence of an input question text to extract a semantic representation vector of the question text; and an image encoder for extracting image features of a given image in combination with the semantic representation vector.

Example 4
Example 4 of the present invention provides a computer-readable storage medium that stores the visual question answering model according to the embodiments of the present invention, which is run by a computer processor. The visual question answering model includes: a text encoder for performing pooling on the word vector sequence of an input question text to extract a semantic representation vector of the question text; and an image encoder for extracting image features of a given image in combination with the semantic representation vector.

Of course, the computer-readable storage medium provided in this embodiment can also be used to run the visual question answering model provided in any embodiment of the present invention.

The computer storage medium of the embodiments of the present invention can be any combination of one or more computer-readable media. A computer-readable medium can be a computer-readable signal medium or a computer-readable storage medium. A computer-readable storage medium can be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples (a non-exhaustive list) of computer-readable storage media include an electrical connection with one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In this document, a computer-readable storage medium can be any tangible medium that contains or stores a program for use by, or in combination with, an instruction execution system, apparatus, or device.

A computer-readable signal medium can include a data signal propagated in baseband or as part of a carrier wave, which carries computer-readable program code. Such a propagated data signal can take many forms, including, but not limited to, an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium can also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by, or in combination with, an instruction execution system, apparatus, or device.

Program code contained on a computer-readable medium can be transmitted over any suitable medium, including, but not limited to, wireless, wired, optical cable, RF, or any suitable combination of the above.

Computer program code for performing the operations of the present invention can be written in one or more programming languages or a combination thereof. These include object-oriented programming languages such as Java (registered trademark), Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar languages. The program code may run entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the remote case, the remote computer can be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computer (for example, via the Internet using an Internet service provider).

It should be noted that the above describes only preferred embodiments of the present invention and the technical principles applied to them. Those skilled in the art will understand that the present invention is not limited to the specific embodiments described herein, and that various modifications, rearrangements, and substitutions can be made without departing from the scope of the invention. Therefore, although the present invention has been described in detail through the above embodiments, it is not limited to them and can include other equivalent embodiments without departing from its spirit. The scope of the present invention is determined by the claims.

Claims

1. A visual question answering model comprising:
a text encoder for performing pooling on the word vector sequence of an input question text to extract a semantic representation vector of the question text; and
an image encoder for extracting image features of a given image in combination with the semantic representation vector.

2. The model according to claim 1, wherein the text encoder extracts the semantic representation vector of the question text by performing max pooling or average pooling on the word vector sequence of the input question text.

3. The model according to claim 2, wherein the max pooling is expressed by the following equation:
f(w1, w2, ..., wk) = max([w1, w2, ..., wk], dim = 1)
where f denotes the max pooling function, k is the number of word vectors contained in the question text, wi is the i-th word vector obtained by processing the question text with a pre-trained word vector model, i is a natural number in [1, k], and max([w1, w2, ..., wk], dim = 1) denotes, for each dimension, the maximum value across the word vectors w1, w2, ..., wk.

4. The model according to claim 2, wherein the average pooling is expressed by the following equation:
$$p(w_1, w_2, \ldots, w_k) = \frac{1}{k}\sum_{i=1}^{k} w_i$$

5. An electronic device comprising:
one or more processors; and
a memory for storing one or more programs,
wherein, when the one or more programs are executed by the one or more processors, the one or more processors run the visual question answering model according to any one of claims 1 to 4.

6. A computer-readable storage medium storing a computer program, wherein, when the program is executed by a processor, the visual question answering model according to any one of claims 1 to 4 is run.