WO2024042650A1 - Training device, training method, and program - Google Patents

Training device, training method, and program

Info

Publication number
WO2024042650A1
WO2024042650A1 (PCT/JP2022/031921)
Authority
WO
WIPO (PCT)
Prior art keywords
image
text
feature amount
model
loss
Prior art date
Application number
PCT/JP2022/031921
Other languages
French (fr)
Japanese (ja)
Inventor
拓 長谷川
京介 西田
いつみ 斉藤
仙 吉田
Original Assignee
日本電信電話株式会社 (Nippon Telegraph and Telephone Corporation)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 日本電信電話株式会社 (Nippon Telegraph and Telephone Corporation)
Priority to PCT/JP2022/031921 priority Critical patent/WO2024042650A1/en
Publication of WO2024042650A1 publication Critical patent/WO2024042650A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning

Definitions

  • The present invention relates to a learning device, a learning method, and a program.
  • Technology that embeds the features of images and text in the same space and lets a computer understand them based on those features is becoming widespread, for example techniques that measure the distance between an image and a text in that space and use it as a search score to retrieve images from text or text from images (Non-Patent Document 1, Non-Patent Document 2).
  • image recognition models and language models are prepared respectively, and these are used to train the network parameters of neural networks on large-scale datasets consisting of related image and text pairs. This makes it possible to embed related images and texts close together in the same space, and such models have been well evaluated in the vision-and-language field, which utilizes visual and textual information.
  • Non-Patent Document 1 in particular has been reported to recognize characters written in images so strongly that, if unrelated characters (characters not contained in the text associated as the correct answer with a training image given during learning) are deliberately embedded in an image, the model reacts strongly to the unrelated characters and the visual information in the image can no longer be recognized correctly. The outcome of this phenomenon varies with the size and position of the characters in the image, but it suggests that the desired visual information may fail to be obtained not only under malicious character embedding but also from naturally occurring character information such as corporate logos and signboards, and this misrecognition may cause problems in actual operation.
  • for example, in an image search task using text as a query, if the image contains characters unrelated to the query, the system may react strongly to the unrelated characters and fail to correctly recognize the visual information in the image.
  • the present invention has been made in view of the above points, and its object is to reduce, for a text and image embedding model, the influence of unrelated character strings embedded in images.
  • To solve the above problem, the learning device includes a first acquisition unit that acquires feature quantities of a plurality of texts using a first model;
  • a second acquisition unit that acquires, for each text, using a second model, the feature quantity of a first image that is a positive example of relevance to the text and the feature quantity of a second image in which a character string not contained in the text is embedded in the first image; and
  • a learning unit that calculates a loss based on the feature quantity of the text, the feature quantity of the first image, and the feature quantity of the second image, and updates the parameters of the first model and the second model based on the loss.
  • FIG. 1 is a diagram showing an example of the hardware configuration of a search device 10 according to an embodiment of the present invention.
  • FIG. 2 is a diagram showing an example of the functional configuration of the search device 10 according to the embodiment of the present invention.
  • FIG. 3 is a flowchart for explaining an example of the processing procedure of the model parameter learning process.
  • FIG. 4 is a diagram for explaining how the softmax output values and losses for text and images are calculated for each of the positive and negative examples.
  • In this embodiment, a search device 10 that executes a search task is described, taking as an example the task of extracting images related to a text search query from a given set of search-target images.
  • FIG. 1 is a diagram showing an example of the hardware configuration of a search device 10 in an embodiment of the present invention.
  • the search device 10 in FIG. 1 includes a drive device 100, an auxiliary storage device 102, a memory device 103, a processor 104, an interface device 105, etc., which are interconnected by a bus B.
  • a program that realizes the processing in the search device 10 is provided by a recording medium 101 such as a CD-ROM.
  • the program is installed from the recording medium 101 to the auxiliary storage device 102 via the drive device 100.
  • the program does not necessarily need to be installed from the recording medium 101, and may be downloaded from another computer via a network.
  • the auxiliary storage device 102 stores installed programs as well as necessary files, data, and the like.
  • the memory device 103 reads and stores the program from the auxiliary storage device 102 when there is an instruction to start the program.
  • the processor 104 is a CPU, a GPU (Graphics Processing Unit), or a CPU and a GPU, and executes functions related to the search device 10 according to a program stored in the memory device 103.
  • the interface device 105 is used as an interface for connecting to a network.
  • FIG. 2 is a diagram showing an example of the functional configuration of the search device 10 according to the embodiment of the present invention.
  • the search device 10 includes a search section 11 and a model learning section 12. Each of these units is realized by one or more programs installed in the search device 10 causing the processor 104 to execute the process.
  • the search unit 11 executes a search task.
  • the inputs to the search unit 11 are a search query Q and a set of search-target images {I_0, I_1, ..., I_m}; the outputs from the search unit 11 are an ordered set {I_0, I_1, ..., I_k} of images related to the search query Q (hereinafter referred to as "related images") and the degree of relevance of each related image to Q, {S_0, S_1, ..., S_k}.
  • m is the number of images to be searched, and k is the number of related images obtained by the search.
  • the search unit 11 includes a context encoding unit 111, an image encoding unit 112, and a ranking unit 113.
  • the context encoding unit 111 and the image encoding unit 112 are implemented using a neural network. All arithmetic processing in the neural network is performed based on learned parameters corresponding to each model (neural network).
  • the context encoding unit 111 takes as input a character string constituting an arbitrary sentence as the search query Q and, based on the parameters of the trained model serving as the context encoding unit 111, outputs (generates) a feature quantity u of the search query.
  • the concrete neural network model used as the context encoding unit 111 is not limited to any particular model as long as it encodes text information.
  • for example, the text encoder model used in Non-Patent Document 1 may be used. That model takes text as input and outputs a d-dimensional feature quantity, but any other context-aware pre-trained model using a transformer may be used instead.
  • a specific neural network model for the image encoding unit 112 is not limited to a specific model as long as it receives an image as an input and outputs a d-dimensional vector. However, the dimensions d of the output vectors of the context encoding unit 111 and the image encoding unit 112 need to match.
  • the image encoder model used in Non-Patent Document 1 may be used.
  • ResNet and ViT are prepared as models that input an image and output a d-dimensional feature amount. These or other models may be used.
  • the ranking unit 113 takes as input the feature quantity u output from the context encoding unit 111 for the search query Q and the set of feature quantities v_i output from the image encoding unit 112 for the search-target images I_i (hereinafter referred to as the "feature quantity set") {v_1, ..., v_i, ..., v_m}, and outputs the ordered set {I_1, ..., I_i, ..., I_k} of related images of the search query Q and the degree of relevance of each related image to Q, {S_1, ..., S_i, ..., S_k}.
  • the degree of relevance S_i between an image I_i and the search query Q is computed as S_i = f(u, v_i) using an appropriate distance function f; as a concrete implementation, f may be the reciprocal of the inner product of u and v_i.
  • the inputs of f are two vectors of the same dimension, and the output is a scalar.
  • a distance function that can measure the distance between vectors may be used as f.
  • the model learning unit 12 learns the model parameters of the context encoding unit 111 and the image encoding unit 112.
  • training data for the search task is collected in advance.
  • one image (a positive example image) randomly extracted from the image set I_i related to each T_i is taken as I_i^+, and (T_i, I_i^+) forms one piece of training data.
  • the learning data prepared in advance is a set of text and positive example images.
  • the model learning unit 12 updates the model parameters of the context encoding unit 111 and the image encoding unit 112 through supervised learning using such learning data. The model parameters of the context encoding unit 111 and the image encoding unit 112 are assumed to have been initialized in advance with appropriate initial values (when the model structure of Non-Patent Document 1 is used, the parameters of an existing trained model may be used as the initial parameters).
  • the model learning unit 12 updates the model parameters based on all the learning data, and repeats this an arbitrary number of times (this repeated process is called an "epoch", and the number of repetitions is called the "epoch number").
  • the parameter updating method may be similar to general learning of neural networks.
  • FIG. 3 is a flowchart for explaining an example of the processing procedure of the model parameter learning process.
  • step S101 the model learning unit 12 randomly divides a plurality of learning data into a plurality of mini-batches.
  • the model learning unit 12 executes loop processing L1 for every mini-batch.
  • the mini-batch to be processed in the loop processing L1 is referred to as a "target batch.”
  • the model learning unit 12 executes steps S102 to S109 for the target batch.
  • in step S102, the model learning unit 12 inputs the text T_i of each piece of learning data included in the target batch to the context encoding unit 111 and obtains, for each piece of learning data, the vector (feature quantity u_i) generated by the context encoding unit 111. For example, if the size of the mini-batch is 10, 10 feature quantities u_i are obtained.
  • the model learning unit 12 then inputs the image I_i^+ of each piece of learning data included in the target batch to the image encoding unit 112 and obtains, for each piece of learning data, the vector (feature quantity v_i^+) generated by the image encoding unit 112 (S103). For example, if the mini-batch size is 10, 10 feature quantities v_i^+ are obtained.
  • next, for each image of the learning data included in the target batch, with T' denoting the set of texts in the target batch, the model learning unit 12 selects, for an arbitrary text T^- ∈ T' - {T_i}, one character string s_k that is an arbitrary noun contained in T^- and not contained in T_i (S104).
  • here, T_i is the correct text, so T' - {T_i} is the set of texts in the target batch excluding the correct text, and T^- is one text from that set, i.e., one text in the target batch other than the correct text (and unrelated to the correct image). The character string s_k, an arbitrary noun contained in T^- and not contained in T_i, is therefore a noun that does not appear in the correct text.
  • for example, suppose the image I_i^+ (correct image) of a piece of learning data is an image of a dog, its text T_i is "A dog is running around in the park", and the text set T' of a mini-batch of size 10 is (1) "A dog is running around in the park", (2) "A cat is lying down in the garden", (3) ..., (10) ....
  • T' - {T_i} is then the nine texts (2) to (10) of the above T'.
  • T^- is one text (for example, "A cat is lying down in the garden") arbitrarily (randomly) selected from (2) to (10).
  • the character string s_k is, for example, a noun included in "A cat is lying down in the garden" and not included in "A dog is running around in the park". However, s_k need not be selected from T^-; it may instead be selected from the nouns included in an entire vocabulary set (a vocabulary set not limited to the learning data) that are not included in T_i.
  • the character string sk may be selected randomly.
  • a word vector may be obtained in advance using FastText or the like, and the word having the longest average distance from the word of the correct text in the space of the word vector may be selected as the character string s k .
  • the model learning unit 12 generates a negative example image I_ik^- by embedding (superimposing) the character string s_k associated with the learning data onto a copy of the image I_i^+ of that learning data, using a random font size that does not exceed the image size of I_i^+ (S105).
  • several candidate font sizes may be given in advance, or a minimum size may be defined and the font size then chosen continuously at random within that range.
  • one or more images I_i^- may be generated per piece of learning data; when a plurality of I_i^- are generated, a plurality of character strings s_k are selected for that learning data.
  • the model learning unit 12 inputs the image I_ik^- related to each piece of learning data of the target batch to the image encoding unit 112 and obtains the feature quantity v_ik^- generated by the image encoding unit 112 for each I_ik^- (S106).
  • at this point, the target batch holds the text feature quantity set {u_1, u_2, ..., u_b} (where b is the mini-batch size) and the image feature quantity set {v_1^+, v_2^+, ..., v_b^+, v_11^-, ..., v_bl^-} (where l is the number of negative example images).
  • model learning unit 12 calculates the output value of the softmax function of the inner product of text to image and image to text for each of the positive and negative examples (S107).
  • FIG. 4 is a diagram for explaining a method of calculating the output value and loss of the softmax function of text and image for each of the positive and negative examples.
  • in FIG. 4, the text feature quantity set {u_1, u_2, ..., u_b} is arranged in the column direction and the image feature quantity set {v_1^+, v_2^+, ..., v_b^+, v_11^-, ..., v_bl^-} is arranged in the row direction.
  • for the image-to-text direction, the model learning unit 12 computes the inner product x_ij = v_i · u_j between the feature quantity v_i of an image and each feature quantity u_j of the text feature quantity set {u_1, u_2, ..., u_b}, and computes the softmax function output value ("softmax output value") over each row of inner products; doing this for the positive image feature quantities yields the softmax output values corresponding to each positive row in FIG. 4.
  • the model learning unit 12 likewise computes the inner products between the feature quantity u_j of a text and the image feature quantity set {v_1^+, v_2^+, ..., v_b^+, v_11^-, ..., v_bl^-} and the corresponding softmax output values.
  • doing this for the whole text feature quantity set yields the (text-to-image) softmax output values corresponding to the inner products of all columns in FIG. 4.
  • the model learning unit 12 calculates a loss (softmax cross-entropy loss) for each positive row and each column in FIG. 4 based on the set of calculated softmax output values, and calculates the average or sum of these losses as the loss for the target batch (S108).
  • specifically, the cross-entropy loss function is H(p, q) = -Σ p(x) log q(x), where p(x) is the true distribution and q(x) is the predicted distribution.
  • the class labels are applied as the true distribution and the softmax output values are fitted to the predicted distribution to calculate the loss for each positive row or column.
  • the average or total sum of losses in the row direction is the image loss
  • the average or total sum of the losses in the column direction is the text loss.
  • the average or sum of the image loss and text loss is taken as the loss in the target batch.
  • when an existing trained model is used for the initial model parameters, the loss may be calculated using only the text-to-image softmax output values.
  • the model learning unit 12 updates the model parameters of the context encoding unit 111 and the image encoding unit 112 based on the loss for the target batch (S109). Specifically, the model learning unit 12 calculates the gradient of the loss with respect to each model parameter using backpropagation and updates the model parameters using an arbitrary optimization method.
  • model learning unit 12 determines whether a predetermined termination condition is satisfied (S110). If the termination condition is not satisfied (No in S110), the model learning unit 12 repeats steps S101 and subsequent steps. When the termination condition is satisfied (Yes in S110), the model learning unit 12 terminates the process of FIG. 3.
  • embedding models for text and images are trained using images in which character strings that are negative examples of (i.e., unrelated to) relevance to the text are embedded. Therefore, the influence of unrelated character strings embedded in images can be reduced for the text and image embedding model. As a result, it is possible to learn a model that is robust against, for example, adversarial character-embedding attacks.
  • the processor acquires feature quantities of a plurality of texts using a first model; acquires, for each text, using a second model, the feature quantity of a first image that is a positive example of relevance to the text and the feature quantity of a second image in which a character string not contained in the text is embedded in the first image; and calculates a loss based on the feature quantity of the text, the feature quantity of the first image, and the feature quantity of the second image, and updates the parameters of the first model and the second model based on the loss;
  • a learning device characterized by the above.
  • the search device 10 is an example of a learning device.
  • the model learning unit 12 is an example of a first acquisition unit, a second acquisition unit, and a learning unit.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Image Analysis (AREA)

Abstract

This training device includes: a first acquisition unit that uses a first model to acquire feature amounts for a plurality of pieces of text; a second acquisition unit that uses a second model to acquire, for each piece of text, a feature amount of a first image which is a positive example regarding the relevance of the piece of text, and a feature amount of a second image in which a character string not included in the piece of text is embedded in the first image; and a training unit that calculates a loss on the basis of the feature amount of the text, the feature amount of the first image, and the feature amount of the second image, and that updates parameters of the first model and the second model on the basis of the loss. The training device thereby reduces, for a text- and image-embedding model, the effect of an irrelevant character string embedded in an image.

Description

Training device, training method, and program

The present invention relates to a learning device, a learning method, and a program.
Technology that embeds the features of images and text in the same space and lets a computer understand images and text based on those features is becoming widespread. Specific examples are techniques that measure the distance between an image and a text in that space and use the distance as a search score to retrieve images from text, or text from images (Non-Patent Document 1, Non-Patent Document 2).

In these techniques, as a strategy for building the embedding mappings, an image recognition model and a language model are prepared and used to train the network parameters of neural networks on large-scale datasets of related image-text pairs. This makes it possible to embed related images and texts close together in the same space, and such models have been well evaluated in the vision-and-language field, which exploits both visual and textual information.

However, for conventional models — in particular the model of Non-Patent Document 1 — it has been reported that, because the accuracy with which they recognize characters written in an image is so high, deliberately embedding unrelated characters (characters not contained in the text associated as the correct answer with a training image given during learning) into an image makes the model react strongly to those unrelated characters, so that the visual information in the image can no longer be recognized correctly. The outcome of this phenomenon varies with the size and position of the characters in the image, but it suggests that the desired visual information may fail to be obtained not only under malicious character embedding but also from naturally occurring character information such as corporate logos and signboards, and this misrecognition may cause problems in actual operation.

For example, in an image search task that uses text as a query, if an image contains characters unrelated to the query, the model may react strongly to those unrelated characters and fail to recognize the visual information in the image correctly.

The present invention has been made in view of the above points, and its object is to reduce, for a text and image embedding model, the influence of unrelated character strings embedded in images.

To solve the above problem, the learning device includes: a first acquisition unit that acquires feature quantities of a plurality of texts using a first model; a second acquisition unit that acquires, for each text, using a second model, the feature quantity of a first image that is a positive example of relevance to the text and the feature quantity of a second image in which a character string not contained in the text is embedded in the first image; and a learning unit that calculates a loss based on the feature quantity of the text, the feature quantity of the first image, and the feature quantity of the second image, and updates the parameters of the first model and the second model based on the loss.

For a text and image embedding model, the influence of unrelated character strings embedded in images can thereby be reduced.
FIG. 1 is a diagram showing an example of the hardware configuration of a search device 10 according to an embodiment of the present invention. FIG. 2 is a diagram showing an example of the functional configuration of the search device 10 according to the embodiment. FIG. 3 is a flowchart for explaining an example of the processing procedure of the model parameter learning process. FIG. 4 is a diagram for explaining how the softmax output values and losses for text and images are calculated for each of the positive and negative examples.
Embodiments of the present invention are described below with reference to the drawings. In this embodiment, a search device 10 that executes a search task is described, taking as an example the task of extracting images related to a text search query from a given set of search-target images.

FIG. 1 is a diagram showing an example of the hardware configuration of the search device 10 in the embodiment of the present invention. The search device 10 in FIG. 1 includes a drive device 100, an auxiliary storage device 102, a memory device 103, a processor 104, an interface device 105, and the like, which are interconnected by a bus B.

A program that realizes the processing in the search device 10 is provided on a recording medium 101 such as a CD-ROM. When the recording medium 101 storing the program is set in the drive device 100, the program is installed from the recording medium 101 into the auxiliary storage device 102 via the drive device 100. However, the program does not necessarily need to be installed from the recording medium 101 and may instead be downloaded from another computer via a network. The auxiliary storage device 102 stores the installed program as well as necessary files, data, and the like.

The memory device 103 reads the program from the auxiliary storage device 102 and stores it when an instruction to start the program is given. The processor 104 is a CPU, a GPU (Graphics Processing Unit), or a CPU and a GPU, and executes the functions of the search device 10 according to the program stored in the memory device 103. The interface device 105 is used as an interface for connecting to a network.

FIG. 2 is a diagram showing an example of the functional configuration of the search device 10 according to the embodiment of the present invention. In FIG. 2, the search device 10 includes a search unit 11 and a model learning unit 12. Each of these units is realized by processing that one or more programs installed in the search device 10 cause the processor 104 to execute.
The search unit 11 executes the search task. Its inputs are of two kinds: a search query Q and a set of search-target images {I_0, I_1, ..., I_m}. Its outputs are an ordered set {I_0, I_1, ..., I_k} of images related to the search query Q (hereinafter, "related images") and the degree of relevance of each related image to Q, {S_0, S_1, ..., S_k}. Here, m is the number of search-target images and k is the number of related images obtained by the search.

The search unit 11 includes a context encoding unit 111, an image encoding unit 112, and a ranking unit 113. The context encoding unit 111 and the image encoding unit 112 are implemented as neural networks, and all arithmetic processing in these networks is performed based on trained parameters corresponding to each model (neural network).
The context encoding unit 111 takes as input a character string constituting an arbitrary sentence as the search query Q and, based on the trained parameters of its model, outputs (generates) a feature quantity u of the search query. The concrete neural network used as the context encoding unit 111 is not limited to any particular model as long as it encodes text information. For example, the text encoder model used in Non-Patent Document 1 may be used; that model takes text as input and outputs a d-dimensional feature quantity, but any other context-aware pre-trained model using a transformer may be used instead.

The image encoding unit 112 takes as input each search-target image I_i of the search-target image set and, based on the trained parameters of its model, outputs (generates) the feature quantity v_i of the search-target image I_i, where i = 0, 1, ..., m. The concrete neural network used as the image encoding unit 112 is not limited to any particular model as long as it receives an image and outputs a d-dimensional vector; however, the output dimensions d of the context encoding unit 111 and the image encoding unit 112 must match. For example, the image encoder model used in Non-Patent Document 1 may be used; Non-Patent Document 1 provides ResNet and ViT as models that take an image and output a d-dimensional feature quantity, and these or other models may be used.
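The text above fixes only the interface of the two encoders — text in or image in, d-dimensional vector out, with matching d — and leaves the architecture open. The following is a minimal sketch of that interface, assuming PyTorch and torchvision; the ResNet-50 backbone, the toy transformer text encoder, and the projection sizes are illustrative assumptions, not choices made in the original text.

```python
import torch
import torch.nn as nn
import torchvision.models as tvm

class ImageEncoder(nn.Module):
    """Image encoding unit 112: image -> d-dimensional feature vector v."""
    def __init__(self, d: int = 512):
        super().__init__()
        backbone = tvm.resnet50(weights=None)
        backbone.fc = nn.Identity()            # keep the 2048-dim pooled feature
        self.backbone = backbone
        self.proj = nn.Linear(2048, d)         # project into the shared d-dim space

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        return self.proj(self.backbone(images))          # (batch, d)

class TextEncoder(nn.Module):
    """Context encoding unit 111: token ids -> d-dimensional feature vector u."""
    def __init__(self, vocab_size: int = 30000, d: int = 512, n_layers: int = 2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d)
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        h = self.encoder(self.embed(token_ids))          # (batch, seq_len, d)
        return h[:, 0, :]                                # first-token state as the text feature
```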
The ranking unit 113 takes as input the feature quantity u output by the context encoding unit 111 for the search query Q and the set of feature quantities v_i output by the image encoding unit 112 for the search-target images I_i (hereinafter, the "feature quantity set") {v_1, ..., v_i, ..., v_m}, and outputs the ordered set {I_1, ..., I_i, ..., I_k} of related images for the search query Q and the degree of relevance of each related image to Q, {S_1, ..., S_i, ..., S_k}.

The degree of relevance S_i between an image I_i and the search query Q is computed as S_i = f(u, v_i) using an appropriate distance function f. As a concrete implementation example, f may be the reciprocal of the inner product of u and v_i; the inputs of f are two vectors of the same dimension and its output is a scalar. Besides the reciprocal of the inner product, any distance function that can measure the distance between vectors may be used as f.
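A hypothetical sketch of the ranking unit 113 follows; the plain inner product is used here as the score f, but, as stated above, the function is pluggable (the reciprocal of the inner product or any other vector distance would fit the same slot).

```python
import numpy as np

def rank_images(u: np.ndarray, V: np.ndarray, k: int = 10):
    """u: (d,) query feature, V: (m, d) image features -> top-k (image index, S_i)."""
    scores = V @ u                        # S_i = f(u, v_i); plain inner product here
    order = np.argsort(-scores)[:k]       # most relevant images first
    return [(int(i), float(scores[i])) for i in order]
```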
The model learning unit 12 learns the model parameters of the context encoding unit 111 and the image encoding unit 112.

As preparation before learning, training data for the search task is collected in advance. For example, the data collected in "Thomee, B., Shamma, D. A., Friedland, G., Elizalde, B., Ni, K., Poland, D., Borth, D., and Li, L.-J. Yfcc100m: The new data in multimedia research. Communications of the ACM, 59(2):64-73, 2016" may be used. The training data consists of a text set T = {T_0, T_1, ..., T_c} and an image set I = {I_0, I_1, ..., I_m'}.

Furthermore, for each text T_i, the set of related images (taken to be positive examples of relevance to T_i) I_i = {I_j | I_j is a document related to T_i} is labeled as correct data. From this pre-collected dataset, one image (a positive example image) randomly extracted from the image set I_i related to each T_i is taken as I_i^+, and (T_i, I_i^+) forms one piece of training data. In other words, the training data prepared in advance is a set of pairs of a text and a positive example image.

The model learning unit 12 updates the model parameters of the context encoding unit 111 and the image encoding unit 112 by supervised learning using this training data. The model parameters of the context encoding unit 111 and the image encoding unit 112 are assumed to have been initialized in advance with appropriate initial values (when the model structure of Non-Patent Document 1 is used, the parameters of an existing trained model may be used as the initial parameters).

The model learning unit 12 updates the model parameters based on all the training data and repeats this an arbitrary number of times (this repeated process is called an "epoch", and the number of repetitions the "number of epochs"). The parameter update method may be the same as in general neural network training.
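As an illustration of how the (T_i, I_i^+) pairs described above can be assembled, the following sketch samples one labelled related image per text; the data layout (a dict mapping each text index to its related image ids) and the helper name are assumptions made for the example.

```python
import random

def build_training_pairs(texts, related_images):
    """texts: list of str; related_images: dict mapping text index -> list of related image ids."""
    pairs = []
    for i, text in enumerate(texts):
        positives = related_images.get(i, [])
        if not positives:
            continue                                      # skip texts with no labelled image
        pairs.append((text, random.choice(positives)))    # (T_i, I_i^+)
    return pairs
```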
The processing procedure that the search device 10 executes for model learning is described below. FIG. 3 is a flowchart for explaining an example of the processing procedure of the model parameter learning process.

In step S101, the model learning unit 12 randomly divides the training data into a plurality of mini-batches.

Subsequently, the model learning unit 12 executes loop processing L1 for every mini-batch. The mini-batch being processed in loop L1 is referred to as the "target batch". Within loop L1, the model learning unit 12 executes steps S102 to S109 for the target batch.

In step S102, the model learning unit 12 inputs the text T_i of each piece of training data in the target batch to the context encoding unit 111 and obtains, for each piece of training data, the vector (feature quantity u_i) generated by the context encoding unit 111. For example, if the mini-batch size is 10, ten feature quantities u_i are obtained.

Next, the model learning unit 12 inputs the image I_i^+ of each piece of training data in the target batch to the image encoding unit 112 and obtains, for each piece of training data, the vector (feature quantity v_i^+) generated by the image encoding unit 112 (S103). For example, if the mini-batch size is 10, ten feature quantities v_i^+ are obtained.
Next, for each image of the training data in the target batch, with T' denoting the set of texts in the target batch, the model learning unit 12 selects, for an arbitrary text T^- ∈ T' - {T_i}, one character string s_k that is an arbitrary noun contained in T^- and not contained in T_i (S104). Here, for a given piece of training data, T_i is the correct text. Therefore, for that training data, T' - {T_i} is the set of texts in the target batch excluding the correct text, and T^- is one text from that set — that is, one text in the target batch other than the correct text (and unrelated to the correct image). Accordingly, the character string s_k, an arbitrary noun contained in T^- and not contained in T_i, is a noun that does not appear in the correct text.

For example, suppose the image I_i^+ (correct image) of a piece of training data is an image of a dog, the text T_i of that training data is "A dog is running around in the park", and the text set T' of a mini-batch of size 10 is:
(1) A dog is running around in the park
(2) A cat is lying down in the garden
(3) ...
 :
(10) ...
In this case, T' - {T_i} is the nine texts (2) to (10). T^- is one text selected arbitrarily (randomly) from (2) to (10), for example "A cat is lying down in the garden". The character string s_k is then, for example, a noun contained in "A cat is lying down in the garden" and not contained in "A dog is running around in the park". Note that s_k need not be selected from T^-; it may instead be selected from the nouns contained in an entire vocabulary set (a vocabulary set not limited to the training data) that are not contained in T_i.
The character string s_k may be selected at random. Alternatively, word vectors may be obtained in advance with FastText or the like, and the word whose average distance from the words of the correct text is largest in the word-vector space may be selected as the character string s_k.
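A sketch of the second selection strategy is shown below, assuming word vectors are already available as a dict (in practice FastText, mentioned above, could supply them); the function and argument names are hypothetical.

```python
import numpy as np

def pick_negative_word(candidate_nouns, correct_nouns, word_vecs):
    """Return the candidate noun with the largest average distance to the correct text's nouns."""
    candidates = [w for w in candidate_nouns if w not in correct_nouns and w in word_vecs]

    def avg_dist(w):
        v = word_vecs[w]
        return float(np.mean([np.linalg.norm(v - word_vecs[c])
                              for c in correct_nouns if c in word_vecs]))

    return max(candidates, key=avg_dist)     # the farthest word becomes s_k
```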
Next, for each piece of training data in the target batch, the model learning unit 12 generates a negative example image I_ik^- by embedding (superimposing) the character string s_k for that training data onto a copy of the image I_i^+, using a random font size that does not exceed the image size of I_i^+ (S105). Several candidate font sizes may be given in advance, or a minimum size may be defined and the size then chosen continuously at random within that range. One or more images I_i^- may be generated per piece of training data; when several I_i^- are generated, several character strings s_k are selected for that training data.
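A sketch of step S105 follows, assuming Pillow for rendering; the font file, the size range, and the text position are illustrative choices, since the text above only requires that the string fits inside the image and that the font size is chosen at random.

```python
import random
from PIL import Image, ImageDraw, ImageFont

def embed_string(image: Image.Image, s_k: str, min_size: int = 12,
                 font_path: str = "DejaVuSans.ttf") -> Image.Image:
    """Return a copy of `image` (I_i^+) with `s_k` drawn on it -> negative image I_ik^-."""
    neg = image.copy()                                  # keep the positive image intact
    draw = ImageDraw.Draw(neg)
    max_size = max(min_size, neg.height // 4)           # keep the string inside the image
    font = ImageFont.truetype(font_path, random.randint(min_size, max_size))
    x = random.randint(0, max(1, neg.width // 2))
    y = random.randint(0, max(1, neg.height - max_size))
    draw.text((x, y), s_k, fill="white", font=font)
    return neg
```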
Next, the model learning unit 12 inputs the image I_ik^- of each piece of training data in the target batch to the image encoding unit 112 and obtains the feature quantity v_ik^- that the image encoding unit 112 generates for each I_ik^- (S106).
At this point, the target batch holds:
Text feature quantity set: {u_1, u_2, ..., u_b} (where b is the mini-batch size)
Image feature quantity set: {v_1^+, v_2^+, ..., v_b^+, v_11^-, ..., v_bl^-} (where l is the number of negative example images)
Next, the model learning unit 12 computes, for the positive and negative examples, the output values of the softmax function over the text-to-image and image-to-text inner products (S107).

This computation is explained with reference to FIG. 4, which is a diagram for explaining how the softmax output values and losses for text and images are calculated for the positive and negative examples. In FIG. 4, the text feature quantity set {u_1, u_2, ..., u_b} is arranged in the column direction and the image feature quantity set {v_1^+, v_2^+, ..., v_b^+, v_11^-, ..., v_bl^-} is arranged in the row direction.
For the image-to-text direction, the model learning unit 12 computes the inner product between the feature quantity v_i of an image and each feature quantity u_j of the text feature quantity set {u_1, u_2, ..., u_b} as x_ij = v_i · u_j. Computing this inner product for every u_j yields x_i1, x_i2, ..., x_ib, and based on these the model learning unit 12 computes the output values of the softmax function (hereinafter, "softmax output values") for each row of inner products in FIG. 4. Performing this for the positive image feature quantity set {v_1^+, v_2^+, ..., v_b^+} yields the softmax output values corresponding to the inner products of each positive row in FIG. 4.
The model learning unit 12 likewise computes the inner products between the feature quantity u_j of each text and the image feature quantity set {v_1^+, v_2^+, ..., v_b^+, v_11^-, ..., v_bl^-} and the corresponding softmax output values. Performing this for the text feature quantity set {u_1, u_2, ..., u_b} yields the (text-to-image) softmax output values corresponding to the inner products of every column in FIG. 4.

Next, based on the set of computed softmax output values, the model learning unit 12 computes a loss (softmax cross-entropy loss) for each positive row and each column in FIG. 4, and computes their average or sum as the loss for the target batch (S108). In this loss computation, the class label (correct label) for the image-to-text softmax output values (the row direction in FIG. 4) is set to 1 when i = j and to 0 otherwise, as shown in FIG. 4. The class label (correct label) for the text-to-image softmax output values (the column direction in FIG. 4) is set to 1 only at the positions of v_j^+ where i = j, as shown in FIG. 4.
Specifically, the cross-entropy loss function is H(p, q) = -Σ p(x) log q(x), where p(x) is the true distribution and q(x) is the predicted distribution. The class labels are applied as the true distribution and the softmax output values are fitted as the predicted distribution to compute the loss for each positive row or column. The average or sum of the losses in the row direction is the image loss, and the average or sum of the losses in the column direction is the text loss; the average or sum of the image loss and the text loss is taken as the loss for the target batch.
When an existing trained model is used for the initial model parameters, the loss may be computed using only the text-to-image softmax output values.
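The batch loss of steps S107–S108 can be written compactly as below. This is a minimal sketch assuming PyTorch, with U the b×d text features, Vp the b×d positive-image features, and Vn the features of the negative images; the positive rows and all columns are scored exactly as described above, and the average of the two directional losses is returned, though the sum, or the text-to-image term alone, are equally valid per the text.

```python
import torch
import torch.nn.functional as F

def batch_loss(U: torch.Tensor, Vp: torch.Tensor, Vn: torch.Tensor) -> torch.Tensor:
    """U: (b, d) text features u_i, Vp: (b, d) positives v_i^+, Vn: (n_neg, d) negatives v_ik^-."""
    b = U.size(0)
    V = torch.cat([Vp, Vn], dim=0)            # all image features, positives first
    logits = V @ U.t()                        # x_ij = v_i . u_j, shape (b + n_neg, b)
    targets = torch.arange(b, device=U.device)
    # image -> text: softmax over texts, computed only on the positive rows
    loss_image = F.cross_entropy(logits[:b], targets)
    # text -> image: softmax over all images (positives + negatives), one column per text
    loss_text = F.cross_entropy(logits.t(), targets)
    return (loss_image + loss_text) / 2       # average of image loss and text loss
```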
Next, the model learning unit 12 updates the model parameters of the context encoding unit 111 and the image encoding unit 112 based on the loss for the target batch (S109). Specifically, the model learning unit 12 computes the gradient of the loss with respect to each model parameter using backpropagation and updates the model parameters using an arbitrary optimization method.

When loop L1 has been executed for all mini-batches, the model learning unit 12 determines whether a predetermined termination condition is satisfied (S110). If the termination condition is not satisfied (No in S110), the model learning unit 12 repeats the processing from step S101. When the termination condition is satisfied (Yes in S110), the model learning unit 12 ends the processing of FIG. 3.
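Putting the pieces together, the overall procedure of FIG. 3 can be outlined as follows. This is an illustrative skeleton, assuming PyTorch, the `batch_loss` sketch above, and a hypothetical `collate_batch` helper that tokenizes the texts and performs steps S104–S105 (negative string selection and embedding); the optimizer and the fixed-epoch termination condition are free choices, not prescribed by the text.

```python
import random
import torch

def train(text_encoder, image_encoder, training_pairs, collate_batch,
          batch_size: int = 64, max_epochs: int = 10, lr: float = 1e-5):
    params = list(text_encoder.parameters()) + list(image_encoder.parameters())
    optimizer = torch.optim.AdamW(params, lr=lr)
    for epoch in range(max_epochs):                               # S110: stop after max_epochs
        random.shuffle(training_pairs)                            # S101: random mini-batches
        for start in range(0, len(training_pairs), batch_size):   # loop L1 over target batches
            batch = training_pairs[start:start + batch_size]
            tokens, pos_images, neg_images = collate_batch(batch) # S104-S105 happen inside
            U = text_encoder(tokens)                              # S102: text features u_i
            Vp = image_encoder(pos_images)                        # S103: positive features v_i^+
            Vn = image_encoder(neg_images)                        # S106: negative features v_ik^-
            loss = batch_loss(U, Vp, Vn)                          # S107-S108: batch loss
            optimizer.zero_grad()
            loss.backward()                                       # S109: backpropagation
            optimizer.step()                                      #        and parameter update
```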
As described above, according to the present embodiment, the embedding models for text and images are trained using images in which character strings that are negative examples of (i.e., unrelated to) relevance to the text are embedded. The influence of unrelated character strings embedded in images can therefore be reduced for the text and image embedding model, and, as a result, a model that is robust against, for example, adversarial character-embedding attacks can be learned.

However, if the model is trained too strongly to reduce the influence of characters in the image, the resulting model becomes insensitive to information that is actually needed. In the present embodiment, therefore, learning is also performed on the positive examples. This reduces the influence of unrelated characters while preserving the ability to recognize characters that are naturally embedded in images.

Regarding the above embodiments, the following supplementary notes are further disclosed.
(Supplementary note 1)
A learning device comprising:
a memory; and
at least one processor connected to the memory,
wherein the processor:
acquires feature quantities of a plurality of texts using a first model;
acquires, for each text, using a second model, a feature quantity of a first image that is a positive example of relevance to the text and a feature quantity of a second image in which a character string not contained in the text is embedded in the first image; and
calculates a loss based on the feature quantity of the text, the feature quantity of the first image, and the feature quantity of the second image, and updates parameters of the first model and the second model based on the loss.
(Supplementary note 2)
A recording medium recording a program that causes a computer to execute a process comprising:
acquiring feature quantities of a plurality of texts using a first model;
acquiring, for each text, using a second model, a feature quantity of a first image that is a positive example of relevance to the text and a feature quantity of a second image in which a character string not contained in the text is embedded in the first image; and
calculating a loss based on the feature quantity of the text, the feature quantity of the first image, and the feature quantity of the second image, and updating parameters of the first model and the second model based on the loss.
In the present embodiment, the search device 10 is an example of a learning device, and the model learning unit 12 is an example of a first acquisition unit, a second acquisition unit, and a learning unit.

Although the embodiments of the present invention have been described in detail above, the present invention is not limited to these specific embodiments, and various modifications and changes are possible within the scope of the gist of the present invention as set forth in the claims.
10   Search device
11   Search unit
12   Model learning unit
100  Drive device
101  Recording medium
102  Auxiliary storage device
103  Memory device
104  Processor
105  Interface device
111  Context encoding unit
112  Image encoding unit
113  Ranking unit
B    Bus

Claims (5)

  1.  A learning device comprising:
      a first acquisition unit that acquires feature quantities of a plurality of texts using a first model;
      a second acquisition unit that acquires, for each text, using a second model, a feature quantity of a first image that is a positive example of relevance to the text and a feature quantity of a second image in which a character string not contained in the text is embedded in the first image; and
      a learning unit that calculates a loss based on the feature quantity of the text, the feature quantity of the first image, and the feature quantity of the second image, and updates parameters of the first model and the second model based on the loss.
  2.  The learning device according to claim 1, wherein the learning unit calculates, for each text, inner products between the feature quantity of the text and the feature quantities of the first images and the second images, calculates, for each first image, inner products between the feature quantity of the first image and the feature quantities of the respective texts, and calculates the loss based on a cross-entropy loss of softmax function output values of the inner products for each text and a cross-entropy loss of softmax function output values of the inner products for each first image.
  3.  The learning device according to claim 1 or 2, wherein the second acquisition unit generates, for each text, the second image by embedding a character string not contained in the text in the first image that is a positive example of relevance to the text.
  4.  A learning method executed by a computer, comprising:
      a first acquisition procedure of acquiring feature quantities of a plurality of texts using a first model;
      a second acquisition procedure of acquiring, for each text, using a second model, a feature quantity of a first image that is a positive example of relevance to the text and a feature quantity of a second image in which a character string not contained in the text is embedded in the first image; and
      a learning procedure of calculating a loss based on the feature quantity of the text, the feature quantity of the first image, and the feature quantity of the second image, and updating parameters of the first model and the second model based on the loss.
  5.  A program that causes a computer to execute:
      a first acquisition procedure of acquiring feature quantities of a plurality of texts using a first model;
      a second acquisition procedure of acquiring, for each text, using a second model, a feature quantity of a first image that is a positive example of relevance to the text and a feature quantity of a second image in which a character string not contained in the text is embedded in the first image; and
      a learning procedure of calculating a loss based on the feature quantity of the text, the feature quantity of the first image, and the feature quantity of the second image, and updating parameters of the first model and the second model based on the loss.
PCT/JP2022/031921 2022-08-24 2022-08-24 Training device, training method, and program WO2024042650A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP2022/031921 WO2024042650A1 (en) 2022-08-24 2022-08-24 Training device, training method, and program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2022/031921 WO2024042650A1 (en) 2022-08-24 2022-08-24 Training device, training method, and program

Publications (1)

Publication Number Publication Date
WO2024042650A1 true WO2024042650A1 (en) 2024-02-29

Family

ID=90012786

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2022/031921 WO2024042650A1 (en) 2022-08-24 2022-08-24 Training device, training method, and program

Country Status (1)

Country Link
WO (1) WO2024042650A1 (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200380403A1 (en) * 2019-05-30 2020-12-03 Adobe Inc. Visually Guided Machine-learning Language Model

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200380403A1 (en) * 2019-05-30 2020-12-03 Adobe Inc. Visually Guided Machine-learning Language Model

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
RADFORD ALEC, KIM JONG WOOK, HALLACY CHRIS, RAMESH ADITYA, GOH GABRIEL, AGARWAL SANDHINI, SASTRY GIRISH, ASKELL AMANDA, MISHKIN PA: "Learning transferable visual models from natural language supervision", 26 February 2021 (2021-02-26), XP093067451, Retrieved from the Internet <URL:https://arxiv.org/pdf/2103.00020.pdf> [retrieved on 20230726], DOI: 10.48550/arXiv.2103.00020 *
TOM B. BROWN; DANDELION MANÉ; AURKO ROY; MARTÍN ABADI; JUSTIN GILMER: "Adversarial Patch", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 27 December 2017 (2017-12-27), 201 Olin Library Cornell University Ithaca, NY 14853, XP080848635 *

Similar Documents

Publication Publication Date Title
WO2023065544A1 (en) Intention classification method and apparatus, electronic device, and computer-readable storage medium
CN111444320B (en) Text retrieval method and device, computer equipment and storage medium
CN110619034A (en) Text keyword generation method based on Transformer model
CN111680494B (en) Similar text generation method and device
CN110852110B (en) Target sentence extraction method, question generation method, and information processing apparatus
CN112348911B (en) Semantic constraint-based method and system for generating fine-grained image by stacking texts
CN116450796B (en) Intelligent question-answering model construction method and device
CN116151132B (en) Intelligent code completion method, system and storage medium for programming learning scene
CN108536735B (en) Multi-mode vocabulary representation method and system based on multi-channel self-encoder
CN111046178B (en) Text sequence generation method and system
CN112380319A (en) Model training method and related device
CN116738994A (en) Context-enhanced-based hinting fine-tuning relation extraction method
RU2712101C2 (en) Prediction of probability of occurrence of line using sequence of vectors
CN116737938A (en) Fine granularity emotion detection method and device based on fine tuning large model online data network
CN111145914B (en) Method and device for determining text entity of lung cancer clinical disease seed bank
CN115329120A (en) Weak label Hash image retrieval framework with knowledge graph embedded attention mechanism
CN111444720A (en) Named entity recognition method for English text
CN110956039A (en) Text similarity calculation method and device based on multi-dimensional vectorization coding
CN114048314A (en) Natural language steganalysis method
CN116226357B (en) Document retrieval method under input containing error information
WO2024042650A1 (en) Training device, training method, and program
WO2023192674A1 (en) Attention neural networks with parallel attention and feed-forward layers
CN113486160B (en) Dialogue method and system based on cross-language knowledge
WO2022185457A1 (en) Feature quantity extraction device, learning device, feature quantity extraction method, learning method, and program
CN117312506B (en) Page semantic information extraction method and system

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22956478

Country of ref document: EP

Kind code of ref document: A1