WO2024042650A1 - Training device, training method, and program - Google Patents

Training device, training method, and program

Info

Publication number
WO2024042650A1
WO2024042650A1 (PCT/JP2022/031921)
Authority
WO
WIPO (PCT)
Prior art keywords
image
text
feature amount
model
loss
Prior art date
Application number
PCT/JP2022/031921
Other languages
French (fr)
Japanese (ja)
Inventor
拓 長谷川
京介 西田
いつみ 斉藤
仙 吉田
Original Assignee
日本電信電話株式会社 (Nippon Telegraph and Telephone Corporation)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 日本電信電話株式会社 (Nippon Telegraph and Telephone Corporation)
Priority to PCT/JP2022/031921 priority Critical patent/WO2024042650A1/en
Publication of WO2024042650A1 publication Critical patent/WO2024042650A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning

Definitions

  • The present invention relates to a learning device, a learning method, and a program.
  • Technology that embeds the features of images and text in the same space and lets a computer understand them based on those features is becoming widespread, for example techniques that measure the distance between an image and a text in that space and use it as a search score to retrieve images from text or text from images (Non-Patent Document 1, Non-Patent Document 2).
  • image recognition models and language models are prepared respectively, and these are used to train the network parameters of neural networks on large-scale datasets consisting of related image and text pairs. This makes it possible to embed related images and texts close together in the same space, and such models have been well evaluated in the vision-and-language field, which utilizes visual and textual information.
  • Non-Patent Document 1 in particular has been reported to recognize characters written in images so strongly that, if unrelated characters (characters not contained in the text associated as the correct answer with a training image given during learning) are deliberately embedded in an image, the model reacts strongly to the unrelated characters and the visual information in the image can no longer be recognized correctly. The outcome of this phenomenon varies with the size and position of the characters in the image, but it suggests that the desired visual information may fail to be obtained not only under malicious character embedding but also from naturally occurring character information such as corporate logos and signboards, and this misrecognition may cause problems in actual operation.
  • for example, in an image search task using text as a query, if the image contains characters unrelated to the query, the system may react strongly to the unrelated characters and fail to correctly recognize the visual information in the image.
  • the present invention has been made in view of the above points, and its object is to reduce, for a text and image embedding model, the influence of unrelated character strings embedded in images.
  • To solve the above problem, the learning device includes a first acquisition unit that acquires feature quantities of a plurality of texts using a first model;
  • a second acquisition unit that acquires, for each text, using a second model, the feature quantity of a first image that is a positive example of relevance to the text and the feature quantity of a second image in which a character string not contained in the text is embedded in the first image; and
  • a learning unit that calculates a loss based on the feature quantity of the text, the feature quantity of the first image, and the feature quantity of the second image, and updates the parameters of the first model and the second model based on the loss.
  • FIG. 1 is a diagram showing an example of the hardware configuration of a search device 10 according to an embodiment of the present invention.
  • FIG. 2 is a diagram showing an example of the functional configuration of the search device 10 according to the embodiment of the present invention.
  • FIG. 3 is a flowchart for explaining an example of the processing procedure of the model parameter learning process.
  • FIG. 4 is a diagram for explaining how the softmax output values and losses for text and images are calculated for each of the positive and negative examples.
  • In this embodiment, a search device 10 that executes a search task is described, taking as an example the task of extracting images related to a text search query from a given set of search-target images.
  • FIG. 1 is a diagram showing an example of the hardware configuration of a search device 10 in an embodiment of the present invention.
  • the search device 10 in FIG. 1 includes a drive device 100, an auxiliary storage device 102, a memory device 103, a processor 104, an interface device 105, etc., which are interconnected by a bus B.
  • a program that realizes the processing in the search device 10 is provided by a recording medium 101 such as a CD-ROM.
  • the program is installed from the recording medium 101 to the auxiliary storage device 102 via the drive device 100.
  • the program does not necessarily need to be installed from the recording medium 101, and may be downloaded from another computer via a network.
  • the auxiliary storage device 102 stores installed programs as well as necessary files, data, and the like.
  • the memory device 103 reads and stores the program from the auxiliary storage device 102 when there is an instruction to start the program.
  • the processor 104 is a CPU, a GPU (Graphics Processing Unit), or a CPU and a GPU, and executes functions related to the search device 10 according to a program stored in the memory device 103.
  • the interface device 105 is used as an interface for connecting to a network.
  • FIG. 2 is a diagram showing an example of the functional configuration of the search device 10 according to the embodiment of the present invention.
  • the search device 10 includes a search section 11 and a model learning section 12. Each of these units is realized by one or more programs installed in the search device 10 causing the processor 104 to execute the process.
  • the search unit 11 executes a search task.
  • the inputs to the search unit 11 are a search query Q and a set of search-target images {I_0, I_1, ..., I_m}; the outputs from the search unit 11 are an ordered set {I_0, I_1, ..., I_k} of images related to the search query Q (hereinafter referred to as "related images") and the degree of relevance of each related image to Q, {S_0, S_1, ..., S_k}.
  • m is the number of images to be searched, and k is the number of related images obtained by the search.
  • the search unit 11 includes a context encoding unit 111, an image encoding unit 112, and a ranking unit 113.
  • the context encoding unit 111 and the image encoding unit 112 are implemented using a neural network. All arithmetic processing in the neural network is performed based on learned parameters corresponding to each model (neural network).
  • the context encoding unit 111 takes as input a character string constituting an arbitrary sentence as the search query Q and, based on the parameters of the trained model serving as the context encoding unit 111, outputs (generates) a feature quantity u of the search query.
  • the concrete neural network model used as the context encoding unit 111 is not limited to any particular model as long as it encodes text information.
  • for example, the text encoder model used in Non-Patent Document 1 may be used. That model takes text as input and outputs a d-dimensional feature quantity, but any other context-aware pre-trained model using a transformer may be used instead.
  • a specific neural network model for the image encoding unit 112 is not limited to a specific model as long as it receives an image as an input and outputs a d-dimensional vector. However, the dimensions d of the output vectors of the context encoding unit 111 and the image encoding unit 112 need to match.
  • the image encoder model used in Non-Patent Document 1 may be used.
  • ResNet and ViT are prepared as models that input an image and output a d-dimensional feature amount. These or other models may be used.
  • the ranking unit 113 takes as input the feature quantity u output from the context encoding unit 111 for the search query Q and the set of feature quantities v_i output from the image encoding unit 112 for the search-target images I_i (hereinafter referred to as the "feature quantity set") {v_1, ..., v_i, ..., v_m}, and outputs the ordered set {I_1, ..., I_i, ..., I_k} of related images of the search query Q and the degree of relevance of each related image to Q, {S_1, ..., S_i, ..., S_k}.
  • the degree of relevance S_i between an image I_i and the search query Q is computed as S_i = f(u, v_i) using an appropriate distance function f; as a concrete implementation, f may be the reciprocal of the inner product of u and v_i.
  • the inputs of f are two vectors of the same dimension, and the output is a scalar.
  • a distance function that can measure the distance between vectors may be used as f.
  • the model learning unit 12 learns the model parameters of the context encoding unit 111 and the image encoding unit 112.
  • training data for the search task is collected in advance.
  • one image (a positive example image) randomly extracted from the image set I_i related to each T_i is taken as I_i^+, and (T_i, I_i^+) forms one piece of training data.
  • the learning data prepared in advance is a set of text and positive example images.
  • the model learning unit 12 updates the model parameters of the context encoding unit 111 and the image encoding unit 112 through supervised learning using such learning data. The model parameters of the context encoding unit 111 and the image encoding unit 112 are assumed to have been initialized in advance with appropriate initial values (when the model structure of Non-Patent Document 1 is used, the parameters of an existing trained model may be used as the initial parameters).
  • the model learning unit 12 updates the model parameters based on all the learning data, and repeats this an arbitrary number of times (this repeated process is called an "epoch", and the number of repetitions is called the "epoch number").
  • the parameter updating method may be similar to general learning of neural networks.
  • FIG. 3 is a flowchart for explaining an example of the processing procedure of the model parameter learning process.
  • step S101 the model learning unit 12 randomly divides a plurality of learning data into a plurality of mini-batches.
  • the model learning unit 12 executes loop processing L1 for every mini-batch.
  • the mini-batch to be processed in the loop processing L1 is referred to as a "target batch.”
  • the model learning unit 12 executes steps S102 to S109 for the target batch.
  • in step S102, the model learning unit 12 inputs the text T_i of each piece of learning data included in the target batch to the context encoding unit 111 and obtains, for each piece of learning data, the vector (feature quantity u_i) generated by the context encoding unit 111. For example, if the size of the mini-batch is 10, 10 feature quantities u_i are obtained.
  • the model learning unit 12 then inputs the image I_i^+ of each piece of learning data included in the target batch to the image encoding unit 112 and obtains, for each piece of learning data, the vector (feature quantity v_i^+) generated by the image encoding unit 112 (S103). For example, if the mini-batch size is 10, 10 feature quantities v_i^+ are obtained.
  • next, for each image of the learning data included in the target batch, with T' denoting the set of texts in the target batch, the model learning unit 12 selects, for an arbitrary text T^- ∈ T' - {T_i}, one character string s_k that is an arbitrary noun contained in T^- and not contained in T_i (S104).
  • here, T_i is the correct text, so T' - {T_i} is the set of texts in the target batch excluding the correct text, and T^- is one text from that set, i.e., one text in the target batch other than the correct text (and unrelated to the correct image). The character string s_k, an arbitrary noun contained in T^- and not contained in T_i, is therefore a noun that does not appear in the correct text.
  • for example, suppose the image I_i^+ (correct image) of a piece of learning data is an image of a dog, its text T_i is "A dog is running around in the park", and the text set T' of a mini-batch of size 10 is (1) "A dog is running around in the park", (2) "A cat is lying down in the garden", (3) ..., (10) ....
  • T' - {T_i} is then the nine texts (2) to (10) of the above T'.
  • T^- is one text (for example, "A cat is lying down in the garden") arbitrarily (randomly) selected from (2) to (10).
  • the character string s_k is, for example, a noun included in "A cat is lying down in the garden" and not included in "A dog is running around in the park". However, s_k need not be selected from T^-; it may instead be selected from the nouns included in an entire vocabulary set (a vocabulary set not limited to the learning data) that are not included in T_i.
  • the character string sk may be selected randomly.
  • a word vector may be obtained in advance using FastText or the like, and the word having the longest average distance from the word of the correct text in the space of the word vector may be selected as the character string s k .
  • the model learning unit 12 generates a negative example image I_ik^- by embedding (superimposing) the character string s_k associated with the learning data onto a copy of the image I_i^+ of that learning data, using a random font size that does not exceed the image size of I_i^+ (S105).
  • several candidate font sizes may be given in advance, or a minimum size may be defined and the font size then chosen continuously at random within that range.
  • one or more images I_i^- may be generated per piece of learning data; when a plurality of I_i^- are generated, a plurality of character strings s_k are selected for that learning data.
  • the model learning unit 12 inputs the image I_ik^- related to each piece of learning data of the target batch to the image encoding unit 112 and obtains the feature quantity v_ik^- generated by the image encoding unit 112 for each I_ik^- (S106).
  • at this point, the target batch holds the text feature quantity set {u_1, u_2, ..., u_b} (where b is the mini-batch size) and the image feature quantity set {v_1^+, v_2^+, ..., v_b^+, v_11^-, ..., v_bl^-} (where l is the number of negative example images).
  • model learning unit 12 calculates the output value of the softmax function of the inner product of text to image and image to text for each of the positive and negative examples (S107).
  • FIG. 4 is a diagram for explaining a method of calculating the output value and loss of the softmax function of text and image for each of the positive and negative examples.
  • in FIG. 4, the text feature quantity set {u_1, u_2, ..., u_b} is arranged in the column direction and the image feature quantity set {v_1^+, v_2^+, ..., v_b^+, v_11^-, ..., v_bl^-} is arranged in the row direction.
  • for the image-to-text direction, the model learning unit 12 computes the inner product x_ij = v_i · u_j between the feature quantity v_i of an image and each feature quantity u_j of the text feature quantity set {u_1, u_2, ..., u_b}, and computes the softmax function output value ("softmax output value") over each row of inner products; doing this for the positive image feature quantities yields the softmax output values corresponding to each positive row in FIG. 4.
  • the model learning unit 12 likewise computes the inner products between the feature quantity u_j of a text and the image feature quantity set {v_1^+, v_2^+, ..., v_b^+, v_11^-, ..., v_bl^-} and the corresponding softmax output values.
  • doing this for the whole text feature quantity set yields the (text-to-image) softmax output values corresponding to the inner products of all columns in FIG. 4.
  • the model learning unit 12 calculates a loss (softmax cross-entropy loss) for each positive row and each column in FIG. 4 based on the set of calculated softmax output values, and calculates the average or sum of these losses as the loss for the target batch (S108).
  • specifically, the cross-entropy loss function is H(p, q) = -Σ p(x) log q(x), where p(x) is the true distribution and q(x) is the predicted distribution.
  • the class labels are applied as the true distribution and the softmax output values are fitted to the predicted distribution to calculate the loss for each positive row or column.
  • the average or total sum of losses in the row direction is the image loss
  • the average or total sum of the losses in the column direction is the text loss.
  • the average or sum of the image loss and text loss is taken as the loss in the target batch.
  • when an existing trained model is used for the initial model parameters, the loss may be calculated using only the text-to-image softmax output values.
  • the model learning unit 12 updates the model parameters of the context encoding unit 111 and the image encoding unit 112 based on the loss for the target batch (S109). Specifically, the model learning unit 12 calculates the gradient of the loss with respect to each model parameter using backpropagation and updates the model parameters using an arbitrary optimization method.
  • model learning unit 12 determines whether a predetermined termination condition is satisfied (S110). If the termination condition is not satisfied (No in S110), the model learning unit 12 repeats steps S101 and subsequent steps. When the termination condition is satisfied (Yes in S110), the model learning unit 12 terminates the process of FIG. 3.
  • embedding models for text and images are trained using images in which character strings that are negative examples of (i.e., unrelated to) relevance to the text are embedded. Therefore, the influence of unrelated character strings embedded in images can be reduced for the text and image embedding model. As a result, it is possible to learn a model that is robust against, for example, adversarial character-embedding attacks.
  • the processor acquires feature quantities of a plurality of texts using a first model; acquires, for each text, using a second model, the feature quantity of a first image that is a positive example of relevance to the text and the feature quantity of a second image in which a character string not contained in the text is embedded in the first image; and calculates a loss based on the feature quantity of the text, the feature quantity of the first image, and the feature quantity of the second image, and updates the parameters of the first model and the second model based on the loss;
  • a learning device characterized by the above.
  • the search device 10 is an example of a learning device.
  • the model learning unit 12 is an example of a first acquisition unit, a second acquisition unit, and a learning unit.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Image Analysis (AREA)

Abstract

This training device includes: a first acquisition unit that uses a first model to acquire feature amounts for a plurality of pieces of text; a second acquisition unit that uses a second model to acquire, for each piece of text, a feature amount of a first image which is a positive example regarding the relevance of the piece of text, and a feature amount of a second image in which a character string not included in the piece of text is embedded in the first image; and a training unit that calculates a loss on the basis of the feature amount of the text, the feature amount of the first image, and the feature amount of the second image, and that updates parameters of the first model and the second model on the basis of the loss. The training device thereby reduces, for a text- and image-embedding model, the effect of an irrelevant character string embedded in an image.

Description

Training device, training method, and program

The present invention relates to a learning device, a learning method, and a program.
Technology that embeds the features of images and text in the same space and lets a computer understand images and text based on those features is becoming widespread. Specific examples are techniques that measure the distance between an image and a text in that space and use the distance as a search score to retrieve images from text, or text from images (Non-Patent Document 1, Non-Patent Document 2).

In these techniques, as a strategy for building the embedding mappings, an image recognition model and a language model are prepared and used to train the network parameters of neural networks on large-scale datasets of related image-text pairs. This makes it possible to embed related images and texts close together in the same space, and such models have been well evaluated in the vision-and-language field, which exploits both visual and textual information.

However, for conventional models — in particular the model of Non-Patent Document 1 — it has been reported that, because the accuracy with which they recognize characters written in an image is so high, deliberately embedding unrelated characters (characters not contained in the text associated as the correct answer with a training image given during learning) into an image makes the model react strongly to those unrelated characters, so that the visual information in the image can no longer be recognized correctly. The outcome of this phenomenon varies with the size and position of the characters in the image, but it suggests that the desired visual information may fail to be obtained not only under malicious character embedding but also from naturally occurring character information such as corporate logos and signboards, and this misrecognition may cause problems in actual operation.

For example, in an image search task that uses text as a query, if an image contains characters unrelated to the query, the model may react strongly to those unrelated characters and fail to recognize the visual information in the image correctly.

The present invention has been made in view of the above points, and its object is to reduce, for a text and image embedding model, the influence of unrelated character strings embedded in images.

To solve the above problem, the learning device includes: a first acquisition unit that acquires feature quantities of a plurality of texts using a first model; a second acquisition unit that acquires, for each text, using a second model, the feature quantity of a first image that is a positive example of relevance to the text and the feature quantity of a second image in which a character string not contained in the text is embedded in the first image; and a learning unit that calculates a loss based on the feature quantity of the text, the feature quantity of the first image, and the feature quantity of the second image, and updates the parameters of the first model and the second model based on the loss.

For a text and image embedding model, the influence of unrelated character strings embedded in images can thereby be reduced.
FIG. 1 is a diagram showing an example of the hardware configuration of a search device 10 according to an embodiment of the present invention. FIG. 2 is a diagram showing an example of the functional configuration of the search device 10 according to the embodiment. FIG. 3 is a flowchart for explaining an example of the processing procedure of the model parameter learning process. FIG. 4 is a diagram for explaining how the softmax output values and losses for text and images are calculated for each of the positive and negative examples.
Embodiments of the present invention are described below with reference to the drawings. In this embodiment, a search device 10 that executes a search task is described, taking as an example the task of extracting images related to a text search query from a given set of search-target images.

FIG. 1 is a diagram showing an example of the hardware configuration of the search device 10 in the embodiment of the present invention. The search device 10 in FIG. 1 includes a drive device 100, an auxiliary storage device 102, a memory device 103, a processor 104, an interface device 105, and the like, which are interconnected by a bus B.

A program that realizes the processing in the search device 10 is provided on a recording medium 101 such as a CD-ROM. When the recording medium 101 storing the program is set in the drive device 100, the program is installed from the recording medium 101 into the auxiliary storage device 102 via the drive device 100. However, the program does not necessarily need to be installed from the recording medium 101 and may instead be downloaded from another computer via a network. The auxiliary storage device 102 stores the installed program as well as necessary files, data, and the like.

The memory device 103 reads the program from the auxiliary storage device 102 and stores it when an instruction to start the program is given. The processor 104 is a CPU, a GPU (Graphics Processing Unit), or a CPU and a GPU, and executes the functions of the search device 10 according to the program stored in the memory device 103. The interface device 105 is used as an interface for connecting to a network.

FIG. 2 is a diagram showing an example of the functional configuration of the search device 10 according to the embodiment of the present invention. In FIG. 2, the search device 10 includes a search unit 11 and a model learning unit 12. Each of these units is realized by processing that one or more programs installed in the search device 10 cause the processor 104 to execute.
The search unit 11 executes the search task. Its inputs are of two kinds: a search query Q and a set of search-target images {I_0, I_1, ..., I_m}. Its outputs are an ordered set {I_0, I_1, ..., I_k} of images related to the search query Q (hereinafter, "related images") and the degree of relevance of each related image to Q, {S_0, S_1, ..., S_k}. Here, m is the number of search-target images and k is the number of related images obtained by the search.

The search unit 11 includes a context encoding unit 111, an image encoding unit 112, and a ranking unit 113. The context encoding unit 111 and the image encoding unit 112 are implemented as neural networks, and all arithmetic processing in these networks is performed based on trained parameters corresponding to each model (neural network).
The context encoding unit 111 takes as input a character string constituting an arbitrary sentence as the search query Q and, based on the trained parameters of its model, outputs (generates) a feature quantity u of the search query. The concrete neural network used as the context encoding unit 111 is not limited to any particular model as long as it encodes text information. For example, the text encoder model used in Non-Patent Document 1 may be used; that model takes text as input and outputs a d-dimensional feature quantity, but any other context-aware pre-trained model using a transformer may be used instead.

The image encoding unit 112 takes as input each search-target image I_i of the search-target image set and, based on the trained parameters of its model, outputs (generates) the feature quantity v_i of the search-target image I_i, where i = 0, 1, ..., m. The concrete neural network used as the image encoding unit 112 is not limited to any particular model as long as it receives an image and outputs a d-dimensional vector; however, the output dimensions d of the context encoding unit 111 and the image encoding unit 112 must match. For example, the image encoder model used in Non-Patent Document 1 may be used; Non-Patent Document 1 provides ResNet and ViT as models that take an image and output a d-dimensional feature quantity, and these or other models may be used.
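The text above fixes only the interface of the two encoders — text in or image in, d-dimensional vector out, with matching d — and leaves the architecture open. The following is a minimal sketch of that interface, assuming PyTorch and torchvision; the ResNet-50 backbone, the toy transformer text encoder, and the projection sizes are illustrative assumptions, not choices made in the original text.

```python
import torch
import torch.nn as nn
import torchvision.models as tvm

class ImageEncoder(nn.Module):
    """Image encoding unit 112: image -> d-dimensional feature vector v."""
    def __init__(self, d: int = 512):
        super().__init__()
        backbone = tvm.resnet50(weights=None)
        backbone.fc = nn.Identity()            # keep the 2048-dim pooled feature
        self.backbone = backbone
        self.proj = nn.Linear(2048, d)         # project into the shared d-dim space

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        return self.proj(self.backbone(images))          # (batch, d)

class TextEncoder(nn.Module):
    """Context encoding unit 111: token ids -> d-dimensional feature vector u."""
    def __init__(self, vocab_size: int = 30000, d: int = 512, n_layers: int = 2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d)
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        h = self.encoder(self.embed(token_ids))          # (batch, seq_len, d)
        return h[:, 0, :]                                # first-token state as the text feature
```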
The ranking unit 113 takes as input the feature quantity u output by the context encoding unit 111 for the search query Q and the set of feature quantities v_i output by the image encoding unit 112 for the search-target images I_i (hereinafter, the "feature quantity set") {v_1, ..., v_i, ..., v_m}, and outputs the ordered set {I_1, ..., I_i, ..., I_k} of related images for the search query Q and the degree of relevance of each related image to Q, {S_1, ..., S_i, ..., S_k}.

The degree of relevance S_i between an image I_i and the search query Q is computed as S_i = f(u, v_i) using an appropriate distance function f. As a concrete implementation example, f may be the reciprocal of the inner product of u and v_i; the inputs of f are two vectors of the same dimension and its output is a scalar. Besides the reciprocal of the inner product, any distance function that can measure the distance between vectors may be used as f.
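A hypothetical sketch of the ranking unit 113 follows; the plain inner product is used here as the score f, but, as stated above, the function is pluggable (the reciprocal of the inner product or any other vector distance would fit the same slot).

```python
import numpy as np

def rank_images(u: np.ndarray, V: np.ndarray, k: int = 10):
    """u: (d,) query feature, V: (m, d) image features -> top-k (image index, S_i)."""
    scores = V @ u                        # S_i = f(u, v_i); plain inner product here
    order = np.argsort(-scores)[:k]       # most relevant images first
    return [(int(i), float(scores[i])) for i in order]
```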
The model learning unit 12 learns the model parameters of the context encoding unit 111 and the image encoding unit 112.

As preparation before learning, training data for the search task is collected in advance. For example, the data collected in "Thomee, B., Shamma, D. A., Friedland, G., Elizalde, B., Ni, K., Poland, D., Borth, D., and Li, L.-J. Yfcc100m: The new data in multimedia research. Communications of the ACM, 59(2):64-73, 2016" may be used. The training data consists of a text set T = {T_0, T_1, ..., T_c} and an image set I = {I_0, I_1, ..., I_m'}.

Furthermore, for each text T_i, the set of related images (taken to be positive examples of relevance to T_i) I_i = {I_j | I_j is a document related to T_i} is labeled as correct data. From this pre-collected dataset, one image (a positive example image) randomly extracted from the image set I_i related to each T_i is taken as I_i^+, and (T_i, I_i^+) forms one piece of training data. In other words, the training data prepared in advance is a set of pairs of a text and a positive example image.

The model learning unit 12 updates the model parameters of the context encoding unit 111 and the image encoding unit 112 by supervised learning using this training data. The model parameters of the context encoding unit 111 and the image encoding unit 112 are assumed to have been initialized in advance with appropriate initial values (when the model structure of Non-Patent Document 1 is used, the parameters of an existing trained model may be used as the initial parameters).

The model learning unit 12 updates the model parameters based on all the training data and repeats this an arbitrary number of times (this repeated process is called an "epoch", and the number of repetitions the "number of epochs"). The parameter update method may be the same as in general neural network training.
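As an illustration of how the (T_i, I_i^+) pairs described above can be assembled, the following sketch samples one labelled related image per text; the data layout (a dict mapping each text index to its related image ids) and the helper name are assumptions made for the example.

```python
import random

def build_training_pairs(texts, related_images):
    """texts: list of str; related_images: dict mapping text index -> list of related image ids."""
    pairs = []
    for i, text in enumerate(texts):
        positives = related_images.get(i, [])
        if not positives:
            continue                                      # skip texts with no labelled image
        pairs.append((text, random.choice(positives)))    # (T_i, I_i^+)
    return pairs
```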
The processing procedure that the search device 10 executes for model learning is described below. FIG. 3 is a flowchart for explaining an example of the processing procedure of the model parameter learning process.

In step S101, the model learning unit 12 randomly divides the training data into a plurality of mini-batches.

Subsequently, the model learning unit 12 executes loop processing L1 for every mini-batch. The mini-batch being processed in loop L1 is referred to as the "target batch". Within loop L1, the model learning unit 12 executes steps S102 to S109 for the target batch.

In step S102, the model learning unit 12 inputs the text T_i of each piece of training data in the target batch to the context encoding unit 111 and obtains, for each piece of training data, the vector (feature quantity u_i) generated by the context encoding unit 111. For example, if the mini-batch size is 10, ten feature quantities u_i are obtained.

Next, the model learning unit 12 inputs the image I_i^+ of each piece of training data in the target batch to the image encoding unit 112 and obtains, for each piece of training data, the vector (feature quantity v_i^+) generated by the image encoding unit 112 (S103). For example, if the mini-batch size is 10, ten feature quantities v_i^+ are obtained.
Next, for each image of the training data in the target batch, with T' denoting the set of texts in the target batch, the model learning unit 12 selects, for an arbitrary text T^- ∈ T' - {T_i}, one character string s_k that is an arbitrary noun contained in T^- and not contained in T_i (S104). Here, for a given piece of training data, T_i is the correct text. Therefore, for that training data, T' - {T_i} is the set of texts in the target batch excluding the correct text, and T^- is one text from that set — that is, one text in the target batch other than the correct text (and unrelated to the correct image). Accordingly, the character string s_k, an arbitrary noun contained in T^- and not contained in T_i, is a noun that does not appear in the correct text.

For example, suppose the image I_i^+ (correct image) of a piece of training data is an image of a dog, the text T_i of that training data is "A dog is running around in the park", and the text set T' of a mini-batch of size 10 is:
(1) A dog is running around in the park
(2) A cat is lying down in the garden
(3) ...
 :
(10) ...
In this case, T' - {T_i} is the nine texts (2) to (10). T^- is one text selected arbitrarily (randomly) from (2) to (10), for example "A cat is lying down in the garden". The character string s_k is then, for example, a noun contained in "A cat is lying down in the garden" and not contained in "A dog is running around in the park". Note that s_k need not be selected from T^-; it may instead be selected from the nouns contained in an entire vocabulary set (a vocabulary set not limited to the training data) that are not contained in T_i.
The character string s_k may be selected at random. Alternatively, word vectors may be obtained in advance with FastText or the like, and the word whose average distance from the words of the correct text is largest in the word-vector space may be selected as the character string s_k.
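A sketch of the second selection strategy is shown below, assuming word vectors are already available as a dict (in practice FastText, mentioned above, could supply them); the function and argument names are hypothetical.

```python
import numpy as np

def pick_negative_word(candidate_nouns, correct_nouns, word_vecs):
    """Return the candidate noun with the largest average distance to the correct text's nouns."""
    candidates = [w for w in candidate_nouns if w not in correct_nouns and w in word_vecs]

    def avg_dist(w):
        v = word_vecs[w]
        return float(np.mean([np.linalg.norm(v - word_vecs[c])
                              for c in correct_nouns if c in word_vecs]))

    return max(candidates, key=avg_dist)     # the farthest word becomes s_k
```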
Next, for each piece of training data in the target batch, the model learning unit 12 generates a negative example image I_ik^- by embedding (superimposing) the character string s_k for that training data onto a copy of the image I_i^+, using a random font size that does not exceed the image size of I_i^+ (S105). Several candidate font sizes may be given in advance, or a minimum size may be defined and the size then chosen continuously at random within that range. One or more images I_i^- may be generated per piece of training data; when several I_i^- are generated, several character strings s_k are selected for that training data.
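A sketch of step S105 follows, assuming Pillow for rendering; the font file, the size range, and the text position are illustrative choices, since the text above only requires that the string fits inside the image and that the font size is chosen at random.

```python
import random
from PIL import Image, ImageDraw, ImageFont

def embed_string(image: Image.Image, s_k: str, min_size: int = 12,
                 font_path: str = "DejaVuSans.ttf") -> Image.Image:
    """Return a copy of `image` (I_i^+) with `s_k` drawn on it -> negative image I_ik^-."""
    neg = image.copy()                                  # keep the positive image intact
    draw = ImageDraw.Draw(neg)
    max_size = max(min_size, neg.height // 4)           # keep the string inside the image
    font = ImageFont.truetype(font_path, random.randint(min_size, max_size))
    x = random.randint(0, max(1, neg.width // 2))
    y = random.randint(0, max(1, neg.height - max_size))
    draw.text((x, y), s_k, fill="white", font=font)
    return neg
```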
Next, the model learning unit 12 inputs the image I_ik^- of each piece of training data in the target batch to the image encoding unit 112 and obtains the feature quantity v_ik^- that the image encoding unit 112 generates for each I_ik^- (S106).
At this point, the target batch holds:
Text feature quantity set: {u_1, u_2, ..., u_b} (where b is the mini-batch size)
Image feature quantity set: {v_1^+, v_2^+, ..., v_b^+, v_11^-, ..., v_bl^-} (where l is the number of negative example images)
Next, the model learning unit 12 computes, for the positive and negative examples, the output values of the softmax function over the text-to-image and image-to-text inner products (S107).

This computation is explained with reference to FIG. 4, which is a diagram for explaining how the softmax output values and losses for text and images are calculated for the positive and negative examples. In FIG. 4, the text feature quantity set {u_1, u_2, ..., u_b} is arranged in the column direction and the image feature quantity set {v_1^+, v_2^+, ..., v_b^+, v_11^-, ..., v_bl^-} is arranged in the row direction.
For the image-to-text direction, the model learning unit 12 computes the inner product between the feature quantity v_i of an image and each feature quantity u_j of the text feature quantity set {u_1, u_2, ..., u_b} as x_ij = v_i · u_j. Computing this inner product for every u_j yields x_i1, x_i2, ..., x_ib, and based on these the model learning unit 12 computes the output values of the softmax function (hereinafter, "softmax output values") for each row of inner products in FIG. 4. Performing this for the positive image feature quantity set {v_1^+, v_2^+, ..., v_b^+} yields the softmax output values corresponding to the inner products of each positive row in FIG. 4.
The model learning unit 12 likewise computes the inner products between the feature quantity u_j of each text and the image feature quantity set {v_1^+, v_2^+, ..., v_b^+, v_11^-, ..., v_bl^-} and the corresponding softmax output values. Performing this for the text feature quantity set {u_1, u_2, ..., u_b} yields the (text-to-image) softmax output values corresponding to the inner products of every column in FIG. 4.

Next, based on the set of computed softmax output values, the model learning unit 12 computes a loss (softmax cross-entropy loss) for each positive row and each column in FIG. 4, and computes their average or sum as the loss for the target batch (S108). In this loss computation, the class label (correct label) for the image-to-text softmax output values (the row direction in FIG. 4) is set to 1 when i = j and to 0 otherwise, as shown in FIG. 4. The class label (correct label) for the text-to-image softmax output values (the column direction in FIG. 4) is set to 1 only at the positions of v_j^+ where i = j, as shown in FIG. 4.
Specifically, the cross-entropy loss function is H(p, q) = -Σ p(x) log q(x), where p(x) is the true distribution and q(x) is the predicted distribution. The class labels are applied as the true distribution and the softmax output values are fitted as the predicted distribution to compute the loss for each positive row or column. The average or sum of the losses in the row direction is the image loss, and the average or sum of the losses in the column direction is the text loss; the average or sum of the image loss and the text loss is taken as the loss for the target batch.
When an existing trained model is used for the initial model parameters, the loss may be computed using only the text-to-image softmax output values.
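The batch loss of steps S107–S108 can be written compactly as below. This is a minimal sketch assuming PyTorch, with U the b×d text features, Vp the b×d positive-image features, and Vn the features of the negative images; the positive rows and all columns are scored exactly as described above, and the average of the two directional losses is returned, though the sum, or the text-to-image term alone, are equally valid per the text.

```python
import torch
import torch.nn.functional as F

def batch_loss(U: torch.Tensor, Vp: torch.Tensor, Vn: torch.Tensor) -> torch.Tensor:
    """U: (b, d) text features u_i, Vp: (b, d) positives v_i^+, Vn: (n_neg, d) negatives v_ik^-."""
    b = U.size(0)
    V = torch.cat([Vp, Vn], dim=0)            # all image features, positives first
    logits = V @ U.t()                        # x_ij = v_i . u_j, shape (b + n_neg, b)
    targets = torch.arange(b, device=U.device)
    # image -> text: softmax over texts, computed only on the positive rows
    loss_image = F.cross_entropy(logits[:b], targets)
    # text -> image: softmax over all images (positives + negatives), one column per text
    loss_text = F.cross_entropy(logits.t(), targets)
    return (loss_image + loss_text) / 2       # average of image loss and text loss
```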
Next, the model learning unit 12 updates the model parameters of the context encoding unit 111 and the image encoding unit 112 based on the loss for the target batch (S109). Specifically, the model learning unit 12 computes the gradient of the loss with respect to each model parameter using backpropagation and updates the model parameters using an arbitrary optimization method.

When loop L1 has been executed for all mini-batches, the model learning unit 12 determines whether a predetermined termination condition is satisfied (S110). If the termination condition is not satisfied (No in S110), the model learning unit 12 repeats the processing from step S101. When the termination condition is satisfied (Yes in S110), the model learning unit 12 ends the processing of FIG. 3.
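Putting the pieces together, the overall procedure of FIG. 3 can be outlined as follows. This is an illustrative skeleton, assuming PyTorch, the `batch_loss` sketch above, and a hypothetical `collate_batch` helper that tokenizes the texts and performs steps S104–S105 (negative string selection and embedding); the optimizer and the fixed-epoch termination condition are free choices, not prescribed by the text.

```python
import random
import torch

def train(text_encoder, image_encoder, training_pairs, collate_batch,
          batch_size: int = 64, max_epochs: int = 10, lr: float = 1e-5):
    params = list(text_encoder.parameters()) + list(image_encoder.parameters())
    optimizer = torch.optim.AdamW(params, lr=lr)
    for epoch in range(max_epochs):                               # S110: stop after max_epochs
        random.shuffle(training_pairs)                            # S101: random mini-batches
        for start in range(0, len(training_pairs), batch_size):   # loop L1 over target batches
            batch = training_pairs[start:start + batch_size]
            tokens, pos_images, neg_images = collate_batch(batch) # S104-S105 happen inside
            U = text_encoder(tokens)                              # S102: text features u_i
            Vp = image_encoder(pos_images)                        # S103: positive features v_i^+
            Vn = image_encoder(neg_images)                        # S106: negative features v_ik^-
            loss = batch_loss(U, Vp, Vn)                          # S107-S108: batch loss
            optimizer.zero_grad()
            loss.backward()                                       # S109: backpropagation
            optimizer.step()                                      #        and parameter update
```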
As described above, according to the present embodiment, the embedding models for text and images are trained using images in which character strings that are negative examples of (i.e., unrelated to) relevance to the text are embedded. The influence of unrelated character strings embedded in images can therefore be reduced for the text and image embedding model, and, as a result, a model that is robust against, for example, adversarial character-embedding attacks can be learned.

However, if the model is trained too strongly to reduce the influence of characters in the image, the resulting model becomes insensitive to information that is actually needed. In the present embodiment, therefore, learning is also performed on the positive examples. This reduces the influence of unrelated characters while preserving the ability to recognize characters that are naturally embedded in images.

Regarding the above embodiments, the following supplementary notes are further disclosed.
(Supplementary note 1)
A learning device comprising:
a memory; and
at least one processor connected to the memory,
wherein the processor:
acquires feature quantities of a plurality of texts using a first model;
acquires, for each text, using a second model, a feature quantity of a first image that is a positive example of relevance to the text and a feature quantity of a second image in which a character string not contained in the text is embedded in the first image; and
calculates a loss based on the feature quantity of the text, the feature quantity of the first image, and the feature quantity of the second image, and updates parameters of the first model and the second model based on the loss.
(Supplementary note 2)
A recording medium recording a program that causes a computer to execute a process comprising:
acquiring feature quantities of a plurality of texts using a first model;
acquiring, for each text, using a second model, a feature quantity of a first image that is a positive example of relevance to the text and a feature quantity of a second image in which a character string not contained in the text is embedded in the first image; and
calculating a loss based on the feature quantity of the text, the feature quantity of the first image, and the feature quantity of the second image, and updating parameters of the first model and the second model based on the loss.
In the present embodiment, the search device 10 is an example of a learning device, and the model learning unit 12 is an example of a first acquisition unit, a second acquisition unit, and a learning unit.

Although the embodiments of the present invention have been described in detail above, the present invention is not limited to these specific embodiments, and various modifications and changes are possible within the scope of the gist of the present invention as set forth in the claims.
10   Search device
11   Search unit
12   Model learning unit
100  Drive device
101  Recording medium
102  Auxiliary storage device
103  Memory device
104  Processor
105  Interface device
111  Context encoding unit
112  Image encoding unit
113  Ranking unit
B    Bus

Claims (5)

  1.  A learning device comprising:
      a first acquisition unit that acquires feature quantities of a plurality of texts using a first model;
      a second acquisition unit that acquires, for each text, using a second model, a feature quantity of a first image that is a positive example of relevance to the text and a feature quantity of a second image in which a character string not contained in the text is embedded in the first image; and
      a learning unit that calculates a loss based on the feature quantity of the text, the feature quantity of the first image, and the feature quantity of the second image, and updates parameters of the first model and the second model based on the loss.
  2.  The learning device according to claim 1, wherein the learning unit calculates, for each text, inner products between the feature quantity of the text and the feature quantities of the first images and the second images, calculates, for each first image, inner products between the feature quantity of the first image and the feature quantities of the respective texts, and calculates the loss based on a cross-entropy loss of softmax function output values of the inner products for each text and a cross-entropy loss of softmax function output values of the inner products for each first image.
  3.  The learning device according to claim 1 or 2, wherein the second acquisition unit generates, for each text, the second image by embedding a character string not contained in the text in the first image that is a positive example of relevance to the text.
  4.  A learning method executed by a computer, comprising:
      a first acquisition procedure of acquiring feature quantities of a plurality of texts using a first model;
      a second acquisition procedure of acquiring, for each text, using a second model, a feature quantity of a first image that is a positive example of relevance to the text and a feature quantity of a second image in which a character string not contained in the text is embedded in the first image; and
      a learning procedure of calculating a loss based on the feature quantity of the text, the feature quantity of the first image, and the feature quantity of the second image, and updating parameters of the first model and the second model based on the loss.
  5.  A program that causes a computer to execute:
      a first acquisition procedure of acquiring feature quantities of a plurality of texts using a first model;
      a second acquisition procedure of acquiring, for each text, using a second model, a feature quantity of a first image that is a positive example of relevance to the text and a feature quantity of a second image in which a character string not contained in the text is embedded in the first image; and
      a learning procedure of calculating a loss based on the feature quantity of the text, the feature quantity of the first image, and the feature quantity of the second image, and updating parameters of the first model and the second model based on the loss.
PCT/JP2022/031921 2022-08-24 2022-08-24 Training device, training method, and program WO2024042650A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP2022/031921 WO2024042650A1 (en) 2022-08-24 2022-08-24 Training device, training method, and program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2022/031921 WO2024042650A1 (en) 2022-08-24 2022-08-24 Training device, training method, and program

Publications (1)

Publication Number Publication Date
WO2024042650A1 true WO2024042650A1 (en) 2024-02-29

Family

ID=90012786

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2022/031921 WO2024042650A1 (en) 2022-08-24 2022-08-24 Training device, training method, and program

Country Status (1)

Country Link
WO (1) WO2024042650A1 (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200380403A1 (en) * 2019-05-30 2020-12-03 Adobe Inc. Visually Guided Machine-learning Language Model

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200380403A1 (en) * 2019-05-30 2020-12-03 Adobe Inc. Visually Guided Machine-learning Language Model

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
RADFORD ALEC, KIM JONG WOOK, HALLACY CHRIS, RAMESH ADITYA, GOH GABRIEL, AGARWAL SANDHINI, SASTRY GIRISH, ASKELL AMANDA, MISHKIN PA: "Learning transferable visual models from natural language supervision", 26 February 2021 (2021-02-26), XP093067451, Retrieved from the Internet <URL:https://arxiv.org/pdf/2103.00020.pdf> [retrieved on 20230726], DOI: 10.48550/arXiv.2103.00020 *
TOM B. BROWN; DANDELION MANÉ; AURKO ROY; MARTÍN ABADI; JUSTIN GILMER: "Adversarial Patch", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 27 December 2017 (2017-12-27), 201 Olin Library Cornell University Ithaca, NY 14853, XP080848635 *

Similar Documents

Publication Publication Date Title
WO2023065544A1 (en) Intention classification method and apparatus, electronic device, and computer-readable storage medium
CN111444320B (en) Text retrieval method and device, computer equipment and storage medium
CN110619034A (en) Text keyword generation method based on Transformer model
CN111680494B (en) Similar text generation method and device
CN110852110B (en) Target sentence extraction method, question generation method, and information processing apparatus
CN112348911B (en) Semantic constraint-based method and system for generating fine-grained image by stacking texts
CN116450796B (en) Intelligent question-answering model construction method and device
CN116151132B (en) Intelligent code completion method, system and storage medium for programming learning scene
CN108536735B (en) Multi-mode vocabulary representation method and system based on multi-channel self-encoder
CN111046178B (en) Text sequence generation method and system
CN112380319A (en) Model training method and related device
CN116738994A (en) Context-enhanced-based hinting fine-tuning relation extraction method
RU2712101C2 (en) Prediction of probability of occurrence of line using sequence of vectors
CN116737938A (en) Fine granularity emotion detection method and device based on fine tuning large model online data network
CN111145914B (en) Method and device for determining text entity of lung cancer clinical disease seed bank
CN115329120A (en) Weak label Hash image retrieval framework with knowledge graph embedded attention mechanism
CN111444720A (en) Named entity recognition method for English text
CN110956039A (en) Text similarity calculation method and device based on multi-dimensional vectorization coding
CN114048314A (en) Natural language steganalysis method
CN116226357B (en) Document retrieval method under input containing error information
WO2024042650A1 (en) Training device, training method, and program
WO2023192674A1 (en) Attention neural networks with parallel attention and feed-forward layers
CN113486160B (en) Dialogue method and system based on cross-language knowledge
WO2022185457A1 (en) Feature quantity extraction device, learning device, feature quantity extraction method, learning method, and program
CN117312506B (en) Page semantic information extraction method and system

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22956478

Country of ref document: EP

Kind code of ref document: A1