WO2023281659A1

WO2023281659A1 - Learning device, estimation device, learning method, and program

Info

Publication number: WO2023281659A1
Application number: PCT/JP2021/025614
Authority: WO
Inventors: 翔太折橋; 亮増村
Original assignee: 日本電信電話株式会社
Priority date: 2021-07-07
Filing date: 2021-07-07
Publication date: 2023-01-12
Also published as: JPWO2023281659A1

Abstract

A learning device according to the present invention includes a prior learning unit and a fine tuning unit. The prior learning unit acquires an auxiliary language data set, which is a data set of paired data of an image and text in a language which is not a target language (hereinafter, the auxiliary language), and a target language data set, which is a data set of paired data of an image and text in the target language; trains an intermediate language information recognition model; and outputs a parameter of the intermediate language information recognition model as a trained parameter. The fine tuning unit acquires the target language data set and the trained parameter, and trains a language information recognition model using the trained parameter as an initial value for the language information recognition model.

Description

LEARNING DEVICE, ESTIMATION DEVICE, LEARNING METHOD, PROGRAM

The present invention relates to a learning device, an estimation device, a learning method, and a program.

Scene images obtained by shooting landscapes contain a lot of text information necessary for understanding the images, such as traffic signs and advertising billboards. Scene character recognition is a task of inputting an image obtained by extracting a character region from such a scene image, recognizing the captured character, and converting it into a character string that can be processed by a machine. In recent years, with the progress of deep learning technology, a method for realizing scene character recognition using a continuous model has been proposed.

For example, Non-Patent Literature 1 and Non-Patent Literature 2 provide scene character recognition technology using a model consisting of an encoder and a decoder, as schematically shown in FIG. At this time, the encoder is composed of, for example, a function of extracting image features using a convolutional neural network, and a function of transforming into features considering sequences by the Transformer encoder provided in Reference Non-Patent Document 1 (Reference Non-Patent Document 1: A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A.N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in Neural Information Processing Systems (NIPS ), pp. 5998-6008, 2017.). The decoder is composed of, for example, the Transformer decoder provided in Reference 1, and an autoregressive model using an embedding layer and an output layer, and outputs the character string generation probability from the features extracted by the encoder. In learning, by using a large amount of image-text pair data in the target language as training data, the parameters of the encoder and decoder can be optimized by, for example, error backpropagation.

　When learning a scene character recognition model as realized by Non-Patent Document 1 and Non-Patent Document 2, a large amount of paired data of images and text in the target language is required as training data. For example, when the target language is a language with many speakers and abundant resources, such as English, collection of pair data is relatively easy because a large number of available pair data are provided. However, when targeting a language such as Japanese that has few speakers, there is a problem that there is little provision of easily available pair data. In particular, learning a complex model such as Transformer shown in Reference 1 requires a large amount of teacher data, but it is difficult to collect a large amount of pair data in a language with few resources.

Therefore, the object of the present invention is to provide a learning device that can learn a highly accurate scene character recognition model even when there is little pair data of the target language.

The learning device of the present invention includes a pre-learning section and a fine-tuning section.

The pre-learning part consists of an auxiliary language dataset that is a dataset of image-text pair data in a language that is not the target language (hereinafter referred to as an auxiliary language), and a target language that is a dataset of image-text pair data in the target language A data set is acquired, an intermediate language information recognition model is learned, and the parameters of the intermediate language information recognition model are output as learned parameters. The fine-tuning unit acquires the target language data set and the learned parameters, sets the learned parameters as initial values of the language information recognition model, and learns the language information recognition model.

According to the learning device of the present invention, it is possible to learn a highly accurate scene character recognition model even when there is little pair data of the target language.

FIG. 4 is a diagram showing an example of a conventional character recognition model; FIG. 2 is a block diagram showing the functional configuration of the learning device according to the first embodiment; FIG. 4 is a flow chart showing the operation of the learning device according to the first embodiment; FIG. 2 is a block diagram showing the configuration of the estimation device according to the first embodiment; FIG. FIG. 11 is a block diagram showing the functional configuration of the learning device according to the second embodiment; 10 is a flowchart showing the operation of the learning device of Example 2; FIG. 2 is a block diagram showing the configuration of an estimation device according to a second embodiment; FIG. FIG. 11 is a block diagram showing the functional configuration of the learning device of Example 3; 10 is a flowchart showing the operation of the learning device of Example 3; FIG. 11 is a block diagram showing the configuration of an estimation device according to a third embodiment; The figure which shows the result of a verification experiment. 4A and 4B are diagrams showing examples of recognition results; FIG. The figure which shows the functional structural example of a computer.

In the following examples, an example of application to scene character recognition is to input an image obtained by cutting out a character part from a scene image, such as Non-Patent Document 1 or Non-Patent Document 2, and output a corresponding character string. , the invention is not limited to scene character recognition. In other words, the present invention can be applied to general techniques for outputting character strings from an image through an arbitrary encoder/decoder type one-through model. For example, the present invention can be applied to optical character recognition, caption generation for still and moving images, lip reading, and the like. Therefore, the model handled by the present invention is a model (linguistic information recognition model) handling general tasks for recognizing linguistic information from images. Specific examples of linguistic information recognition models include a character recognition model that recognizes images in which characters are displayed, such as optical character recognition, a caption generation model that generates text corresponding to the content of still images and moving images, and a mouth clipping model. There are the lip-reading model, which estimates spoken text from captured video, and the above-mentioned scene character recognition model. In the following examples, a detailed description will focus on the scene character recognition model.

In the following examples, even if the dataset of paired data in the target language (hereinafter referred to as the target language dataset) is small, in order to realize character recognition with high accuracy, in a language that is not the target language (hereinafter referred to as the auxiliary language) Utilize paired data datasets (hereinafter referred to as auxiliary language datasets). Here, the auxiliary language dataset is used for pre-training rather than simply being mixed with the target language dataset. In particular, by pre-training the encoder, which is responsible for extracting features from images, with a large-scale dataset consisting of the target language dataset and auxiliary language dataset, and pre-training the decoder, which is responsible for character string output, with the target language dataset, Improve the accuracy of scene character recognition for the target language.

The functional configuration of the learning device of the first embodiment will be described below with reference to FIG. As shown in the figure, the learning device 1 of this embodiment includes a pre-learning section 11 and a fine-tuning section 12 . The auxiliary language data set storage unit 91, the target language data set storage unit 92, and the scene character recognition model storage unit 93 shown in FIG. good. FIG. 3 shows the operation of the learning device 1 of this embodiment. The operation of the learning device 1 of this embodiment includes pre-learning processing and fine-tuning processing.

<Pre-learning processing>
The pre-learning unit 11 acquires an auxiliary language data set from the auxiliary language data set storage unit 91, acquires a target language data set from the target language data set storage unit 92, learns an intermediate scene character recognition model, and recognizes an intermediate scene. The parameters of the character recognition model are output as learned parameters (S11). The process of "learning an intermediate scene character recognition model and outputting parameters of the intermediate scene character recognition model as learned parameters" is called pre-learning process. Also, the term “learning” here refers to training a model so that an image of paired data included in a data set is input and text is output.

The pre-learning unit 11 provides pairs of images and character strings included in a dataset obtained by mixing the target language dataset and the auxiliary language dataset as input/output correct data pairs of the scene character recognition model, and calculates the error Any learning method, such as backpropagation, may be used to optimize the parameters.

The scene character recognition model is a continuous scene character recognition model with an encoder/decoder structure, as shown in Non-Patent Document 1 and Non-Patent Document 2, for example. At this time, the encoder may be configured with, for example, a function of extracting image features using a convolutional neural network, and a function of transforming into features considering sequences using a Transformer encoder provided in Reference Non-Patent Document 1. Also, the decoder may be composed of, for example, a transformer decoder provided in Reference Non-Patent Document 1, and an autoregressive model using an embedding layer and an output layer. Note that the convolutional neural network of the encoder may be trained using parameters learned in advance in any task such as object recognition as initial values, and other parts such as the Transformer encoder are also learned separately in advance. It is also possible to learn using the parameters obtained as initial values.

The target language dataset is a dataset of paired data in the target language to be estimated. For example, it may be a language with few speakers and few data resources, such as Japanese. On the other hand, the auxiliary language data set is a data set of paired data in a language other than the target language, and may be a language such as English that has many speakers and many data resources. However, this is only an example, and the present invention is not limited to a case where the number of data in the target language dataset is small and the number of data in the auxiliary language dataset is large. Furthermore, both the target language and the auxiliary language need not be a single language, and either or both of the target language and the auxiliary language may be a set of multiple languages. The vocabulary dictionary output by the scene character recognition model can be a dictionary that covers the vocabulary of both the target language and the auxiliary language.

<Fine tuning process>
The fine tuning unit 12 acquires the target language data set from the target language data set storage unit 92, acquires the learned parameters of the intermediate scene character recognition model from the pre-learning unit 11, and sets the learned parameters of the intermediate scene character recognition model. is the initial value of the scene character recognition model, the scene character recognition model is learned, and the scene character recognition model is output (S12). The output scene character recognition model is stored in the scene character recognition model storage unit 93 . The process of "learning the scene character recognition model using the learned parameters of the intermediate scene character recognition model as the initial values of the scene character recognition model and outputting the scene character recognition model" is called fine-tuning process.

The fine-tuning unit 12 provides pairs of images and character strings included in the target language data set as pairs of input/output correct data for the scene character recognition model, and uses an arbitrary learning method such as error backpropagation to obtain parameters. should be optimized. At this time, learning may be performed after deleting the auxiliary language vocabulary and related parameters from the vocabulary dictionary output by the scene character recognition model.

As mentioned above, in order to improve the accuracy of character recognition in the target language, the auxiliary language dataset is used to generate a scene image based on a large-scale dataset that is a mixture of the target language pair data and the auxiliary language pair data. You can pre-train a character recognition model. This makes it possible to improve the accuracy of character recognition in the target language even when there is little pair data in the target language.

<Estimation device 100>
As shown in FIG. 4, the estimating device 100 of this embodiment receives an image as an input, estimates a character string corresponding to the image using the scene character recognition model learned by the learning device 1 described above, and obtains an estimation result to output

The functional configuration of the learning device of the second embodiment will be described below with reference to FIG. As shown in the figure, the learning device 2 of this embodiment includes an encoder pre-learning section 21 and a fine tuning section 22 . The auxiliary language data set storage unit 91, the target language data set storage unit 92, and the scene character recognition model storage unit 93 shown in FIG. good. FIG. 6 shows the operation of the learning device 2 of this embodiment. The operation of the learning device 2 of this embodiment includes encoder pre-learning processing and fine-tuning processing.

<Encoder pre-learning processing>
The encoder pre-learning unit 21 acquires an auxiliary language data set from the auxiliary language data set storage unit 91, acquires a target language data set from the target language data set storage unit 92, and performs intermediate scene character recognition configured by an encoder and a decoder. The model is learned, and the parameters of only the encoder portion are output as the learned parameters of the intermediate encoder (S21). The process of "learning an intermediate scene character recognition model composed of an encoder and a decoder, extracting parameters of only the encoder part, and outputting them as learned parameters of the intermediate encoder" is called encoder pre-learning process.

The encoder pre-learning unit 21 can use the same method as the pre-learning process in the first embodiment, and the combination of the image and the character string included in the data set obtained by mixing the target language data set and the auxiliary language data set is given as a set of input/output correct data of the scene character recognition model, and the parameters are optimized by an arbitrary learning method such as error backpropagation. At this time, the vocabulary dictionary output by the scene character recognition model may be a dictionary that covers the vocabulary of both the target language and the auxiliary language. Note that the intermediate encoder does not necessarily have to be the entire encoder shown in FIG. It may be up to the middle layer of the Transformer encoder consisting of layers.

<Fine tuning process>
The fine tuning unit 22 acquires the target language data set from the target language data set storage unit 92, acquires the intermediate encoder from the encoder pre-learning unit 21, and converts the learned parameters of the intermediate encoder to the initial values of the encoder of the scene character recognition model. , a scene character recognition model is learned from the target language data set, and the scene character recognition model is output (S22). The output scene character recognition model is stored in the scene character recognition model storage unit 93 . The process of "giving the learned parameters of the intermediate encoder as the initial values of the encoder of the scene character recognition model, learning the scene character recognition model with the target language data set, and outputting the scene character recognition model" is called the fine tuning process. call.

The fine-tuning unit 22 can use the same method as the fine-tuning process in the first embodiment. and optimize the parameters by any learning method such as error backpropagation. Here, the initial values of the parts not included in the intermediate encoder, including the decoder, may be given by any initialization method such as random initialization. Also, the vocabulary dictionary output by the scene character recognition model may be a dictionary that covers the vocabulary of the target language.

As described above, pre-training the encoder with a large dataset that is a mixture of the auxiliary language dataset and the target language dataset increases the number of images that the encoder learns. As a result, the encoder, which has the function of extracting image features, can be trained using data with various character shapes and background images, so the robustness of the encoder to input images can be improved.

<Estimation device 200>
As shown in FIG. 7, the estimating device 200 of the present embodiment receives an image as an input, estimates a character string corresponding to the image using the scene character recognition model learned by the learning device 2 described above, and obtains an estimation result to output

The functional configuration of the learning device of Example 3 will be described below with reference to FIG. As shown in the figure, the learning device 3 of this embodiment includes an encoder pre-learning section 31, a decoder pre-learning section 32, and a fine tuning section 33. FIG. The auxiliary language data set storage unit 91, the target language data set storage unit 92, and the scene character recognition model storage unit 93 shown in FIG. good. FIG. 9 shows the operation of the learning device 3 of this embodiment. The operation of the learning device 3 of this embodiment includes an encoder pre-learning process, a decoder pre-learning process, and a fine tuning process.

<Encoder pre-learning processing>
As in the second embodiment, the encoder pre-learning unit 31 acquires the auxiliary language dataset from the auxiliary language dataset storage unit 91, acquires the target language dataset from the target language dataset storage unit 92, and uses the encoder and decoder to The constructed intermediate scene character recognition model is learned, and only the encoder portion is cut out and output as an intermediate encoder (S31). The process of "learning an intermediate scene character recognition model composed of an encoder and a decoder, cutting out only the encoder part, and outputting it as an intermediate encoder" is called encoder pre-learning process.

<Decoder pre-learning process>
Decoder pre-learning unit 32 receives the target language data set as input, learns a second intermediate scene character recognition model composed of an encoder and a decoder, and extracts parameters only for the decoder part of the second intermediate scene character recognition model. are output as learned parameters of the intermediate decoder (S32). It should be noted that "a second intermediate scene character recognition model composed of an encoder and a decoder is learned, parameters of only the decoder portion of the second intermediate scene character recognition model are extracted, and output as learned parameters of the intermediate decoder". The processing is called decoder pre-learning processing.

The decoder pre-learning unit 32 gives pairs of images and character strings included in the target language data set as pairs of correct data for input and output of the scene character recognition model, and uses an arbitrary learning method such as error backpropagation to obtain parameters should be optimized. At this time, the vocabulary dictionary output by the scene character recognition model may be a dictionary that covers the vocabulary of the target language. Note that the intermediate decoder does not necessarily have to be the entire decoder shown in FIG. good. Furthermore, only the embedded layer and the output layer may be intermediate decoders.

<Fine tuning process>
The fine-tuning unit 33 receives the target language data set, the intermediate encoder, and the intermediate decoder as inputs, and gives the learned parameters of the intermediate encoder and the intermediate decoder as initial values for the encoder and decoder of the scene character recognition model, respectively. Then, the scene character recognition model is learned and the scene character recognition model is output (S33). It should be noted that ``The learned parameters of the intermediate encoder and intermediate decoder are given as the initial values of the encoder and decoder of the scene character recognition model, respectively, the scene character recognition model is learned with the target language data set, and the scene character recognition model is output.'' The process is called fine-tuning process. The output scene character recognition model is stored in the scene character recognition model storage unit 93 .

The fine-tuning unit 33 can use the same method as the fine-tuning process in the first embodiment. and optimize the parameters by any learning method such as error backpropagation. Here, the initial values of the portions not included in the intermediate encoder and intermediate decoder may be given by any initialization method such as random initialization. Also, the vocabulary dictionary output by the scene character recognition model may be a dictionary that covers the vocabulary of the target language.

As described above, pre-training the decoder with the dataset of the target language allows the decoder to specialize in the target language. Combining decoder pre-training and encoder pre-training can effectively improve the accuracy of scene character recognition in the target language.

<Estimation device 300>
As shown in FIG. 10, the estimating device 300 of this embodiment receives an image as an input, uses the scene character recognition model learned by the learning device 3 to estimate a character string corresponding to the image, and obtains an estimation result to output

<effect>
With the learning apparatus described in the above embodiments, the encoder is pre-trained using not only the target language data set, but also the auxiliary language data set. As a result, the number of images that the encoder learns from increases, so the encoder, which is responsible for extracting image features, can be trained using data with a variety of character shapes and background images. Improves robustness. Additionally, the decoder is pre-trained using the target language dataset. As a result, the decoder is specialized for the target language, and by combining with pre-training of the encoder, it is possible to effectively improve the accuracy of scene character recognition in the target language.

Verification experiments were conducted on the scene character recognition model with the structure described in Non-Patent Document 1 and Non-Patent Document 2. The target language is Japanese, and about 70,000 sets of pair data were prepared as training data. English was used as the auxiliary language, and about 8,000,000 sets of paired data were prepared as training data. The teacher data includes the baseline learned from scratch using the target language dataset without using the auxiliary language dataset or pre-learning, and the case of learning using the learning methods according to Examples 1, 2, and 3. We verified the accuracy of character recognition for approximately 7,000 images of target languages that cannot be read. For the evaluation, the accuracy rate by perfect match was used. The results of verification experiments are shown in FIG. As shown in the figure, even with the learning method of Example 1 in which learning is performed simply using the auxiliary language data set, the accuracy rate is improved compared to the baseline. Furthermore, as in Example 3, pre-learning the encoder with the auxiliary language data set and the target language data set and pre-learning the decoder with the target language data set further improves the accuracy rate.

　Fig. 12 shows an example of recognition results in the scene character recognition model with the structure described in Non-Patent Document 2. As shown in (1) to (3) of FIG. 12, it can be seen that erroneous recognition is prevented by performing the pre-learning process according to the first embodiment. In addition, as shown in (4) of the same figure, it can be seen that the decoder pre-learning process according to the third embodiment can prevent the recognition of words and phrases that do not hold true.

<Addendum>
The apparatus of the present invention includes, for example, a single hardware entity, which includes an input unit to which a keyboard can be connected, an output unit to which a liquid crystal display can be connected, and a communication device (for example, a communication cable) capable of communicating with the outside of the hardware entity. can be connected to the communication unit, CPU (Central Processing Unit, which may include cache memory, registers, etc.), memory RAM and ROM, external storage device such as hard disk, input unit, output unit, communication unit , a CPU, a RAM, a ROM, and a bus for connecting data to and from an external storage device. Also, if necessary, the hardware entity may be provided with a device (drive) capable of reading and writing a recording medium such as a CD-ROM. A physical entity with such hardware resources includes a general purpose computer.

The external storage device of the hardware entity stores a program necessary for realizing the functions described above and data required for the processing of this program (not limited to the external storage device; It may be stored in a ROM, which is a dedicated storage device). Data obtained by processing these programs are appropriately stored in a RAM, an external storage device, or the like.

In the hardware entity, each program stored in an external storage device (or ROM, etc.) and the data necessary for processing each program are read into the memory as needed, and interpreted, executed and processed by the CPU as appropriate. . As a result, the CPU realizes a predetermined function (each component expressed as above, . . . unit, . . . means, etc.).

The present invention is not limited to the above-described embodiments, and modifications can be made as appropriate without departing from the scope of the present invention. Further, the processes described in the above embodiments are not only executed in chronological order according to the described order, but may also be executed in parallel or individually according to the processing capacity of the device that executes the processes or as necessary. .

As described above, when the processing functions of the hardware entity (apparatus of the present invention) described in the above embodiments are implemented by a computer, the processing contents of the functions that the hardware entity should have are described by a program. By executing this program on a computer, the processing functions of the hardware entity are realized on the computer.

The various processes described above are performed by loading a program for executing each step of the above method into the recording unit 10020 of the computer 10000 shown in FIG. can.

A program that describes this process can be recorded on a computer-readable recording medium. Any computer-readable recording medium may be used, for example, a magnetic recording device, an optical disk, a magneto-optical recording medium, a semiconductor memory, or the like. Specifically, for example, as magnetic recording devices, hard disk devices, flexible disks, magnetic tapes, etc., as optical discs, DVD (Digital Versatile Disc), DVD-RAM (Random Access Memory), CD-ROM (Compact Disc Read Only Memory), CD-R (Recordable) / RW (ReWritable), etc. as magneto-optical recording media, such as MO (Magneto-Optical disc), etc. as semiconductor memory, EEP-ROM (Electrically Erasable and Programmable-Read Only Memory), etc. can be used.

In addition, the distribution of this program is carried out, for example, by selling, assigning, lending, etc. portable recording media such as DVDs and CD-ROMs on which the program is recorded. Further, the program may be distributed by storing the program in the storage device of the server computer and transferring the program from the server computer to other computers via the network.

A computer that executes such a program, for example, first stores the program recorded on a portable recording medium or the program transferred from the server computer once in its own storage device. Then, when executing the process, this computer reads the program stored in its own recording medium and executes the process according to the read program. Also, as another execution form of this program, the computer may read the program directly from a portable recording medium and execute processing according to the program, and the program is transferred from the server computer to this computer. Each time, the processing according to the received program may be executed sequentially. In addition, the above-mentioned processing is executed by a so-called ASP (Application Service Provider) type service, which does not transfer the program from the server computer to this computer, and realizes the processing function only by its execution instruction and result acquisition. may be It should be noted that the program in this embodiment includes information that is used for processing by a computer and that conforms to the program (data that is not a direct instruction to the computer but has the property of prescribing the processing of the computer, etc.).

Also, in this embodiment, a hardware entity is configured by executing a predetermined program on a computer, but at least part of these processing contents may be implemented by hardware.

Regarding the above embodiments, the following additional remarks are disclosed.

(Appendix 1)
memory;
at least one processor connected to the memory;
including
The processor
Acquire the auxiliary language dataset, which is a dataset of image and text pair data in a language that is not the target language (hereinafter referred to as auxiliary language), and the target language dataset, which is a dataset of image and text pair data in the target language. to learn an intermediate language information recognition model, output parameters of the intermediate language information recognition model as learned parameters,
A learning device that acquires the target language data set and the learned parameter, sets the learned parameter as an initial value of the language information recognition model, and learns the language information recognition model.
(Appendix 2)
A non-transitory storage medium storing a program executable by a computer to perform a learning process,
The learning process includes
Acquire the auxiliary language dataset, which is a dataset of image and text pair data in a language that is not the target language (hereinafter referred to as auxiliary language), and the target language dataset, which is a dataset of image and text pair data in the target language. to learn an intermediate language information recognition model, output parameters of the intermediate language information recognition model as learned parameters,
A non-temporary storage medium that acquires the target language data set and the learned parameters, sets the learned parameters as initial values of a language information recognition model, and learns the language information recognition model.
(Appendix 3)
The learning device according to appendix 1,
The intermediate language information recognition model and the language information recognition model are composed of an encoder and a decoder,
outputting parameters of only the encoder part of the intermediate language information recognition model as learned parameters of the intermediate encoder;
A learning device that provides the learned parameters of the intermediate encoder as initial values of the encoder of the language information recognition model.
(Appendix 4)
The non-transitory storage medium according to appendix 2,
The intermediate language information recognition model and the language information recognition model are composed of an encoder and a decoder,
outputting parameters of only the encoder part of the intermediate language information recognition model as learned parameters of the intermediate encoder;
A non-temporary storage medium that provides learned parameters of the intermediate encoder as initial values of the encoder of the language information recognition model.
(Appendix 5)
The learning device according to additional item 3,
outputting parameters of only the decoder portion of the intermediate language information recognition model as learned parameters of the intermediate decoder;
A learning device that provides the learned parameters of the intermediate decoder as initial values of the decoder of the language information recognition model.
(Appendix 6)
The non-transitory storage medium according to appendix 4,
outputting parameters of only the decoder portion of the intermediate language information recognition model as learned parameters of the intermediate decoder;
A non-temporary storage medium that provides the learned parameters of the intermediate decoder as initial values of the decoder of the language information recognition model.
(Appendix 7)
memory;
at least one processor connected to the memory;
including
The processor
An estimation device for estimating a character string corresponding to an image by using a language information recognition model learned by the learning device according to any one of

supplementary items

1, 3, and 5 using an image as an input.
(Appendix 8)
A non-temporary storage medium for estimating a character string corresponding to the image using a language information recognition model learned by the non-temporary storage medium according to any one of

additional items

2, 4, and 6.

Claims

Acquire the auxiliary language dataset, which is a dataset of image and text pair data in a language that is not the target language (hereinafter referred to as auxiliary language), and the target language dataset, which is a dataset of image and text pair data in the target language. a pre-learning unit that learns an intermediate language information recognition model and outputs parameters of the intermediate language information recognition model as learned parameters;
A learning device including a fine-tuning unit that acquires the target language data set and the learned parameters, sets the learned parameters as initial values of a language information recognition model, and learns the language information recognition model.
The learning device according to claim 1,
The intermediate language information recognition model and the language information recognition model are composed of an encoder and a decoder,
The pre-learning unit outputs parameters of only the encoder part of the intermediate language information recognition model as learned parameters of the intermediate encoder,
The fine-tuning unit provides the learned parameters of the intermediate encoder as initial values of the encoder of the language information recognition model. A learning device.
The learning device according to claim 2,
The pre-learning unit
learning a second intermediate language information recognition model using the target language data set as input, outputting parameters of only the decoder part of the second intermediate language information recognition model as trained parameters of the intermediate decoder;
The fine-tuning unit provides the learned parameters of the intermediate decoder as initial values of the decoder of the language information recognition model. Learning device.
An estimation device that receives an image as an input and estimates a character string corresponding to the image by using a language information recognition model learned by the learning device according to any one of claims 1 to 3.
A learning method executed by a learning device,
Acquire the auxiliary language dataset, which is a dataset of image and text pair data in a language that is not the target language (hereinafter referred to as auxiliary language), and the target language dataset, which is a dataset of image and text pair data in the target language. a step of learning an intermediate language information recognition model and outputting parameters of the intermediate language information recognition model as learned parameters;
A learning method comprising the step of acquiring the target language data set and the learned parameters, setting the learned parameters as initial values of a language information recognition model, and learning the language information recognition model.
A program that causes a computer to function as the learning device according to any one of claims 1 to 3.
A program that causes a computer to function as the estimation device according to claim 4.