WO2023281659A1 - Learning device, estimation device, learning method, and program - Google Patents

Learning device, estimation device, learning method, and program

Info

Publication number
WO2023281659A1
WO2023281659A1 · PCT/JP2021/025614 · JP2021025614W
Authority
WO
WIPO (PCT)
Prior art keywords
language
recognition model
learning
information recognition
encoder
Prior art date
Application number
PCT/JP2021/025614
Other languages
French (fr)
Japanese (ja)
Inventor
翔太 折橋
亮 増村
Original Assignee
日本電信電話株式会社
Priority date
Filing date
Publication date
Application filed by 日本電信電話株式会社
Priority to PCT/JP2021/025614
Priority to JP2023532949A (JPWO2023281659A1)
Publication of WO2023281659A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/0455 Auto-encoder networks; Encoder-decoder networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/60 Type of objects
    • G06V 20/62 Text, e.g. of license plates, overlay texts or captions on TV images
    • G06V 20/63 Scene text, e.g. street names
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V 30/10 Character recognition

Definitions

  • The present invention relates to a learning device, an estimation device, a learning method, and a program.
  • Scene images obtained by photographing landscapes contain a great deal of text information necessary for understanding the images, such as traffic signs and advertising billboards.
  • Scene character recognition is the task of taking as input an image obtained by extracting a character region from such a scene image, recognizing the characters it contains, and converting them into a character string that can be processed by a machine.
  • In recent years, with the progress of deep learning technology, methods that realize scene character recognition with a single end-to-end model have been proposed.
  • For example, Non-Patent Literature 1 and Non-Patent Literature 2 provide scene character recognition techniques that use a model composed of an encoder and a decoder, as schematically shown in FIG. 1.
  • The encoder is composed of, for example, a function that extracts image features with a convolutional neural network and a function that converts them into sequence-aware features with the Transformer encoder provided in Reference Non-Patent Document 1 (Reference Non-Patent Document 1: A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," Advances in Neural Information Processing Systems (NIPS), pp. 5998-6008, 2017).
  • The decoder is composed of, for example, the Transformer decoder provided in Reference Non-Patent Document 1 and an autoregressive model that uses an embedding layer and an output layer, and outputs character string generation probabilities from the features extracted by the encoder.
  • In training, a large amount of paired image and text data in the target language is used as teacher data, and the parameters of the encoder and decoder can be optimized by, for example, error backpropagation.
  • When training a scene character recognition model such as those realized in Non-Patent Document 1 and Non-Patent Document 2, a large amount of paired image and text data in the target language is required as teacher data. When the target language is one with many speakers and abundant resources, such as English, collecting pair data is comparatively easy because many usable pair datasets are available. However, when the target language is one with few speakers, such as Japanese, there is the problem that few readily usable pair datasets are provided. In particular, training a complex model such as the Transformer of Reference Non-Patent Document 1 requires a large amount of teacher data, and it is difficult to collect a large amount of pair data in a low-resource language.
  • Accordingly, an object of the present invention is to provide a learning device that can learn a highly accurate scene character recognition model even when there is little pair data in the target language.
  • The learning device of the present invention includes a pre-learning unit and a fine-tuning unit.
  • The pre-learning unit acquires an auxiliary language dataset, which is a dataset of paired image and text data in a language that is not the target language (hereinafter, an auxiliary language), and a target language dataset, which is a dataset of paired image and text data in the target language, learns an intermediate language information recognition model, and outputs the parameters of the intermediate language information recognition model as learned parameters.
  • The fine-tuning unit acquires the target language dataset and the learned parameters, sets the learned parameters as initial values of the language information recognition model, and learns the language information recognition model.
  • According to the learning device of the present invention, it is possible to learn a highly accurate scene character recognition model even when there is little pair data in the target language.
  • FIG. 1 is a diagram showing an example of a conventional character recognition model.
  • FIG. 2 is a block diagram showing the functional configuration of the learning device of Example 1.
  • FIG. 3 is a flowchart showing the operation of the learning device of Example 1.
  • FIG. 4 is a block diagram showing the configuration of the estimation device of Example 1.
  • FIG. 5 is a block diagram showing the functional configuration of the learning device of Example 2.
  • FIG. 6 is a flowchart showing the operation of the learning device of Example 2.
  • FIG. 7 is a block diagram showing the configuration of the estimation device of Example 2.
  • FIG. 8 is a block diagram showing the functional configuration of the learning device of Example 3.
  • FIG. 9 is a flowchart showing the operation of the learning device of Example 3.
  • FIG. 10 is a block diagram showing the configuration of the estimation device of Example 3.
  • FIG. 11 is a diagram showing the results of a verification experiment.
  • FIG. 12 is a diagram showing examples of recognition results.
  • FIG. 13 is a diagram showing an example of the functional configuration of a computer.
  • In the following examples, the invention is explained using, as an example, its application to scene character recognition, which takes as input an image obtained by cutting out the part of a scene image in which characters appear and outputs the corresponding character string, as in Non-Patent Document 1 and Non-Patent Document 2.
  • However, the invention is not limited to scene character recognition.
  • The present invention can be applied generally to techniques that output a character string from an image through an arbitrary end-to-end encoder-decoder model.
  • For example, the present invention can be applied to optical character recognition, caption generation for still and moving images, lip reading, and the like. The model handled by the present invention is therefore a model that handles the general task of recognizing linguistic information from images (a linguistic information recognition model).
  • Specific examples of linguistic information recognition models include a character recognition model, such as optical character recognition, that recognizes images in which characters are displayed; a caption generation model that generates text corresponding to the content of still and moving images; a lip-reading model that estimates spoken text from video in which the mouth region has been cropped; and the scene character recognition model described above. In the following examples, the detailed description focuses on the scene character recognition model.
  • In the following examples, in order to realize highly accurate character recognition even when the target language dataset is small, the auxiliary language dataset is utilized; it is used for pre-training rather than simply being mixed into the target language dataset.
  • In particular, the encoder, which is responsible for extracting features from images, is pre-trained on a large-scale dataset combining the target language dataset and the auxiliary language dataset, and the decoder, which is responsible for outputting character strings, is pre-trained on the target language dataset, thereby improving the accuracy of scene character recognition in the target language.
  • As shown in FIG. 2, the learning device 1 of this embodiment includes a pre-learning unit 11 and a fine-tuning unit 12.
  • FIG. 3 shows the operation of the learning device 1 of this embodiment.
  • The operation of the learning device 1 of this embodiment includes a pre-learning process and a fine-tuning process.
  • The pre-learning unit 11 acquires the auxiliary language dataset from the auxiliary language dataset storage unit 91 and the target language dataset from the target language dataset storage unit 92, learns an intermediate scene character recognition model, and outputs the parameters of the intermediate scene character recognition model as learned parameters (S11).
  • The process of "learning an intermediate scene character recognition model and outputting the parameters of the intermediate scene character recognition model as learned parameters" is called the pre-learning process.
  • "Learning" here means training the model so that, for the pair data contained in a dataset, it takes the image as input and outputs the text.
  • The pre-learning unit 11 provides the pairs of images and character strings contained in the dataset obtained by mixing the target language dataset and the auxiliary language dataset as input/output ground-truth pairs for the scene character recognition model, and may optimize the parameters with any learning method, such as error backpropagation.
  • The scene character recognition model is an end-to-end scene character recognition model with an encoder-decoder structure, as shown in, for example, Non-Patent Document 1 and Non-Patent Document 2.
  • The encoder may be composed of, for example, a function that extracts image features with a convolutional neural network and a function that converts them into sequence-aware features with the Transformer encoder provided in Reference Non-Patent Document 1.
  • The decoder may be composed of, for example, the Transformer decoder provided in Reference Non-Patent Document 1 and an autoregressive model that uses an embedding layer and an output layer.
  • The convolutional neural network of the encoder may be trained using, as initial values, parameters pre-trained on an arbitrary task such as object recognition, and other parts, such as the Transformer encoder, may likewise be trained using separately pre-trained parameters as initial values.
  • The target language dataset is a dataset of pair data in the target language for which estimation is desired.
  • For example, it may be a language with few speakers and few data resources, such as Japanese.
  • The auxiliary language dataset, on the other hand, is a dataset of pair data in a language other than the target language, and may be a language with many speakers and many data resources, such as English.
  • However, this is only an example, and the present invention is not limited to the case where the target language dataset is small and the auxiliary language dataset is large.
  • Furthermore, neither the target language nor the auxiliary language needs to be a single language; either or both may be a set of multiple languages.
  • As the vocabulary dictionary output by the scene character recognition model, a dictionary that covers the vocabulary of both the target language and the auxiliary language can be used.
  • The fine-tuning unit 12 acquires the target language dataset from the target language dataset storage unit 92 and the learned parameters of the intermediate scene character recognition model from the pre-learning unit 11, learns the scene character recognition model using the learned parameters of the intermediate scene character recognition model as the initial values of the scene character recognition model, and outputs the scene character recognition model (S12).
  • The output scene character recognition model is stored in the scene character recognition model storage unit 93.
  • The process of "learning the scene character recognition model using the learned parameters of the intermediate scene character recognition model as the initial values of the scene character recognition model and outputting the scene character recognition model" is called the fine-tuning process.
  • The fine-tuning unit 12 provides the pairs of images and character strings contained in the target language dataset as input/output ground-truth pairs for the scene character recognition model, and may optimize the parameters with any learning method, such as error backpropagation. At this time, the vocabulary of the auxiliary language and the related parameters may be deleted from the vocabulary dictionary output by the scene character recognition model before training.
  • As described above, using the auxiliary language dataset to improve character recognition accuracy in the target language makes it possible to pre-train the scene character recognition model on a large-scale dataset in which the pair data of the target language and the pair data of the auxiliary language are mixed. This makes it possible to improve the accuracy of character recognition in the target language even when there is little pair data in the target language.
  • The estimation device 100 of this embodiment takes an image as input, estimates the character string corresponding to the image using the scene character recognition model learned by the learning device 1 described above, and outputs the estimation result.
  • The learning device 2 of this embodiment includes an encoder pre-learning unit 21 and a fine-tuning unit 22.
  • FIG. 6 shows the operation of the learning device 2 of this embodiment.
  • The operation of the learning device 2 of this embodiment includes an encoder pre-learning process and a fine-tuning process.
  • The encoder pre-learning unit 21 acquires the auxiliary language dataset from the auxiliary language dataset storage unit 91 and the target language dataset from the target language dataset storage unit 92, learns an intermediate scene character recognition model composed of an encoder and a decoder, and outputs the parameters of only the encoder portion as the learned parameters of the intermediate encoder (S21).
  • The process of "learning an intermediate scene character recognition model composed of an encoder and a decoder, extracting the parameters of only the encoder portion, and outputting them as the learned parameters of the intermediate encoder" is called the encoder pre-learning process.
  • The encoder pre-learning unit 21 can use the same method as the pre-learning process in the first embodiment: the pairs of images and character strings contained in the dataset obtained by mixing the target language dataset and the auxiliary language dataset are given as input/output ground-truth pairs for the scene character recognition model, and the parameters are optimized with any learning method, such as error backpropagation.
  • In this case, a dictionary that covers the vocabulary of both the target language and the auxiliary language may be used as the vocabulary dictionary output by the scene character recognition model.
  • The intermediate encoder does not necessarily have to be the entire encoder shown in FIG. 1; for example, it may be only the part that extracts image features with the convolutional neural network, or it may extend from that part up to an intermediate layer of the multi-layer Transformer encoder.
  • The fine-tuning unit 22 acquires the target language dataset from the target language dataset storage unit 92 and the intermediate encoder from the encoder pre-learning unit 21, gives the learned parameters of the intermediate encoder as the initial values of the encoder of the scene character recognition model, learns the scene character recognition model on the target language dataset, and outputs the scene character recognition model (S22). The output scene character recognition model is stored in the scene character recognition model storage unit 93.
  • The process of "giving the learned parameters of the intermediate encoder as the initial values of the encoder of the scene character recognition model, learning the scene character recognition model on the target language dataset, and outputting the scene character recognition model" is called the fine-tuning process.
  • The fine-tuning unit 22 can use the same method as the fine-tuning process in the first embodiment: the pairs of images and character strings contained in the target language dataset are given as input/output ground-truth pairs for the scene character recognition model, and the parameters are optimized with any learning method, such as error backpropagation.
  • The initial values of the parts not included in the intermediate encoder, including the decoder, may be given by any initialization method, such as random initialization.
  • As the vocabulary dictionary output by the scene character recognition model, a dictionary that covers the vocabulary of the target language may be used.
  • Because pre-training the encoder on a large-scale dataset that mixes the auxiliary language dataset and the target language dataset increases the number of images from which it learns, the encoder, which has the function of extracting image features, can be trained on data with diverse character shapes and background images, and its robustness to input images can therefore be improved.
  • The estimation device 200 of this embodiment takes an image as input, estimates the character string corresponding to the image using the scene character recognition model learned by the learning device 2 described above, and outputs the estimation result.
  • The learning device 3 of this embodiment includes an encoder pre-learning unit 31, a decoder pre-learning unit 32, and a fine-tuning unit 33.
  • FIG. 9 shows the operation of the learning device 3 of this embodiment.
  • The operation of the learning device 3 of this embodiment includes an encoder pre-learning process, a decoder pre-learning process, and a fine-tuning process.
  • The encoder pre-learning unit 31 acquires the auxiliary language dataset from the auxiliary language dataset storage unit 91 and the target language dataset from the target language dataset storage unit 92, learns an intermediate scene character recognition model composed of an encoder and a decoder, and cuts out and outputs only the encoder portion as the intermediate encoder (S31).
  • The process of "learning an intermediate scene character recognition model composed of an encoder and a decoder, cutting out only the encoder portion, and outputting it as the intermediate encoder" is called the encoder pre-learning process.
  • The decoder pre-learning unit 32 receives the target language dataset as input, learns a second intermediate scene character recognition model composed of an encoder and a decoder, and outputs the parameters of only the decoder portion of the second intermediate scene character recognition model as the learned parameters of the intermediate decoder (S32). The process of "learning a second intermediate scene character recognition model composed of an encoder and a decoder, extracting the parameters of only the decoder portion of the second intermediate scene character recognition model, and outputting them as the learned parameters of the intermediate decoder" is called the decoder pre-learning process.
  • The decoder pre-learning unit 32 gives the pairs of images and character strings contained in the target language dataset as input/output ground-truth pairs for the scene character recognition model, and may optimize the parameters with any learning method, such as error backpropagation.
  • As the vocabulary dictionary output by the scene character recognition model, a dictionary that covers the vocabulary of the target language may be used.
  • The intermediate decoder does not necessarily have to be the entire decoder shown in FIG. 1. Furthermore, only the embedding layer and the output layer may be used as the intermediate decoder.
  • The fine-tuning unit 33 receives the target language dataset, the intermediate encoder, and the intermediate decoder as inputs, gives the learned parameters of the intermediate encoder and the intermediate decoder as the initial values of the encoder and the decoder of the scene character recognition model, respectively, learns the scene character recognition model, and outputs the scene character recognition model (S33). The process of "giving the learned parameters of the intermediate encoder and the intermediate decoder as the initial values of the encoder and the decoder of the scene character recognition model, respectively, learning the scene character recognition model on the target language dataset, and outputting the scene character recognition model" is called the fine-tuning process. The output scene character recognition model is stored in the scene character recognition model storage unit 93.
  • The fine-tuning unit 33 can use the same method as the fine-tuning process in the first embodiment: the pairs of images and character strings contained in the target language dataset are given as input/output ground-truth pairs for the scene character recognition model, and the parameters are optimized with any learning method, such as error backpropagation.
  • The initial values of the portions not included in the intermediate encoder or the intermediate decoder may be given by any initialization method, such as random initialization.
  • As the vocabulary dictionary output by the scene character recognition model, a dictionary that covers the vocabulary of the target language may be used.
  • Pre-training the decoder on the target language dataset allows the decoder to specialize in the target language.
  • Combining decoder pre-training with encoder pre-training can effectively improve the accuracy of scene character recognition in the target language.
  • The estimation device 300 of this embodiment takes an image as input, estimates the character string corresponding to the image using the scene character recognition model learned by the learning device 3, and outputs the estimation result.
  • In this way, the encoder is pre-trained using not only the target language dataset but also the auxiliary language dataset.
  • As a result, the encoder, which is responsible for extracting image features, can be trained on data with diverse character shapes and background images.
  • Meanwhile, the decoder is pre-trained using the target language dataset. As a result, the decoder is specialized for the target language, and combining this with pre-training of the encoder makes it possible to effectively improve the accuracy of scene character recognition in the target language.
  • Verification experiments were conducted on the scene character recognition model with the structure described in Non-Patent Document 1 and Non-Patent Document 2.
  • The target language was Japanese, and about 70,000 pairs of data were prepared as training data.
  • English was used as the auxiliary language, and about 8,000,000 pairs of data were prepared as training data.
  • The comparison covered a baseline trained from scratch on the target language dataset, without using the auxiliary language dataset or pre-training, and models trained with the learning methods of Examples 1, 2, and 3.
  • Even with the learning method of Example 1, in which the auxiliary language dataset is simply mixed in for training, the accuracy rate improves over the baseline. Furthermore, as in Example 3, pre-training the encoder with the auxiliary language dataset and the target language dataset and pre-training the decoder with the target language dataset further improves the accuracy rate.
  • FIG. 12 shows examples of recognition results for the scene character recognition model with the structure described in Non-Patent Document 2.
  • As shown in (1) to (3) of FIG. 12, performing the pre-learning process according to the first embodiment prevents erroneous recognition.
  • In addition, the decoder pre-learning process according to the third embodiment suppresses recognition results consisting of word sequences that are not valid in the target language.
  • The apparatus of the present invention comprises, for example, a single hardware entity having an input unit to which a keyboard or the like can be connected, an output unit to which a liquid crystal display or the like can be connected, a communication unit to which a communication device (for example, a communication cable) capable of communicating with the outside of the hardware entity can be connected, a CPU (Central Processing Unit, which may include cache memory, registers, and the like), memory such as RAM and ROM, an external storage device such as a hard disk, and a bus that connects the input unit, output unit, communication unit, CPU, RAM, ROM, and external storage device so that data can be exchanged among them.
  • The hardware entity may also be provided with a device (drive) that can read from and write to a recording medium such as a CD-ROM.
  • A physical entity having such hardware resources includes a general-purpose computer.
  • The external storage device of the hardware entity stores the programs necessary for realizing the functions described above and the data required for processing by these programs (the programs need not be stored in the external storage device; they may, for example, be stored in a ROM). Data obtained by the processing of these programs are stored as appropriate in the RAM, the external storage device, or the like.
  • In the hardware entity, each program stored in the external storage device (or ROM, etc.) and the data necessary for processing each program are read into memory as needed, and are interpreted, executed, and processed by the CPU as appropriate.
  • As a result, the CPU realizes predetermined functions (the components referred to above as units, means, and so on).
  • A program describing this processing can be recorded on a computer-readable recording medium.
  • Any computer-readable recording medium may be used, for example, a magnetic recording device, an optical disc, a magneto-optical recording medium, a semiconductor memory, or the like.
  • Specifically, hard disk devices, flexible disks, magnetic tapes, and the like can be used as magnetic recording devices; DVD (Digital Versatile Disc), DVD-RAM (Random Access Memory), CD-ROM (Compact Disc Read Only Memory), CD-R (Recordable)/RW (ReWritable), and the like as optical discs; MO (Magneto-Optical disc) and the like as magneto-optical recording media; and EEP-ROM (Electrically Erasable and Programmable-Read Only Memory) and the like as semiconductor memory.
  • Distribution of this program is carried out, for example, by selling, assigning, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded.
  • Alternatively, the program may be distributed by storing it in a storage device of a server computer and transferring it from the server computer to other computers via a network.
  • A computer that executes such a program, for example, first stores the program recorded on a portable recording medium, or the program transferred from a server computer, in its own storage device. Then, when executing processing, the computer reads the program stored in its own storage device and executes processing in accordance with the read program. As another form of execution, the computer may read the program directly from the portable recording medium and execute processing in accordance with it, or, each time a program is transferred to the computer from the server computer, it may sequentially execute processing in accordance with the received program. The processing described above may also be executed by a so-called ASP (Application Service Provider) type service, in which the program is not transferred from the server computer to the computer and the processing functions are realized only through execution instructions and result acquisition. Note that the program in this embodiment includes information that is used for processing by a computer and is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the processing of the computer).
  • In this embodiment, the hardware entity is configured by executing a predetermined program on a computer, but at least part of the processing may be implemented in hardware.
  • (Appendix 1) A learning device that acquires an auxiliary language dataset, which is a dataset of paired image and text data in a language that is not the target language (hereinafter, an auxiliary language), and a target language dataset, which is a dataset of paired image and text data in the target language, learns an intermediate language information recognition model, and outputs parameters of the intermediate language information recognition model as learned parameters; and that acquires the target language dataset and the learned parameters, sets the learned parameters as initial values of a language information recognition model, and learns the language information recognition model.
  • (Appendix 2) A non-transitory storage medium storing a program executable by a computer to perform a learning process, the learning process comprising: acquiring an auxiliary language dataset, which is a dataset of paired image and text data in a language that is not the target language (hereinafter, an auxiliary language), and a target language dataset, which is a dataset of paired image and text data in the target language; learning an intermediate language information recognition model; outputting parameters of the intermediate language information recognition model as learned parameters; acquiring the target language dataset and the learned parameters; setting the learned parameters as initial values of a language information recognition model; and learning the language information recognition model.
  • (Appendix 3) The learning device according to Appendix 1, wherein the intermediate language information recognition model and the language information recognition model are each composed of an encoder and a decoder, parameters of only the encoder part of the intermediate language information recognition model are output as learned parameters of an intermediate encoder, and the learned parameters of the intermediate encoder are given as initial values of the encoder of the language information recognition model.
  • (Appendix 4) The non-transitory storage medium according to Appendix 2, wherein the intermediate language information recognition model and the language information recognition model are each composed of an encoder and a decoder, parameters of only the encoder part of the intermediate language information recognition model are output as learned parameters of an intermediate encoder, and the learned parameters of the intermediate encoder are given as initial values of the encoder of the language information recognition model.
  • (Appendix 5) The learning device according to Appendix 3, wherein parameters of only the decoder portion of the intermediate language information recognition model are output as learned parameters of an intermediate decoder, and the learned parameters of the intermediate decoder are given as initial values of the decoder of the language information recognition model.
  • (Appendix 6) The non-transitory storage medium according to Appendix 4, wherein parameters of only the decoder portion of the intermediate language information recognition model are output as learned parameters of an intermediate decoder, and the learned parameters of the intermediate decoder are given as initial values of the decoder of the language information recognition model.
  • (Appendix 7) An estimation device comprising a memory and at least one processor connected to the memory, wherein the processor takes an image as input and estimates a character string corresponding to the image using a language information recognition model learned by the learning device according to any one of Appendices 1, 3, and 5.
  • (Appendix 8) A non-transitory storage medium storing a program for estimating a character string corresponding to an input image using a language information recognition model learned by the learning process of the non-transitory storage medium according to any one of Appendices 2, 4, and 6.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Multimedia (AREA)
  • Biomedical Technology (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Character Discrimination (AREA)
  • Image Analysis (AREA)

Abstract

A learning device according to the present invention includes a prior learning unit and a fine tuning unit. The prior learning unit acquires an auxiliary language data set, which is a data set of paired data of an image and text in a language which is not a target language (hereinafter, the auxiliary language), and a target language data set, which is a data set of paired data of an image and text in the target language; trains an intermediate language information recognition model; and outputs a parameter of the intermediate language information recognition model as a trained parameter. The fine tuning unit acquires the target language data set and the trained parameter, and trains a language information recognition model using the trained parameter as an initial value for the language information recognition model.

Description

LEARNING DEVICE, ESTIMATION DEVICE, LEARNING METHOD, AND PROGRAM
The present invention relates to a learning device, an estimation device, a learning method, and a program.
Scene images obtained by photographing landscapes contain a great deal of text information necessary for understanding the images, such as traffic signs and advertising billboards. Scene character recognition is the task of taking as input an image in which a character region has been cut out of such a scene image, recognizing the characters it contains, and converting them into a character string that can be processed by a machine. In recent years, with the progress of deep learning technology, methods that realize scene character recognition with a single end-to-end model have been proposed.
For example, Non-Patent Literature 1 and Non-Patent Literature 2 provide scene character recognition techniques that use a model composed of an encoder and a decoder, as schematically shown in FIG. 1. Here, the encoder is composed of, for example, a function that extracts image features with a convolutional neural network and a function that converts them into sequence-aware features with the Transformer encoder provided in Reference Non-Patent Document 1 (Reference Non-Patent Document 1: A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," Advances in Neural Information Processing Systems (NIPS), pp. 5998-6008, 2017). The decoder is composed of, for example, the Transformer decoder provided in Reference Non-Patent Document 1 and an autoregressive model that uses an embedding layer and an output layer, and outputs character string generation probabilities from the features extracted by the encoder. In training, a large amount of paired image and text data in the target language is used as teacher data, and the parameters of the encoder and decoder can be optimized by, for example, error backpropagation.
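To make the structure above concrete, here is a minimal PyTorch-style sketch of such an encoder-decoder model (a convolutional feature extractor, a Transformer encoder, and an autoregressive Transformer decoder with an embedding layer and an output layer). It is an illustration only, not the exact networks of Non-Patent Literature 1 or 2; the class name, layer sizes, and hyperparameters are all assumptions.

```python
import torch
import torch.nn as nn

class SceneTextRecognizer(nn.Module):
    """Illustrative end-to-end model: CNN features -> Transformer encoder -> autoregressive Transformer decoder."""

    def __init__(self, vocab_size, d_model=256, nhead=8, num_layers=4):
        super().__init__()
        # Encoder, part 1: convolutional feature extractor (image -> feature map).
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, d_model, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Encoder, part 2: Transformer encoder turning the feature map into sequence-aware features.
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True), num_layers)
        # Decoder: embedding layer + Transformer decoder + output layer.
        self.embed = nn.Embedding(vocab_size, d_model)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead, batch_first=True), num_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def encode(self, images):                        # images: (B, 3, H, W)
        f = self.cnn(images)                         # (B, d_model, H', W')
        return self.encoder(f.flatten(2).transpose(1, 2))   # (B, H'*W', d_model)

    def forward(self, images, tokens):               # tokens: (B, T) previously generated character ids
        memory = self.encode(images)
        t = tokens.size(1)
        causal = torch.triu(torch.full((t, t), float("-inf"), device=tokens.device), diagonal=1)
        h = self.decoder(self.embed(tokens), memory, tgt_mask=causal)
        return self.out(h)                           # (B, T, vocab_size) logits over the next character
```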
When training a scene character recognition model such as those realized in Non-Patent Document 1 and Non-Patent Document 2, a large amount of paired image and text data in the target language is required as teacher data. When the target language is one with many speakers and abundant resources, such as English, collecting pair data is comparatively easy because many usable pair datasets are available. However, when the target language is one with few speakers, such as Japanese, there is the problem that few readily usable pair datasets are provided. In particular, training a complex model such as the Transformer of Reference Non-Patent Document 1 requires a large amount of teacher data, and it is difficult to collect a large amount of pair data in a low-resource language.
Accordingly, an object of the present invention is to provide a learning device that can learn a highly accurate scene character recognition model even when there is little pair data in the target language.
The learning device of the present invention includes a pre-learning unit and a fine-tuning unit.
The pre-learning unit acquires an auxiliary language dataset, which is a dataset of paired image and text data in a language that is not the target language (hereinafter, an auxiliary language), and a target language dataset, which is a dataset of paired image and text data in the target language, learns an intermediate language information recognition model, and outputs the parameters of the intermediate language information recognition model as learned parameters. The fine-tuning unit acquires the target language dataset and the learned parameters, sets the learned parameters as initial values of a language information recognition model, and learns the language information recognition model.
According to the learning device of the present invention, a highly accurate scene character recognition model can be learned even when there is little pair data in the target language.
FIG. 1 is a diagram showing an example of a conventional character recognition model. FIG. 2 is a block diagram showing the functional configuration of the learning device of Example 1. FIG. 3 is a flowchart showing the operation of the learning device of Example 1. FIG. 4 is a block diagram showing the configuration of the estimation device of Example 1. FIG. 5 is a block diagram showing the functional configuration of the learning device of Example 2. FIG. 6 is a flowchart showing the operation of the learning device of Example 2. FIG. 7 is a block diagram showing the configuration of the estimation device of Example 2. FIG. 8 is a block diagram showing the functional configuration of the learning device of Example 3. FIG. 9 is a flowchart showing the operation of the learning device of Example 3. FIG. 10 is a block diagram showing the configuration of the estimation device of Example 3. FIG. 11 is a diagram showing the results of a verification experiment. FIG. 12 is a diagram showing examples of recognition results. FIG. 13 is a diagram showing an example of the functional configuration of a computer.
In the following examples, the invention is explained using, as an example, its application to scene character recognition, which takes as input an image obtained by cutting out the part of a scene image in which characters appear and outputs the corresponding character string, as in Non-Patent Document 1 and Non-Patent Document 2; however, the invention is not limited to scene character recognition. In other words, the present invention can be applied generally to techniques that output a character string from an image through an arbitrary end-to-end encoder-decoder model. For example, the present invention is also applicable to optical character recognition, caption generation for still and moving images, lip reading, and the like. The model handled by the present invention is therefore a model that handles the general task of recognizing linguistic information from images (a linguistic information recognition model). Specific examples of linguistic information recognition models include a character recognition model, such as optical character recognition, that recognizes images in which characters are displayed; a caption generation model that generates text corresponding to the content of still and moving images; a lip-reading model that estimates spoken text from video in which the mouth region has been cropped; and the scene character recognition model described above. In the following examples, the detailed description focuses on the scene character recognition model.
In the following examples, in order to realize highly accurate character recognition even when the dataset of pair data in the target language (hereinafter, the target language dataset) is small, a dataset of pair data in a language other than the target language (hereinafter, an auxiliary language; its dataset is the auxiliary language dataset) is utilized. Here, the auxiliary language dataset is used for pre-training rather than simply being mixed into the target language dataset. In particular, the encoder, which is responsible for extracting features from images, is pre-trained on a large-scale dataset combining the target language dataset and the auxiliary language dataset, and the decoder, which is responsible for outputting character strings, is pre-trained on the target language dataset, thereby improving the accuracy of scene character recognition in the target language.
The functional configuration of the learning device of the first embodiment is described below with reference to FIG. 2. As shown in the figure, the learning device 1 of this embodiment includes a pre-learning unit 11 and a fine-tuning unit 12. The auxiliary language dataset storage unit 91, the target language dataset storage unit 92, and the scene character recognition model storage unit 93 shown in the figure may be located inside the learning device 1 or inside another device. FIG. 3 shows the operation of the learning device 1 of this embodiment. The operation of the learning device 1 of this embodiment includes a pre-learning process and a fine-tuning process.
<Pre-learning process>
The pre-learning unit 11 acquires the auxiliary language dataset from the auxiliary language dataset storage unit 91 and the target language dataset from the target language dataset storage unit 92, learns an intermediate scene character recognition model, and outputs the parameters of the intermediate scene character recognition model as learned parameters (S11). The process of "learning an intermediate scene character recognition model and outputting the parameters of the intermediate scene character recognition model as learned parameters" is called the pre-learning process. "Learning" here means training the model so that, for the pair data contained in a dataset, it takes the image as input and outputs the text.
The pre-learning unit 11 provides the pairs of images and character strings contained in the dataset obtained by mixing the target language dataset and the auxiliary language dataset as input/output ground-truth pairs for the scene character recognition model, and may optimize the parameters with any learning method, such as error backpropagation.
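A minimal sketch of this pre-training step, assuming the illustrative SceneTextRecognizer above: the two pair datasets are simply concatenated and the model is optimized with teacher forcing, cross-entropy loss, and backpropagation. The dataset objects, the combined dictionary, PAD_ID, and the hyperparameters are assumptions, and batching details such as padding and collation are omitted.

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader

# target_ds and auxiliary_ds are assumed to yield (image_tensor, token_ids) pairs
# tokenized with the combined target + auxiliary dictionary.
loader = DataLoader(ConcatDataset([target_ds, auxiliary_ds]), batch_size=64, shuffle=True)

model = SceneTextRecognizer(vocab_size=len(combined_vocab))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = torch.nn.CrossEntropyLoss(ignore_index=PAD_ID)   # PAD_ID: assumed padding token id

for epoch in range(num_epochs):
    for images, tokens in loader:
        logits = model(images, tokens[:, :-1])       # teacher forcing: predict tokens[1:] from tokens[:-1]
        loss = criterion(logits.reshape(-1, logits.size(-1)), tokens[:, 1:].reshape(-1))
        optimizer.zero_grad()
        loss.backward()                              # error backpropagation
        optimizer.step()

learned_parameters = model.state_dict()              # handed to the fine-tuning unit as initial values
```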
The scene character recognition model is an end-to-end scene character recognition model with an encoder-decoder structure, as shown in, for example, Non-Patent Document 1 and Non-Patent Document 2. Here, the encoder may be composed of, for example, a function that extracts image features with a convolutional neural network and a function that converts them into sequence-aware features with the Transformer encoder provided in Reference Non-Patent Document 1. The decoder may be composed of, for example, the Transformer decoder provided in Reference Non-Patent Document 1 and an autoregressive model that uses an embedding layer and an output layer. The convolutional neural network of the encoder may be trained using, as initial values, parameters pre-trained on an arbitrary task such as object recognition, and the other parts, such as the Transformer encoder, may likewise be trained using separately pre-trained parameters as initial values.
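As one concrete, purely illustrative way to follow the suggestion of initializing the convolutional part from parameters pre-trained on another task such as object recognition, the sketch below plugs a torchvision ResNet-18 backbone with ImageNet weights into the encoder of the model above; the choice of backbone and the 1x1 projection are assumptions, not something specified by this publication.

```python
import torch.nn as nn
from torchvision import models

def build_pretrained_cnn(d_model=256):
    """Encoder CNN initialized from an object-recognition backbone (ImageNet-pretrained ResNet-18)."""
    backbone = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
    layers = list(backbone.children())[:-2]                  # drop the average-pooling and classification head
    return nn.Sequential(*layers, nn.Conv2d(512, d_model, kernel_size=1))

# model.cnn = build_pretrained_cnn()   # replaces the randomly initialized CNN of the earlier sketch
```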
The target language dataset is a dataset of pair data in the target language for which estimation is desired; it may be, for example, a language with few speakers and few data resources, such as Japanese. The auxiliary language dataset, on the other hand, is a dataset of pair data in a language other than the target language, and may be, for example, a language with many speakers and many data resources, such as English. However, this is only an example, and the present invention is not limited to the case where the target language dataset is small and the auxiliary language dataset is large. Furthermore, neither the target language nor the auxiliary language needs to be a single language; either or both may be a set of multiple languages. As the vocabulary dictionary output by the scene character recognition model, a dictionary that covers the vocabulary of both the target language and the auxiliary language can be used.
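A simple way to obtain such a vocabulary dictionary is to take the union of the characters appearing in the transcripts of the two datasets, as in the character-level sketch below; the special tokens and variable names are assumptions.

```python
def build_vocab(target_texts, auxiliary_texts=()):
    """Character-level dictionary; pass auxiliary_texts to get the combined target+auxiliary dictionary."""
    chars = set()
    for text in list(target_texts) + list(auxiliary_texts):
        chars.update(text)
    tokens = ["<pad>", "<sos>", "<eos>"] + sorted(chars)     # reserved ids first, then the character inventory
    return {tok: i for i, tok in enumerate(tokens)}

combined_vocab = build_vocab(japanese_texts, english_texts)  # dictionary used during pre-training
target_vocab = build_vocab(japanese_texts)                   # target-only dictionary usable for fine-tuning
```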
<Fine-tuning process>
The fine-tuning unit 12 acquires the target language dataset from the target language dataset storage unit 92 and the learned parameters of the intermediate scene character recognition model from the pre-learning unit 11, learns the scene character recognition model using the learned parameters of the intermediate scene character recognition model as the initial values of the scene character recognition model, and outputs the scene character recognition model (S12). The output scene character recognition model is stored in the scene character recognition model storage unit 93. The process of "learning the scene character recognition model using the learned parameters of the intermediate scene character recognition model as the initial values of the scene character recognition model and outputting the scene character recognition model" is called the fine-tuning process.
The fine-tuning unit 12 provides the pairs of images and character strings contained in the target language dataset as input/output ground-truth pairs for the scene character recognition model, and may optimize the parameters with any learning method, such as error backpropagation. At this time, the vocabulary of the auxiliary language and the related parameters may be deleted from the vocabulary dictionary output by the scene character recognition model before training.
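The sketch below illustrates this optional vocabulary pruning: the learned parameters are loaded as initial values, and the embedding and output layers are then shrunk so that only the rows corresponding to the target-language dictionary remain, before the same training loop is run on the target language dataset alone. It builds on the assumed model and dictionaries from the earlier sketches.

```python
import torch

def shrink_vocab(model, combined_vocab, target_vocab):
    """Delete auxiliary-language rows: keep only embedding/output parameters for target-language tokens."""
    keep = torch.tensor([combined_vocab[tok] for tok in target_vocab])   # old ids, ordered by new id
    with torch.no_grad():
        new_embed = torch.nn.Embedding(len(target_vocab), model.embed.embedding_dim)
        new_embed.weight.copy_(model.embed.weight[keep])
        new_out = torch.nn.Linear(model.out.in_features, len(target_vocab))
        new_out.weight.copy_(model.out.weight[keep])
        new_out.bias.copy_(model.out.bias[keep])
    model.embed, model.out = new_embed, new_out
    return model

model.load_state_dict(learned_parameters)             # initial values = pre-trained parameters
model = shrink_vocab(model, combined_vocab, target_vocab)
# ...then repeat the earlier training loop, this time iterating over the target language dataset only.
```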
As described above, by using the auxiliary language dataset to improve character recognition accuracy in the target language, the scene character recognition model can be pre-trained on a large-scale dataset in which the pair data of the target language and the pair data of the auxiliary language are mixed. This makes it possible to improve the accuracy of character recognition in the target language even when there is little pair data in the target language.
<Estimation device 100>
As shown in FIG. 4, the estimation device 100 of this embodiment takes an image as input, estimates the character string corresponding to the image using the scene character recognition model learned by the learning device 1 described above, and outputs the estimation result.
The functional configuration of the learning device of the second embodiment is described below with reference to FIG. 5. As shown in the figure, the learning device 2 of this embodiment includes an encoder pre-learning unit 21 and a fine-tuning unit 22. The auxiliary language dataset storage unit 91, the target language dataset storage unit 92, and the scene character recognition model storage unit 93 shown in the figure may be located inside the learning device 2 or inside another device. FIG. 6 shows the operation of the learning device 2 of this embodiment. The operation of the learning device 2 of this embodiment includes an encoder pre-learning process and a fine-tuning process.
<Encoder pre-learning process>
The encoder pre-learning unit 21 acquires the auxiliary language dataset from the auxiliary language dataset storage unit 91 and the target language dataset from the target language dataset storage unit 92, learns an intermediate scene character recognition model composed of an encoder and a decoder, and outputs the parameters of only the encoder portion as the learned parameters of the intermediate encoder (S21). The process of "learning an intermediate scene character recognition model composed of an encoder and a decoder, extracting the parameters of only the encoder portion, and outputting them as the learned parameters of the intermediate encoder" is called the encoder pre-learning process.
The encoder pre-learning unit 21 can use the same method as the pre-learning process in the first embodiment: the pairs of images and character strings contained in the dataset obtained by mixing the target language dataset and the auxiliary language dataset are given as input/output ground-truth pairs for the scene character recognition model, and the parameters are optimized with any learning method, such as error backpropagation. In this case, a dictionary covering the vocabulary of both the target language and the auxiliary language may be used as the vocabulary dictionary output by the scene character recognition model. Note that the intermediate encoder does not necessarily have to be the entire encoder shown in FIG. 1; for example, it may be only the part that extracts image features with the convolutional neural network, or it may extend from that part up to an intermediate layer of the multi-layer Transformer encoder.
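One straightforward way to "cut out" only the encoder portion as the intermediate encoder is to filter the trained model's state dict by submodule name, as sketched below; which prefixes count as the encoder (only the CNN, or the CNN plus some encoder layers) is a design choice, and the prefixes shown follow the illustrative model above rather than the publication.

```python
def extract_submodule_params(state_dict, prefixes=("cnn.", "encoder.")):
    """Keep only the parameters whose names start with the given submodule prefixes."""
    return {name: tensor for name, tensor in state_dict.items() if name.startswith(prefixes)}

# Example 2: after training on the mixed dataset, keep only the encoder part as the intermediate encoder.
intermediate_encoder = extract_submodule_params(model.state_dict())

# The same filtering yields an "intermediate decoder" for Example 3,
# e.g. prefixes=("embed.", "decoder.", "out.") from a model trained on the target language dataset.
```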
<Fine-tuning process>
 The fine-tuning unit 22 acquires the target-language dataset from the target-language dataset storage unit 92 and the intermediate encoder from the encoder pre-training unit 21, gives the learned parameters of the intermediate encoder as the initial values of the encoder of the scene character recognition model, trains the scene character recognition model on the target-language dataset, and outputs the scene character recognition model (S22). The output scene character recognition model is stored in the scene character recognition model storage unit 93. The process of "giving the learned parameters of the intermediate encoder as the initial values of the encoder of the scene character recognition model, training the scene character recognition model on the target-language dataset, and outputting the scene character recognition model" is called the fine-tuning process.
 The fine-tuning unit 22 can use the same method as the fine-tuning process of Embodiment 1: the pairs of images and character strings contained in the target-language dataset are given as the correct input/output pairs of the scene character recognition model, and the parameters are optimized by an arbitrary learning method such as error backpropagation. Here, the initial values of the parts not included in the intermediate encoder, including the decoder, may be given by an arbitrary initialization method such as random initialization. The vocabulary dictionary output by the scene character recognition model may be a dictionary that covers the vocabulary of the target language.
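 A corresponding sketch of this fine-tuning step (S22), under the same assumptions as the earlier examples: the saved intermediate-encoder parameters are given as the initial values of the encoder, the decoder and all other parts keep their random initialization, and training then uses only the target-language pair data with a target-language-only vocabulary.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

# Final scene character recognition model, built with the target-language vocabulary.
model = TinyRecognizer(vocab_size=len(finetune_vocab))      # stand-in from the earlier sketches

# Initial values: encoder from pre-training; everything else stays randomly initialized.
encoder_params = torch.load("intermediate_encoder.pt")
missing, unexpected = model.load_state_dict(encoder_params, strict=False)
assert not unexpected                                        # only encoder.* keys are loaded

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss(ignore_index=finetune_vocab["<pad>"])
loader = DataLoader(target, batch_size=64, shuffle=True, collate_fn=collate)

for epoch in range(10):
    for images, token_ids in loader:                         # target-language pairs only
        logits = model(images, token_ids[:, :-1])
        loss = criterion(logits.reshape(-1, logits.size(-1)),
                         token_ids[:, 1:].reshape(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

torch.save(model.state_dict(), "scene_text_model.pt")         # model used by the estimation device
```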
 As described above, pre-training the encoder on a large-scale dataset that mixes the auxiliary-language dataset and the target-language dataset increases the number of images the encoder learns from. Because the encoder, which is responsible for extracting image features, can thus be trained on data with diverse character shapes and background images, its robustness to input images is improved.
<Estimation device 200>
 As shown in FIG. 7, the estimation device 200 of this embodiment takes an image as input, estimates the character string corresponding to the image using the scene character recognition model trained by the learning device 2 described above, and outputs the estimation result.
 The functional configuration of the learning device of Embodiment 3 is described below with reference to FIG. 8. As shown in the figure, the learning device 3 of this embodiment includes an encoder pre-training unit 31, a decoder pre-training unit 32, and a fine-tuning unit 33. The auxiliary-language dataset storage unit 91, the target-language dataset storage unit 92, and the scene character recognition model storage unit 93 shown in the figure may be located inside the learning device 3 or inside another device. FIG. 9 shows the operation of the learning device 3 of this embodiment, which consists of an encoder pre-training process, a decoder pre-training process, and a fine-tuning process.
<Encoder pre-training process>
 As in Embodiment 2, the encoder pre-training unit 31 acquires the auxiliary-language dataset from the auxiliary-language dataset storage unit 91 and the target-language dataset from the target-language dataset storage unit 92, trains an intermediate scene character recognition model composed of an encoder and a decoder, cuts out the encoder part only, and outputs it as the intermediate encoder (S31). The process of "training an intermediate scene character recognition model composed of an encoder and a decoder, cutting out the encoder part only, and outputting it as the intermediate encoder" is called the encoder pre-training process.
<Decoder pre-training process>
 The decoder pre-training unit 32 takes the target-language dataset as input, trains a second intermediate scene character recognition model composed of an encoder and a decoder, extracts the parameters of the decoder part only of the second intermediate scene character recognition model, and outputs them as the learned parameters of the intermediate decoder (S32). The process of "training a second intermediate scene character recognition model composed of an encoder and a decoder, extracting the parameters of the decoder part only of the second intermediate scene character recognition model, and outputting them as the learned parameters of the intermediate decoder" is called the decoder pre-training process.
 The decoder pre-training unit 32 gives the pairs of images and character strings contained in the target-language dataset as the correct input/output pairs of the scene character recognition model, and the parameters may be optimized by an arbitrary learning method such as error backpropagation. Here, the vocabulary dictionary output by the scene character recognition model may be a dictionary that covers the vocabulary of the target language. Note that the intermediate decoder does not necessarily have to be the entire decoder shown in FIG. 1; for example, it may be only the embedding layer, or it may extend from the embedding layer up to an intermediate layer of the multi-layer Transformer decoder. Furthermore, only the embedding layer and the output layer may be used as the intermediate decoder.
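 The decoder pre-training step (S32) can be sketched in the same style: a second intermediate model is trained from scratch on the target-language pair data only, and the decoder-side parameters (here the embedding layer, recurrent core, and output layer of the stand-in) are saved as the learned parameters of the intermediate decoder. TinyRecognizer, the target dataset, its vocabulary, and collate are again the assumed objects from the earlier sketches.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

# Second intermediate model, trained on target-language data with the target vocabulary.
second = TinyRecognizer(vocab_size=len(finetune_vocab))
optimizer = torch.optim.Adam(second.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss(ignore_index=finetune_vocab["<pad>"])
loader = DataLoader(target, batch_size=64, shuffle=True, collate_fn=collate)

for epoch in range(10):
    for images, token_ids in loader:
        logits = second(images, token_ids[:, :-1])
        loss = criterion(logits.reshape(-1, logits.size(-1)),
                         token_ids[:, 1:].reshape(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# Keep only the decoder side. Depending on the chosen scope of the intermediate decoder,
# this could be restricted further, e.g. to the embedding and output layers only.
decoder_prefixes = ("embed.", "decoder.", "out.")
decoder_params = {k: v for k, v in second.state_dict().items()
                  if k.startswith(decoder_prefixes)}
torch.save(decoder_params, "intermediate_decoder.pt")
```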
<Fine-tuning process>
 The fine-tuning unit 33 takes the target-language dataset, the intermediate encoder, and the intermediate decoder as input, gives the learned parameters of the intermediate encoder and the intermediate decoder as the initial values of the encoder and the decoder of the scene character recognition model, respectively, trains the scene character recognition model on the target-language dataset, and outputs the scene character recognition model (S33). The process of "giving the learned parameters of the intermediate encoder and the intermediate decoder as the initial values of the encoder and the decoder of the scene character recognition model, respectively, training the scene character recognition model on the target-language dataset, and outputting the scene character recognition model" is called the fine-tuning process. The output scene character recognition model is stored in the scene character recognition model storage unit 93.
 The fine-tuning unit 33 can use the same method as the fine-tuning process of Embodiment 1: the pairs of images and character strings contained in the target-language dataset are given as the correct input/output pairs of the scene character recognition model, and the parameters are optimized by an arbitrary learning method such as error backpropagation. Here, the initial values of the parts not included in the intermediate encoder or the intermediate decoder may be given by an arbitrary initialization method such as random initialization. The vocabulary dictionary output by the scene character recognition model may be a dictionary that covers the vocabulary of the target language.
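 Finally, a sketch of this fine-tuning step (S33): the encoder parameters from S31 and the decoder parameters from S32 are merged into one initialization dictionary and loaded into a fresh model; any part covered by neither stays randomly initialized, and training proceeds on the target-language data only. The assumptions are the same as in the previous sketches.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

model = TinyRecognizer(vocab_size=len(finetune_vocab))        # fresh, randomly initialized model

# Initial values: encoder from S31 (mixed-language pre-training) and
# decoder from S32 (target-language pre-training); the rest stays random.
init = {}
init.update(torch.load("intermediate_encoder.pt"))
init.update(torch.load("intermediate_decoder.pt"))
model.load_state_dict(init, strict=False)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss(ignore_index=finetune_vocab["<pad>"])
loader = DataLoader(target, batch_size=64, shuffle=True, collate_fn=collate)

for epoch in range(10):
    for images, token_ids in loader:                           # target-language pairs only
        logits = model(images, token_ids[:, :-1])
        loss = criterion(logits.reshape(-1, logits.size(-1)),
                         token_ids[:, 1:].reshape(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

torch.save(model.state_dict(), "scene_text_model.pt")
```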
 As described above, pre-training the decoder on the target-language dataset allows the decoder to specialize in the target language. Combining decoder pre-training with encoder pre-training effectively improves the accuracy of scene character recognition in the target language.
<Estimation device 300>
 As shown in FIG. 10, the estimation device 300 of this embodiment takes an image as input, estimates the character string corresponding to the image using the scene character recognition model trained by the learning device 3 described above, and outputs the estimation result.
<Effects>
 With the learning devices described in the above embodiments, the encoder is pre-trained using not only the target-language dataset but also the auxiliary-language dataset. Since this increases the number of images the encoder learns from, the encoder, which is responsible for extracting image features, can be trained on data with diverse character shapes and background images, improving its robustness to input images. In addition, the decoder is pre-trained using the target-language dataset. The decoder is thereby specialized for the target language, and combined with the encoder pre-training this effectively improves the accuracy of scene character recognition in the target language.
 Verification experiments were conducted on scene character recognition models with the structures described in Non-Patent Document 1 and Non-Patent Document 2. The target language was Japanese, with approximately 70,000 sets of pair data prepared as training data; the auxiliary language was English, with approximately 8,000,000 sets of pair data prepared as training data. Character recognition accuracy was measured on approximately 7,000 target-language images not included in the training data, comparing a baseline trained from scratch on the target-language dataset alone (without the auxiliary-language dataset or pre-training) against models trained with the learning methods of Embodiments 1, 2, and 3. The exact-match accuracy rate was used for evaluation. The results of the verification experiments are shown in FIG. 11. As the figure shows, even the learning method of Embodiment 1, which simply adds the auxiliary-language dataset to the training data, improves the accuracy rate over the baseline. Furthermore, as in Embodiment 3, pre-training the encoder on the auxiliary-language and target-language datasets and pre-training the decoder on the target-language dataset improves the accuracy rate even further.
 FIG. 12 shows examples of recognition results for the scene character recognition model with the structure described in Non-Patent Document 2. As shown in (1) to (3) of FIG. 12, performing the pre-training process of Embodiment 1 prevents misrecognition. As shown in (4) of the same figure, performing the decoder pre-training process of Embodiment 3 prevents the recognition of word sequences that do not form valid phrases.
<Addendum>
 The apparatus of the present invention has, for example as a single hardware entity, an input unit to which a keyboard or the like can be connected, an output unit to which a liquid crystal display or the like can be connected, a communication unit to which a communication device (for example, a communication cable) capable of communicating with the outside of the hardware entity can be connected, a CPU (Central Processing Unit, which may include a cache memory, registers, and the like), RAM and ROM as memory, an external storage device such as a hard disk, and a bus that connects the input unit, output unit, communication unit, CPU, RAM, ROM, and external storage device so that data can be exchanged among them. If necessary, the hardware entity may also be provided with a device (drive) that can read from and write to a recording medium such as a CD-ROM. A physical entity equipped with such hardware resources is, for example, a general-purpose computer.
 The external storage device of the hardware entity stores the programs necessary for realizing the functions described above and the data needed to process these programs (the storage is not limited to the external storage device; for example, the programs may be stored in a ROM, which is a read-only storage device). Data obtained by the processing of these programs are stored as appropriate in the RAM, the external storage device, or the like.
 In the hardware entity, each program stored in the external storage device (or ROM, etc.) and the data necessary for processing each program are loaded into memory as needed, and the CPU interprets, executes, and processes them as appropriate. As a result, the CPU realizes the predetermined functions (the components expressed above as "... unit", "... means", and so on).
 The present invention is not limited to the above-described embodiments and can be modified as appropriate without departing from the spirit of the present invention. The processes described in the above embodiments are not only executed in chronological order according to the described sequence, but may also be executed in parallel or individually according to the processing capability of the device executing them or as needed.
 As already noted, when the processing functions of the hardware entity (the apparatus of the present invention) described in the above embodiments are realized by a computer, the processing contents of the functions that the hardware entity should have are described by a program. By executing this program on the computer, the processing functions of the hardware entity are realized on the computer.
 The various processes described above can be carried out by loading a program that executes each step of the above methods into the recording unit 10020 of the computer 10000 shown in FIG. 13 and causing the control unit 10010, the input unit 10030, the output unit 10040, and so on to operate.
 The program describing these processing contents can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any type, for example a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory. Specifically, for example, a hard disk device, a flexible disk, or a magnetic tape can be used as the magnetic recording device; a DVD (Digital Versatile Disc), DVD-RAM (Random Access Memory), CD-ROM (Compact Disc Read Only Memory), or CD-R (Recordable)/RW (ReWritable) as the optical disc; an MO (Magneto-Optical disc) as the magneto-optical recording medium; and an EEP-ROM (Electrically Erasable and Programmable-Read Only Memory) as the semiconductor memory.
 This program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. The program may also be distributed by storing it in the storage device of a server computer and transferring it from the server computer to other computers via a network.
 A computer that executes such a program first stores, for example, the program recorded on the portable recording medium or the program transferred from the server computer temporarily in its own storage device. When executing the processing, the computer reads the program stored in its own recording medium and executes processing according to the read program. As another execution form of the program, the computer may read the program directly from the portable recording medium and execute processing according to it, or it may sequentially execute processing according to the received program each time a program is transferred from the server computer. The above processing may also be executed by a so-called ASP (Application Service Provider) type service that realizes the processing functions only through execution instructions and result acquisition, without transferring the program from the server computer to this computer. The program in this embodiment includes information that is provided for processing by an electronic computer and that is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the computer's processing).
 In this embodiment, the hardware entity is configured by executing a predetermined program on a computer, but at least part of these processing contents may be realized in hardware.
 With respect to the above embodiments, the following supplementary items are further disclosed.
(Appendix 1)
 A learning device comprising:
 a memory; and
 at least one processor connected to the memory,
 wherein the processor
 acquires an auxiliary-language dataset, which is a dataset of image-text pair data in a language other than the target language (hereinafter referred to as the auxiliary language), and a target-language dataset, which is a dataset of image-text pair data in the target language, trains an intermediate language information recognition model, and outputs parameters of the intermediate language information recognition model as learned parameters; and
 acquires the target-language dataset and the learned parameters, sets the learned parameters as initial values of a language information recognition model, and trains the language information recognition model.
(Appendix 2)
 A non-transitory storage medium storing a program executable by a computer to perform a learning process, the learning process comprising:
 acquiring an auxiliary-language dataset, which is a dataset of image-text pair data in a language other than the target language (hereinafter referred to as the auxiliary language), and a target-language dataset, which is a dataset of image-text pair data in the target language, training an intermediate language information recognition model, and outputting parameters of the intermediate language information recognition model as learned parameters; and
 acquiring the target-language dataset and the learned parameters, setting the learned parameters as initial values of a language information recognition model, and training the language information recognition model.
(Appendix 3)
 The learning device according to appendix 1, wherein
 the intermediate language information recognition model and the language information recognition model are each composed of an encoder and a decoder,
 parameters of only the encoder part of the intermediate language information recognition model are output as learned parameters of an intermediate encoder, and
 the learned parameters of the intermediate encoder are given as initial values of the encoder of the language information recognition model.
(Appendix 4)
 The non-transitory storage medium according to appendix 2, wherein
 the intermediate language information recognition model and the language information recognition model are each composed of an encoder and a decoder,
 parameters of only the encoder part of the intermediate language information recognition model are output as learned parameters of an intermediate encoder, and
 the learned parameters of the intermediate encoder are given as initial values of the encoder of the language information recognition model.
(Appendix 5)
 The learning device according to appendix 3, wherein
 parameters of only the decoder part of the intermediate language information recognition model are output as learned parameters of an intermediate decoder, and
 the learned parameters of the intermediate decoder are given as initial values of the decoder of the language information recognition model.
(Appendix 6)
 The non-transitory storage medium according to appendix 4, wherein
 parameters of only the decoder part of the intermediate language information recognition model are output as learned parameters of an intermediate decoder, and
 the learned parameters of the intermediate decoder are given as initial values of the decoder of the language information recognition model.
(Appendix 7)
 An estimation device comprising:
 a memory; and
 at least one processor connected to the memory,
 wherein the processor takes an image as input and estimates a character string corresponding to the image using a language information recognition model trained by the learning device according to any one of appendices 1, 3, and 5.
(Appendix 8)
 A non-transitory storage medium that estimates a character string corresponding to an image using a language information recognition model trained by the non-transitory storage medium according to any one of appendices 2, 4, and 6.

Claims (7)

  1.  A learning device comprising:
      a pre-training unit that acquires an auxiliary-language dataset, which is a dataset of image-text pair data in a language other than the target language (hereinafter referred to as the auxiliary language), and a target-language dataset, which is a dataset of image-text pair data in the target language, trains an intermediate language information recognition model, and outputs parameters of the intermediate language information recognition model as learned parameters; and
      a fine-tuning unit that acquires the target-language dataset and the learned parameters, sets the learned parameters as initial values of a language information recognition model, and trains the language information recognition model.
  2.  The learning device according to claim 1, wherein
      the intermediate language information recognition model and the language information recognition model are each composed of an encoder and a decoder,
      the pre-training unit outputs parameters of only the encoder part of the intermediate language information recognition model as learned parameters of an intermediate encoder, and
      the fine-tuning unit gives the learned parameters of the intermediate encoder as initial values of the encoder of the language information recognition model.
  3.  The learning device according to claim 2, wherein
      the pre-training unit takes the target-language dataset as input, trains a second intermediate language information recognition model, and outputs parameters of only the decoder part of the second intermediate language information recognition model as learned parameters of an intermediate decoder, and
      the fine-tuning unit gives the learned parameters of the intermediate decoder as initial values of the decoder of the language information recognition model.
  4.  An estimation device that takes an image as input and estimates a character string corresponding to the image using a language information recognition model trained by the learning device according to any one of claims 1 to 3.
  5.  A learning method executed by a learning device, the method comprising:
      a step of acquiring an auxiliary-language dataset, which is a dataset of image-text pair data in a language other than the target language (hereinafter referred to as the auxiliary language), and a target-language dataset, which is a dataset of image-text pair data in the target language, training an intermediate language information recognition model, and outputting parameters of the intermediate language information recognition model as learned parameters; and
      a step of acquiring the target-language dataset and the learned parameters, setting the learned parameters as initial values of a language information recognition model, and training the language information recognition model.
  6.  A program that causes a computer to function as the learning device according to any one of claims 1 to 3.
  7.  A program that causes a computer to function as the estimation device according to claim 4.
PCT/JP2021/025614 2021-07-07 2021-07-07 Learning device, estimation device, learning method, and program WO2023281659A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/JP2021/025614 WO2023281659A1 (en) 2021-07-07 2021-07-07 Learning device, estimation device, learning method, and program
JP2023532949A JPWO2023281659A1 (en) 2021-07-07 2021-07-07

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2021/025614 WO2023281659A1 (en) 2021-07-07 2021-07-07 Learning device, estimation device, learning method, and program

Publications (1)

Publication Number Publication Date
WO2023281659A1 true WO2023281659A1 (en) 2023-01-12

Family

ID=84800480

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2021/025614 WO2023281659A1 (en) 2021-07-07 2021-07-07 Learning device, estimation device, learning method, and program

Country Status (2)

Country Link
JP (1) JPWO2023281659A1 (en)
WO (1) WO2023281659A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2017199149A (en) * 2016-04-26 2017-11-02 ヤフー株式会社 Learning device, learning method, and learning program
JP2019091434A (en) * 2017-11-14 2019-06-13 アドビ インコーポレイテッド Improved font recognition by dynamically weighting multiple deep learning neural networks
JP2019204147A (en) * 2018-05-21 2019-11-28 株式会社デンソーアイティーラボラトリ Learning apparatus, learning method, program, learnt model and lip reading apparatus


Also Published As

Publication number Publication date
JPWO2023281659A1 (en) 2023-01-12

Similar Documents

Publication Publication Date Title
US10657969B2 (en) Identity verification method and apparatus based on voiceprint
AU2019239454B2 (en) Method and system for retrieving video temporal segments
CN104765728B (en) The method trained the method and apparatus of neutral net and determine sparse features vector
WO2020052069A1 (en) Method and apparatus for word segmentation
US11157707B2 (en) Natural language response improvement in machine assisted agents
JP6812381B2 (en) Voice recognition accuracy deterioration factor estimation device, voice recognition accuracy deterioration factor estimation method, program
US11893813B2 (en) Electronic device and control method therefor
CN116561592B (en) Training method of text emotion recognition model, text emotion recognition method and device
CN111653270B (en) Voice processing method and device, computer readable storage medium and electronic equipment
CN110610006B (en) Morphological double-channel Chinese word embedding method based on strokes and fonts
JP7409381B2 (en) Utterance section detection device, utterance section detection method, program
CN110647613A (en) Courseware construction method, courseware construction device, courseware construction server and storage medium
CN117349402A (en) Emotion cause pair identification method and system based on machine reading understanding
CN112686060A (en) Text translation method and device, electronic equipment and storage medium
WO2023281659A1 (en) Learning device, estimation device, learning method, and program
CN115496077B (en) Multimode emotion analysis method and device based on modal observation and grading
CN112686059B (en) Text translation method, device, electronic equipment and storage medium
CN111401069A (en) Intention recognition method and intention recognition device for conversation text and terminal
CN117292679A (en) Training method of voice recognition model, voice recognition method and related equipment
Wang et al. Listen, Decipher and Sign: Toward Unsupervised Speech-to-Sign Language Recognition
WO2020162238A1 (en) Speech recognition device, speech recognition method, and program
CN113627155A (en) Data screening method, device, equipment and storage medium
CN108630192B (en) non-Chinese speech recognition method, system and construction method thereof
JP6526607B2 (en) Learning apparatus, learning method, and learning program
Sravani et al. Multimodal Sentimental Classification using Long-Short Term Memory

Legal Events

Date Code Title Description
WWE Wipo information: entry into national phase

Ref document number: 2023532949

Country of ref document: JP

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21949293

Country of ref document: EP

Kind code of ref document: A1