CN114626392B - End-to-end text image translation model training method - Google Patents

End-to-end text image translation model training method

Info

Publication number
CN114626392B
Authority
CN
China
Prior art keywords
text
image
features
source language
sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210193873.3A
Other languages
Chinese (zh)
Other versions
CN114626392A
Inventor
周玉 (Zhou Yu)
马聪 (Ma Cong)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zhongkefan Language Technology Co ltd
Original Assignee
Beijing Zhongkefan Language Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zhongkefan Language Technology Co ltd
Publication of CN114626392A
Application granted
Publication of CN114626392B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/126 Character encoding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Medical Informatics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)

Abstract

The present disclosure provides a method for training an end-to-end text-image translation model, including: preprocessing an image containing a source language text and the source language text to obtain a preprocessed sub-image and a preprocessed text character string; encoding a subimage containing a source language text by an image encoder to obtain image characteristics, and encoding a preprocessed text character string by a text encoder to obtain text characteristics; respectively coding the image characteristics and the text characteristics through a sequence characteristic coder to obtain image sequence characteristics and text sequence characteristics; calculating different loss values based on the image sequence characteristics and the text sequence characteristics; constructing a loss function based on different loss values; and updating parameters of the training model when training through the training model based on the loss function. The disclosure also provides an end-to-end text image translation model training device, an electronic device and a readable storage medium.

Description

End-to-end text image translation model training method
Technical Field
The disclosure relates to the technical field of natural language processing, in particular to an end-to-end text image translation model training method.
Background
Text image translation is the automatic translation, by a computer system, of source-language text contained in a picture or video into a target language. Text image translation technology helps people quickly and effectively translate and understand the text content in pictures and videos: text in one language appearing in an image or video can be quickly translated into a different language, aiding the understanding of people who use that different language.
The commonly used text image translation architecture cascades a text image recognition system and a machine translation system to translate the source language in a picture. However, the two subtasks of the cascaded system are trained independently on their respective training data sets, so the training domains of the subtasks are inconsistent. Moreover, deploying a cascaded system requires deploying two separate models, which increases deployment complexity, model storage space complexity and model decoding time complexity. Although the model space complexity of an end-to-end text image translation system is small, the performance of current end-to-end text image translation models is still poor owing to the lack of training data, model design issues and other problems. In addition, existing research and applications do not consider and model a characteristic of text image translation: text images with the same text content should have similar feature representations in the text image translation task, even though their fonts, background pictures, text directions and the like differ. Because text image translation and text translation are symmetric, a text image and the plain text containing the same text content should have similar feature representations at the encoding stage of translation. Correspondingly, source-language sentences with similar semantics should also have similar text feature encodings in the text translation encoding process.
Disclosure of Invention
In order to solve at least one of the above technical problems, the present disclosure provides an end-to-end text-to-image translation model training method, apparatus, electronic device, and readable storage medium.
According to one aspect of the disclosure, a method for training an end-to-end text-to-image translation model is provided, which includes:
preprocessing an image containing a source language text to obtain a sub-image containing the source language text, preprocessing the source language text corresponding to the image containing the source language text, and obtaining a preprocessed text character string;
encoding the subimage containing the source language text through an image encoder to obtain image characteristics, and encoding the preprocessed text character string through a text encoder to obtain text characteristics;
respectively encoding the image features and the text features through an image sequence feature encoder and a text sequence feature encoder to obtain image sequence features corresponding to the image features and text sequence features corresponding to the text features;
calculating different loss values based on the image sequence characteristics and the text sequence characteristics;
constructing a loss function based on the different loss values;
updating parameters of a training model when training by the training model based on the loss function.
According to at least one embodiment of the present disclosure, an end-to-end text-to-image translation model training method for preprocessing an image including a source language text to obtain a sub-image including the source language text includes:
carrying out size adjustment on an image containing a source language text by an image scaling method;
obtaining the position of an area where a source language text in the image is located by a text detection method, and performing image segmentation on the located area to obtain a sub-image;
and rearranging the texts in the sub-images according to a preset direction.
According to at least one embodiment of the present disclosure, an end-to-end text-to-image translation model training method for preprocessing a source language text corresponding to an image including the source language text to obtain a preprocessed text character string includes:
standardizing punctuation marks contained in the source language text corresponding to the image containing the source language text;
segmenting the source language text corresponding to the image containing the source language text;
judging, for each word obtained by the word segmentation processing, whether the word is an unknown word, and replacing the unknown word with a marker symbol if it is;
wherein the unknown words refer to words which appear in the source language text and cannot be matched in a standard vocabulary library.
According to at least one embodiment of the present disclosure, an end-to-end text-to-image translation model training method, which encodes an image feature and a text feature through an image sequence feature encoder and a text sequence feature encoder, respectively, to obtain an image sequence feature corresponding to the image feature and a text sequence feature corresponding to the text feature, includes:
judging whether the image sequence feature encoder and the text sequence feature encoder are the same sequence feature encoder;
if the image sequence feature encoder and the text sequence feature encoder are the same sequence feature encoder, processing the image features and the text features through feature transformation to enable hidden layer dimensions of the image features and hidden layer dimensions of the text features to be consistent;
if the image sequence feature encoder and the text sequence feature encoder are different sequence feature encoders, the image features and the text features are encoded respectively by the image sequence feature encoder and the text sequence feature encoder, and the encoded image features and the encoded text features are subjected to feature transformation processing, so that hidden layer dimensions of the image features and hidden layer dimensions of the text features are kept consistent.
According to the training method of the end-to-end text image translation model of at least one embodiment of the present disclosure, based on the image sequence features and the text sequence features, different loss values are calculated, and the method comprises the following steps:
and calculating the image-to-image contrast loss, the text-to-text contrast loss and the image-to-text contrast loss based on the image sequence features and the text sequence features.
According to the method for training the end-to-end text image translation model, different loss values are calculated based on the image sequence characteristics and the text sequence characteristics, and the method comprises the following steps:
decoding through a decoder based on the image sequence characteristics and the text sequence characteristics to obtain a corresponding decoded target language, and calculating end-to-end text image translation loss and end-to-end text translation loss based on the decoded target language;
wherein the end-to-end text image translation loss is calculated based on the target language obtained by decoding the image sequence features and the target language standard answer, and the end-to-end text translation loss is calculated based on the target language obtained by decoding the text sequence features and the target language standard answer.
According to the training method of the end-to-end text image translation model of at least one embodiment of the present disclosure, based on the different loss values, a loss function is constructed, including:
and constructing the loss function by a weighted summation method based on the different loss values.
According to still another aspect of the present disclosure, there is provided an end-to-end text-image translation model training apparatus, including:
the preprocessing module is used for acquiring a sub-image containing the source language text by performing text detection and image segmentation processing on an image containing the source language text, preprocessing the source language text corresponding to the image containing the source language text and acquiring a preprocessed text character string;
the characteristic acquisition module is used for encoding the subimage containing the source language text through an image encoder to acquire image characteristics, and encoding the preprocessed text character string through a text encoder to acquire text characteristics;
the sequence feature coding module is used for coding the image features and the text features respectively through an image sequence feature coder and a text sequence feature coder to obtain image sequence features corresponding to the image features and text sequence features corresponding to the text features;
the loss calculation module is used for calculating different loss values based on the image sequence characteristics and the text sequence characteristics;
a loss function constructing module, which constructs a loss function based on the different loss values;
and the training module is used for updating the parameters of the training model when training is carried out through the training model on the basis of the loss function.
According to at least one embodiment of the present disclosure, an end-to-end text-to-image translation model training apparatus for preprocessing an image including a source language text to obtain a sub-image including the source language text includes:
carrying out size adjustment on an image containing a source language text by an image scaling method;
obtaining the position of an area where a source language text in the image is located by a text detection method, and carrying out image segmentation on the area where the source language text is located to obtain sub-images;
and rearranging the texts in the sub-images according to a preset direction.
According to at least one embodiment of the present disclosure, an end-to-end text-to-image translation model training device preprocesses a source language text corresponding to an image including the source language text, and acquires a text string after preprocessing, including:
standardizing punctuation marks contained in the source language text corresponding to the image containing the source language text;
segmenting the source language text corresponding to the image containing the source language text; and
judging, for each word obtained by the word segmentation processing, whether the word is an unknown word, and replacing the unknown word with a marker symbol if it is;
wherein the unknown words refer to words which appear in the source language text and cannot be matched in a standard vocabulary library.
According to the end-to-end text image translation model training device of at least one embodiment of the present disclosure, the image features and the text features are encoded by an image sequence feature encoder and a text sequence feature encoder respectively, and image sequence features corresponding to the image features and text sequence features corresponding to the text features are obtained, which includes:
judging whether the image sequence feature encoder and the text sequence feature encoder are the same sequence feature encoder;
if the image sequence feature encoder and the text sequence feature encoder are the same sequence feature encoder, processing the image features and the text features through feature transformation to enable hidden layer dimensions of the image features and hidden layer dimensions of the text features to be consistent;
if the image sequence feature encoder and the text sequence feature encoder are different sequence feature encoders, the image features and the text features are encoded respectively through the image sequence feature encoder and the text sequence feature encoder, and feature transformation processing is performed on the encoded image features and the encoded text features, so that hidden layer dimensions of the image features and hidden layer dimensions of the text features are kept consistent.
According to the end-to-end text image translation model training device of at least one embodiment of the present disclosure, calculating different loss values based on the image sequence features and the text sequence features includes:
calculating the image-to-image contrast loss, the text-to-text contrast loss and the image-to-text contrast loss based on the image sequence features and the text sequence features.
According to the training device for the end-to-end text image translation model of at least one embodiment of the present disclosure, different loss values are calculated based on the image sequence features and the text sequence features, and the method includes:
decoding through a decoder based on the image sequence characteristics and the text sequence characteristics to obtain a corresponding decoded target end language, and calculating end-to-end text image translation loss and end-to-end text translation loss based on the decoded target language;
wherein the end-to-end text image translation loss is calculated based on the target language obtained by decoding the image sequence features and the target language standard answer, and the end-to-end text translation loss is calculated based on the target language obtained by decoding the text sequence features and the target language standard answer.
According to the training device for the end-to-end text-image translation model of at least one embodiment of the present disclosure, based on the different loss values, a loss function is constructed, including:
and constructing the loss function by a weighted summation method based on the different loss values.
According to yet another aspect of the present disclosure, there is provided an electronic device including:
a memory storing execution instructions;
a processor executing execution instructions stored by the memory to cause the processor to perform any of the methods described above.
According to yet another aspect of the present disclosure, there is provided a readable storage medium having stored therein executable instructions, which when executed by a processor, are configured to implement the method of any one of the above.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this specification, illustrate exemplary embodiments of the disclosure and together with the description serve to explain the principles of the disclosure.
FIG. 1 is a flow diagram of a method of end-to-end text-to-image translation model training according to one embodiment of the present disclosure.
FIG. 2 is a flow diagram of a method for pre-processing an image containing source language text according to one embodiment of the present disclosure.
FIG. 3 is a flowchart illustrating a method for preprocessing source language text corresponding to an image containing the source language text, according to one embodiment of the present disclosure.
Fig. 4 is a flowchart illustrating a method for encoding image features and text features by a sequence feature encoder to obtain image sequence features corresponding to the image features and text sequence features corresponding to the text features according to an embodiment of the present disclosure.
FIG. 5 is a flow diagram of a method for constructing a loss function based on different loss values, according to one embodiment of the present disclosure.
Fig. 6 is a schematic structural diagram of an apparatus for end-to-end text-image translation model training according to an embodiment of the present disclosure.
FIG. 7 is an end-to-end text-image translation model training architecture diagram according to one embodiment of the present disclosure.
FIG. 8 is a schematic diagram of a text encoder architecture according to one embodiment of the present disclosure.
Fig. 9 is a schematic diagram of an image encoder structure according to an embodiment of the present disclosure.
FIG. 10 is a schematic diagram of a self-attention based sequence feature encoder architecture, according to one embodiment of the present disclosure.
Fig. 11 is a schematic diagram of a decoder structure according to an embodiment of the present disclosure.
FIG. 12 is a schematic illustration of data augmentation according to one embodiment of the present disclosure.
FIG. 13 is a schematic illustration of a given source language text generating an image containing the given source language text according to one embodiment of the present disclosure.
Description of the reference numerals
1002. Pre-processing module
1004. Feature acquisition module
1006. Sequence feature coding module
1008. Loss calculation module
1010. Loss function building block
1012. Training module
1100. Bus line
1200. Processor
1300. Memory device
1400. Other circuits.
Detailed Description
The present disclosure will be described in further detail with reference to the drawings and embodiments. It is to be understood that the specific embodiments described herein are for purposes of illustration only and are not to be construed as limitations of the present disclosure. It should be further noted that, for the convenience of description, only the portions relevant to the present disclosure are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present disclosure may be combined with each other without conflict. Technical solutions of the present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Unless otherwise indicated, the illustrated exemplary embodiments/examples are to be understood as providing exemplary features of various details of some ways in which the technical concepts of the present disclosure may be practiced. Thus, unless otherwise indicated, the features of the various embodiments/examples may be additionally combined, separated, interchanged, and/or rearranged without departing from the technical concept of the present disclosure.
The use of cross-hatching and/or shading in the drawings is generally used to clarify the boundaries between adjacent components. As such, unless otherwise noted, the presence or absence of cross-hatching or shading does not convey or indicate any preference or requirement for a particular material, material property, size, proportion, commonality between the illustrated components and/or any other characteristic, attribute, property, etc., of a component. Further, in the drawings, the size and relative sizes of components may be exaggerated for clarity and/or descriptive purposes. While example embodiments may be practiced differently, the specific process sequence may be performed in a different order than that described. For example, two processes described consecutively may be performed substantially simultaneously or in reverse order to that described. In addition, like reference numerals denote like parts.
When an element is referred to as being "on" or "on," "connected to" or "coupled to" another element, it can be directly on, connected or coupled to the other element or intervening elements may be present. However, when an element is referred to as being "directly on," "directly connected to" or "directly coupled to" another element, there are no intervening elements present. For purposes of this disclosure, the term "connected" may refer to physically connected, electrically connected, and the like, with or without intervening components.
The terminology used herein is for the purpose of describing particular embodiments and is not intended to be limiting. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. Furthermore, when the terms "comprises" and/or "comprising" and variations thereof are used in this specification, the presence of stated features, integers, steps, operations, elements, components and/or groups thereof are stated but does not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or groups thereof. It is also noted that, as used herein, the terms "substantially," "about," and other similar terms are used as approximate terms and not as degree terms, and as such, are used to interpret inherent deviations in measured values, calculated values, and/or provided values that would be recognized by one of ordinary skill in the art.
Fig. 1 is a schematic diagram of an end-to-end text-to-image translation model training method according to an embodiment of the present disclosure. As shown in fig. 1, the end-to-end text-image translation model training method S10 includes the following steps.
In step S100, for a given source language text, a text image containing the given source language content (hereinafter, an image containing source language text) is constructed, and a source language sentence (i.e., a source language text) whose semantics correspond to the given source language text is constructed (here, corresponding means semantically similar rather than identical). This includes the following processing steps.
Given a source language text, a source language text image containing the text is first generated. Specifically, the font, background image, font size, font direction, font color, degree of image blurring and the like of the text in the text image are determined from a font library and a background image library, and the text image to be synthesized is rendered by superimposing the text on the background according to these effects.
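A minimal sketch of this synthesis step is given below, assuming the Pillow imaging library; the font and background paths, the layout offsets and the probability of applying blur are illustrative placeholders rather than values prescribed by the method.

```python
from PIL import Image, ImageDraw, ImageFont, ImageFilter
import random

def render_text_image(text, font_paths, background_paths, size=(256, 32)):
    """Render a source-language sentence onto a random background with a random style."""
    bg = Image.open(random.choice(background_paths)).convert("RGB").resize(size)
    font = ImageFont.truetype(random.choice(font_paths), size=random.randint(16, 24))
    draw = ImageDraw.Draw(bg)
    color = tuple(random.randint(0, 120) for _ in range(3))  # darker colors keep the text legible
    draw.text((4, 4), text, font=font, fill=color)           # superimpose the text on the background
    if random.random() < 0.3:                                 # occasionally blur the synthesized image
        bg = bg.filter(ImageFilter.GaussianBlur(radius=1))
    return bg
```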
A source language text with semantics similar to the given source language sentence is obtained using two text machine translation models. Specifically, two text machine translation models are trained: a source-to-target text machine translation model and a target-to-source text machine translation model. The given source language sentence is translated into the target language by the source-to-target model, and the translated target language sentence is then translated back into a source language sentence with similar semantics by the target-to-source model. Multiple source language sentences with similar semantics are obtained by setting the beam size of the machine translation beam search.
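The round-trip translation can be sketched as follows; `src2tgt_model` and `tgt2src_model` are hypothetical translation models with an assumed `translate(sentences, beam_size, n_best)` interface, since the method does not prescribe a particular toolkit.

```python
def generate_paraphrases(src_sentence, src2tgt_model, tgt2src_model, beam_size=5, n_best=3):
    """Back-translate a source sentence to obtain semantically similar source sentences."""
    tgt_hyps = src2tgt_model.translate([src_sentence], beam_size=beam_size, n_best=n_best)
    paraphrases = []
    for tgt in tgt_hyps:
        # translating the target hypothesis back yields source sentences with similar semantics
        paraphrases.extend(tgt2src_model.translate([tgt], beam_size=beam_size, n_best=n_best))
    return paraphrases
```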
Preferably, in step S100, data augmentation is further performed on the source text images and the source language texts to alleviate the sparsity of image text translation data. FIG. 12 is a schematic diagram of data augmentation. As shown in FIG. 12, the given sentence "Happily, it all company right well" generates two text images that contain the source text content but differ in font, background and the like, and the translation method yields a source language text with similar semantics: "Happily, evenything well". FIG. 13 shows examples of the various effects used when synthesizing text images.

In step S102, an image containing a source language text is preprocessed to obtain a sub-image containing the source language text, and the source language text corresponding to the image containing the source language text is preprocessed to obtain a preprocessed text string.
In step S102, the image including the source language text is preprocessed to obtain a sub-image including the source language text, and a specific implementation method is as shown in fig. 2, and includes the following steps.
In step S1022, image preprocessing is performed on the input source language text image, including resizing the image containing the source language text by an image scaling method. Specifically, the image size change enlarges or reduces the image to a predefined image size, e.g. a predefined size of img_W × img_H, where img_W is the predefined image width and img_H is the predefined image height.
In step S1024, the position of the region where the source language text is located in the image is obtained by a text detection method, and the region is segmented to obtain sub-images. First, the position of the text in the image is detected by text detection; for example, the contour points of the text-containing sub-image within the input source language image are obtained by a text detection algorithm. Then, the image region containing the text is cut out by image segmentation to obtain a sub-image.
In step S1026, the text in the sub-images is rearranged in a predetermined direction. The coordinates of the text contour and the corresponding pixels in the original image are mapped to a horizontal position through a coordinate transformation, while the pixel values of non-text regions are discarded; for the horizontally mapped text image, missing pixels are estimated with an interpolation algorithm.
for each image, after image preprocessing, the obtained output result is as follows: each image is represented as a matrix I of size I W ×I H X C, wherein I W ,I H C respectively represents the width, height and color channel of the sub-image obtained by segmentation, the image has no RGB color image, and correspondingly, the color channel is RGB three channels. Image size adjustment, detection, interception and correction of texts in the images. Text detection in an image detects the location of text in an image. Correction is to re-render the text in the segmented sub-images in a horizontal manner. For each image, after image preprocessing, the obtained output result is as follows: each image is represented as a matrix I of size I W ×I H X C, wherein I W ,I H And C respectively represents the width, height and color channel of the sub-image obtained by segmentation, the image has no RGB color image, and correspondingly, the color channel is RGB three channel.
In step S102, a source language text corresponding to an image including the source language text is preprocessed to obtain a text character string after preprocessing, and a specific embodiment is as shown in fig. 3, which includes the following steps.
In step S1021, the punctuation marks included in the source language text corresponding to the image including the source language text are normalized.
In step S1023, the source language text corresponding to the image containing the source language text is segmented.
In step S1025, it is determined whether each word belongs to an unknown word after the word segmentation processing, and if it belongs to an unknown word, the unregistered word is replaced with a token.
The unknown words refer to words appearing in the source language text that cannot be matched against the standard vocabulary library (the vocabulary to be matched against). After the processing of step S102, the output of the text preprocessing is: each sentence is expressed as a processed character string T, whose length is the sentence length L_T measured in the corresponding word segmentation units.
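A minimal sketch of this text preprocessing is given below; the punctuation map, the tokenizer and the vocabulary are illustrative placeholders, since the method does not fix a particular segmentation unit or vocabulary.

```python
PUNCT_MAP = {"，": ",", "。": ".", "！": "!", "？": "?"}  # normalize full-width punctuation
UNK = "<unk>"                                             # marker symbol for unknown words

def preprocess_text(sentence, tokenize, vocab):
    """Return the processed token string of length L_T for one source-language sentence."""
    for src, tgt in PUNCT_MAP.items():
        sentence = sentence.replace(src, tgt)
    tokens = tokenize(sentence)                              # word / sub-word segmentation
    return [tok if tok in vocab else UNK for tok in tokens]  # replace unknown words
```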
In step S104, the sub-image containing the source language text is encoded by the image encoder to obtain the image feature, and the preprocessed text string is encoded by the text encoder to obtain the text feature. In a specific implementation, the image encoder is shown in FIG. 9. In a specific implementation, the text encoder is shown in FIG. 8. The following is a specific embodiment of step S104.
The input text image containing the source language is feature-encoded by an image encoder. The network structure of the image encoder is not limited: it may be a convolutional-neural-network-based image encoder or a self-attention-based image encoder. The image features produced by the image encoder are generally a multi-channel feature map of dimension F_W × F_H × F_C, where F_W, F_H and F_C are the width, height and number of feature channels of the feature map. Before being input into the sequence feature encoder, the feature map is reshaped by an affine transformation into a matrix of dimension d_h^I × L^I, where d_h^I and L^I respectively denote the hidden-layer dimension and the sequence length of the image features.

The input source language text is feature-encoded by a text encoder, i.e. the input character string is vectorized. Specifically, each word segmentation unit (such as a character unit, sub-word unit or word unit) is represented as a vector through an embedding matrix, so that each source language sentence can be represented as a matrix of dimension d_h^T × L^T, where d_h^T and L^T respectively denote the hidden dimension and the sequence length of the text features. If a shared sequence feature encoder is used in the subsequent sequence feature encoding, then d_h^I = d_h^T; otherwise there is no constraint on the feature dimensions of the image features and the text features.
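The two encoders and the reshaping of the image feature map into a feature sequence can be sketched as follows; the small CNN, the vocabulary size and the hidden dimension are illustrative stand-ins, not the encoders prescribed by the method.

```python
import torch
import torch.nn as nn

d_h = 512                                     # shared hidden dimension (illustrative)

image_encoder = nn.Sequential(                # stand-in for a CNN or self-attention image encoder
    nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(64, d_h, 3, stride=2, padding=1), nn.ReLU(),
)
text_embedding = nn.Embedding(32000, d_h)     # stand-in for the text encoder (embedding matrix)

def encode(image, token_ids):
    """image: B x 3 x I_H x I_W tensor; token_ids: B x L_T integer tensor."""
    fmap = image_encoder(image)               # B x F_C x F_H x F_W multi-channel feature map
    B, C, H, W = fmap.shape
    img_feat = fmap.view(B, C, H * W).transpose(1, 2)  # B x L_I x d_h image feature sequence
    txt_feat = text_embedding(token_ids)               # B x L_T x d_h text feature sequence
    return img_feat, txt_feat
```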
In step S106, the image sequence feature encoder and the text sequence feature encoder encode the image features and the text features respectively, obtaining image sequence features corresponding to the image features and text sequence features corresponding to the text features. Preferably, the sequence encoder can be shared between image and text, or independent sequence encoders can be used for image and text.
The following is a specific embodiment of step S106.
In step S1062, it is determined whether the image sequence feature encoder and the text sequence feature encoder are the same sequence feature encoder. In a specific implementation, it may be determined whether the current sequence feature encoder shares the model parameter.
In step S1064, if the image sequence feature encoder and the text sequence feature encoder are the same sequence feature encoder, the image features and the text features are processed through a feature transformation so that the hidden-layer dimensions of the image features and of the text features remain consistent. In a specific implementation, if the model parameters are shared, the model parameters of the image sequence feature encoder and the text sequence feature encoder are shared according to the sharing setting. The input image features and text features are encoded, for example, by a self-attention-based sequence feature encoder, and the encoded features are output as follows: each input feature is encoded as a matrix of size F_h × F_L, where F_h is the hidden-layer feature dimension of the sequence feature encoder and F_L is the length of the feature encoding output by the sequence feature encoder. In a specific implementation, a self-attention-based sequence feature encoder is shown in FIG. 10.

In step S1066, if the image sequence feature encoder and the text sequence feature encoder are different sequence feature encoders, the image features and the text features are encoded by the image sequence feature encoder and the text sequence feature encoder respectively, and the encoded image features and text features undergo a feature transformation so that the hidden-layer dimensions of the image features and of the text features remain consistent. In a specific implementation, if the model parameters are not shared, different model parameters are used for the different sequence feature encoders.
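A sketch of a self-attention sequence feature encoder that is either shared between the two modalities or instantiated separately is shown below; the layer and head counts are illustrative.

```python
import torch.nn as nn

def build_sequence_encoders(d_h=512, n_layers=6, n_heads=8, share=True):
    """Return (image sequence feature encoder, text sequence feature encoder)."""
    img_seq_encoder = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=d_h, nhead=n_heads, batch_first=True),
        num_layers=n_layers)
    if share:
        # sharing the encoder object means sharing its model parameters between modalities
        return img_seq_encoder, img_seq_encoder
    txt_seq_encoder = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=d_h, nhead=n_heads, batch_first=True),
        num_layers=n_layers)
    return img_seq_encoder, txt_seq_encoder
```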
In step S108, different loss values are calculated based on the image sequence features and the text sequence features. The different loss values include the image-to-image contrast loss, the text-to-text contrast loss, the image-to-text contrast loss, the end-to-end text image translation loss and the end-to-end text translation loss. The modality contrast learning computes intra-modal and inter-modal contrastive learning losses between the sequence features of the text images and of the texts. Each loss value is calculated as follows.
The image-to-image contrast loss is calculated as follows. Let $h_i^I$ be the image feature of the i-th picture, $h_i^{I+}$ the image feature of a text image with semantics similar to $h_i^I$, and $h_k^{I-}$ the image features of other text images. The intra-modal loss between text images is calculated as

$$\mathcal{L}_{II} = -\log \frac{\exp\big(d(h_i^I, h_i^{I+})/\tau\big)}{\sum_{k=1}^{K} \exp\big(d(h_i^I, h_k^{I-})/\tau\big)}$$

where K is the size of the data pool sampled for negative examples (it may, for example, be set to the batch size), τ is a temperature hyper-parameter, and d(·) is a similarity function such as cosine similarity or Euclidean-distance similarity.
The text-to-text contrast loss is calculated as follows. Let $h_i^T$ be the text feature of the i-th source language sentence, $h_i^{T+}$ the text feature of a text with semantics similar to $h_i^T$, and $h_k^{T-}$ the text features of other texts. The intra-modal loss between texts is calculated as

$$\mathcal{L}_{TT} = -\log \frac{\exp\big(d(h_i^T, h_i^{T+})/\tau\big)}{\sum_{k=1}^{K} \exp\big(d(h_i^T, h_k^{T-})/\tau\big)}$$

where K is the size of the data pool sampled for negative examples (it may, for example, be set to the batch size), τ is a temperature hyper-parameter, and d(·) is a similarity function such as cosine similarity or Euclidean-distance similarity.
The contrast loss between images and texts is calculated as follows. Let $h_i^I$ be the image feature of the i-th text image containing source language text, $h_i^T$ the text feature of the source language text contained in that text image, and $h_k^{I-}$ the image features of other text images. The inter-modal loss between text images and texts is calculated as

$$\mathcal{L}_{TI} = -\log \frac{\exp\big(d(h_i^T, h_i^{I})/\tau\big)}{\sum_{k=1}^{K} \exp\big(d(h_i^T, h_k^{I-})/\tau\big)}$$

where K is the size of the data pool sampled for negative examples, τ is a temperature hyper-parameter, and d(·) is a similarity expression such as cosine similarity or Euclidean-distance similarity.
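These three losses share the standard contrastive (InfoNCE-style) form, so they can all be computed by one helper such as the sketch below; cosine similarity as d(·), pooled sequence features and in-batch negatives are assumptions of this sketch rather than requirements of the method.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(anchor, positive, negatives, tau=0.07):
    """anchor, positive: B x d pooled sequence features; negatives: B x K x d."""
    pos = F.cosine_similarity(anchor, positive, dim=-1) / tau                 # B
    neg = F.cosine_similarity(anchor.unsqueeze(1), negatives, dim=-1) / tau   # B x K
    logits = torch.cat([pos.unsqueeze(1), neg], dim=1)                        # B x (1 + K)
    labels = torch.zeros(anchor.size(0), dtype=torch.long, device=anchor.device)
    return F.cross_entropy(logits, labels)   # -log( exp(pos) / sum over all candidates )

# L_II: image vs. semantically similar image, negatives = other images
# L_TT: text vs. paraphrase, negatives = other texts
# L_TI: text vs. the image containing it, negatives = other images, e.g.
#   loss_ti = contrastive_loss(txt_pooled, img_pooled, other_img_pooled)
```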
The end-to-end text image translation loss and the end-to-end text translation loss are calculated through the following steps.
First, decoding is performed by a decoder based on the image sequence features and the text sequence features to obtain the corresponding decoded target language. Preferably, decoding the image sequence features and the text sequence features by the decoder comprises two steps. In the first step, it is determined whether the current decoder shares model parameters; if so, the model parameters of the text image decoder and the text decoder are shared according to the sharing setting, otherwise different decoder model parameters are used. In the second step, the text image sequence features and the text sequence features are decoded respectively; the decoded features are output as a matrix of dimension d_h^dec × L^dec, where d_h^dec and L^dec respectively denote the hidden-layer dimension and the sequence length of the decoder output. In a specific implementation, the decoder is shown in FIG. 11.
Secondly, end-to-end text image translation loss and end-to-end text translation loss are calculated based on the decoded target language and the target language standard answers. In a specific implementation, the calculation method is as follows.
The end-to-end text image translation loss and the end-to-end text translation loss are computed from the decoded target language as follows. The text image translation loss is calculated from the target language standard answers and the target language obtained by decoding the text image features:

$$\mathcal{L}_{TIT} = -\frac{1}{|D_{TIT}|}\sum_{i=1}^{|D_{TIT}|} \log P(Y_i \mid I_i)$$

where $\mathcal{L}_{TIT}$ denotes the training loss of text image translation, $I_i$ and $Y_i$ respectively denote the i-th text image and the corresponding target language translation, and $|D_{TIT}|$ denotes the number of training samples contained in the text image translation data set. The target language standard answer is the target language of a completely correct translation; the target language actually decoded by the feature decoder may differ from it.
The end-to-end text translation loss is calculated from the target language standard answers and the target language obtained by decoding the text features:

$$\mathcal{L}_{MT} = -\frac{1}{|D_{MT}|}\sum_{i=1}^{|D_{MT}|} \log P(Y_i \mid T_i)$$

where $\mathcal{L}_{MT}$ denotes the training loss of text translation, $T_i$ and $Y_i$ respectively denote the i-th source text sentence and the corresponding target language translation, and $|D_{MT}|$ denotes the number of training samples contained in the text translation data set. As above, the target language standard answer is the target language of a completely correct translation, and the target language actually decoded by the feature decoder may differ from it.
In step S110, a loss function is constructed based on the different loss values. In the specific implementation, the method comprises the following steps:
In step S1102, the different modal contrast losses are weighted and summed:

$$\mathcal{L}_{MCL} = \frac{1}{|D_{TIT}|}\sum_{i=1}^{|D_{TIT}|}\big(\lambda_{II}\,\mathcal{L}_{II} + \lambda_{TT}\,\mathcal{L}_{TT} + \lambda_{TI}\,\mathcal{L}_{TI}\big)$$

where $|D_{TIT}|$ is the number of training samples contained in the text image translation data set, and $\lambda_{II}$, $\lambda_{TT}$ and $\lambda_{TI}$ respectively denote the weights of the intra-image-modal contrast loss, the intra-text-modal contrast loss and the image-text inter-modal contrast loss.
In step S1104, the modality contrast learning losses and the translation losses are fused to obtain the final training loss, i.e., the loss function. Specifically, the image translation loss, the modal contrast loss and the text translation loss are weighted and summed to obtain the final training loss function:

$$\mathcal{L} = \lambda_{TIT}\,\mathcal{L}_{TIT} + \lambda_{MCL}\,\mathcal{L}_{MCL} + \lambda_{MT}\,\mathcal{L}_{MT}$$

where $\lambda_{TIT}$, $\lambda_{MCL}$ and $\lambda_{MT}$ respectively denote the weights of the text image translation loss, the modal contrast loss and the text translation loss.
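Reusing the loss terms from the sketches above, the final objective is a weighted sum; the weight values below are illustrative, not prescribed by the method.

```python
lambda_ii, lambda_tt, lambda_ti = 1.0, 1.0, 1.0      # intra-/inter-modal contrast weights
lambda_tit, lambda_mcl, lambda_mt = 1.0, 0.1, 1.0    # image translation / contrast / text translation weights

loss_mcl = lambda_ii * loss_ii + lambda_tt * loss_tt + lambda_ti * loss_ti
loss_total = lambda_tit * loss_tit + lambda_mcl * loss_mcl + lambda_mt * loss_mt
```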
In step S112, the parameters of the training model are updated when training is performed by the training model based on the loss function.
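One parameter update under this objective can be sketched as follows; `model` is a placeholder for the combined encoders and decoders, and the optimizer and learning rate are illustrative choices.

```python
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

optimizer.zero_grad()
loss_total.backward()   # back-propagate the combined training loss
optimizer.step()        # update the parameters of the training model
```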
The end-to-end text image translation model training method can fully exploit the semantic similarity relations between text images and texts, encouraging the model to learn feature representations that are close between text images with similar semantics, between texts with similar semantics, and between text images and texts with similar semantics, thereby improving the performance of the end-to-end text image translation model. The advantages of end-to-end text image translation are retained: compared with a cascaded system, the space complexity and time complexity of the end-to-end model are small. In addition, the training efficiency of the end-to-end model is improved, and by introducing modality contrast learning the end-to-end text image translation model can learn better text image translation knowledge.
Fig. 6 is a schematic structural diagram of an end-to-end text-image translation model training device according to an embodiment of the present disclosure.
As shown in FIG. 6, the end-to-end text-to-image translation model training device includes the following modules.
The preprocessing module 1002 is configured to preprocess an image including a source language text to obtain a sub-image including the source language text, preprocess the source language text corresponding to the image including the source language text, and obtain a preprocessed text character string.
The feature obtaining module 1004 obtains image features by encoding a sub-image containing a source language text with an image encoder, and obtains text features by encoding a preprocessed text string with a text encoder.
The sequence feature encoding module 1006 encodes the image features and the text features through an image sequence feature encoder and a text sequence feature encoder, respectively, and obtains image sequence features corresponding to the image features and text sequence features corresponding to the text features.
The loss calculation module 1008 calculates different loss values based on the image sequence characteristics and the text sequence characteristics. The loss calculating module 1008 includes a decoder module, and the decoder module performs feature decoding on the text image sequence feature and the text sequence feature respectively to obtain a decoded target language, where the decoded target language may have a certain difference from a standard answer of the target language. When the image translation loss and the text translation loss are calculated, calculation is carried out based on the decoded target language and the target language standard answer. The details are consistent with the calculation method in the end-to-end text-to-image translation model training method provided by the disclosure.
A loss function constructing module 1010, which constructs a loss function based on different loss values;
the training module 1012 updates parameters of the training model as it is trained by the training model based on the loss function.
In the embodiment of fig. 6, the processing procedures involved in the respective modules are consistent with the processing procedures of the end-to-end text image translation model training method based on state contrast learning provided by the present disclosure, and are not described herein again.
Fig. 7 is a schematic structural diagram of an end-to-end text-to-image translation model training apparatus according to an embodiment of the present disclosure.
As shown in fig. 7, an image containing source language text is preprocessed and input to the image encoder, and the source language text corresponding to that image is preprocessed and input to the text encoder. The outputs of the image encoder and the text encoder (image features and text features) are input to the sequence encoder (sequence feature encoder) to obtain image sequence features and text sequence features, and the different contrast loss values (intra-modal loss and inter-modal loss) are calculated based on the sequence encoder outputs. Meanwhile, the sequence features are input to the decoder for decoding; after decoding, the decoding result (the translated target language) is obtained, and the image translation loss and the text translation loss are calculated from the decoding result and the standard answer. A loss function is then constructed from the image translation loss, the text translation loss, the inter-modal loss and the intra-modal loss. Finally, the parameters are updated: the training loss is back-propagated and the model parameters are updated using a stochastic gradient descent optimization algorithm.
By the end-to-end text image translation model training method provided by the disclosure, the influence of fusing different training losses on the text image translation result is verified on the synthesized text image translation test set. Specific different settings are as follows.
Model set 1 is an end-to-end text-image translation model. Specifically, the pre-processing module uses a residual convolutional network (ResNet); the sequence characteristic encoder module uses an encoder based on a self-attention mechanism; the decoder module uses a self-attention mechanism based decoder. The training loss function contains only text-image translation losses.
Model set 2 is an end-to-end text-image translation model incorporating contrast loss within the text modality. In particular, the pre-processing module uses a residual convolutional network (ResNet); the sequence characteristic encoder module uses an encoder based on a self-attention mechanism; the decoder module uses a self-attention mechanism based decoder. The training loss function comprises text image translation loss, and only contrast loss in a text mode is used in mode loss.
Model set 3 is an end-to-end text-image translation model incorporating contrast loss within the text-image modality. Specifically, the pre-processing module uses a residual convolutional network (ResNet); the sequence characteristic encoder module uses an encoder based on a self-attention mechanism; the decoder module uses a self-attention mechanism based decoder. The training loss function comprises text image translation loss, and only the intra-mode contrast loss of the text image is used in the mode loss.
Model set 4 is an end-to-end text image translation model incorporating inter-modal contrast loss of text images and text. In particular, the pre-processing module uses a residual convolutional network (ResNet); the sequence characteristic encoder module uses an encoder based on a self-attention mechanism; the decoder module uses a self-attention mechanism based decoder. The training loss function comprises text image translation loss, and only the inter-modal contrast loss of the text image and the text is used in modal loss.
Model set 5 is an end-to-end text image translation model incorporating the full modal contrast loss. Specifically, the pre-processing module uses a residual convolutional network (ResNet); the sequence characteristic encoder module uses an encoder based on a self-attention mechanism; the decoder module uses a decoder based on a self-attention mechanism. The training loss function includes a text image translation loss, an intra-modal contrast loss using the text image in the modal loss, a text intra-modal contrast loss, and an inter-modal contrast loss between the text image and the text.
According to the model setting, the model setting 1 is an end-to-end text image translation model without introducing modal contrast learning, the model settings 2 and 3 introduce intra-modal contrast loss, the model setting 4 introduces inter-modal contrast loss, and the model setting 5 simultaneously introduces intra-modal and inter-modal contrast loss. The specific verification results are shown in table 1.
Table 1: results of the experiment
(Table 1 body: BLEU scores for model settings 1 to 5; rendered as an image in the original publication.)
Table 1 shows the experimental results, where the metric is the BLEU score between the machine translation and the reference translation (higher is better). The following conclusions can be drawn from Table 1. (1) Model settings 2-5 all improve over model setting 1, showing that incorporating modality contrast learning can improve the effect of end-to-end text image translation. (2) The translation effect of model setting 3 is better than that of model setting 2, indicating that the performance gain from the intra-modal contrast loss between text images is larger, so feature learning for text images should be strengthened during model training. (3) Model setting 4 performs better than model setting 3, indicating that by introducing the inter-modal contrast loss, which constrains feature learning between text images and texts through contrast, the end-to-end text image translation model can learn better translation knowledge. (4) Model setting 5 has a better translation effect than model setting 4, indicating that the intra-modal contrast loss and the inter-modal contrast loss are complementary in improving the effect of end-to-end text image translation.
It should be noted that the way of incorporating modality contrast learning is not limited to the combination of training loss functions mentioned in this example; the intra-modal contrast loss and the inter-modal contrast loss may be weighted with different parameters to better incorporate modality contrast information and enhance the performance of the corresponding task.
The above description is only an embodiment of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can understand that the modifications or substitutions within the technical scope of the present invention are included in the scope of the present invention, and therefore, the scope of the present invention should be subject to the protection scope of the claims.
Compared with existing methods, the end-to-end text image translation model training method and device provided in the present disclosure can efficiently exploit the mutual contrast relations among text images, among texts, and between text images and texts, and promote the training of the end-to-end text image translation model by pulling together the feature representations of text images and texts with similar semantics and pushing apart the feature representations of dissimilar text images and texts. The verification results show that integrating modality contrast learning into the training process of the end-to-end text image translation model can effectively improve the effect of end-to-end text image translation. In addition, modality contrast learning only requires adding the corresponding contrast loss computations during training and does not increase the complexity of end-to-end text image translation decoding, so the advantages of the end-to-end text image translation model, namely efficient deployment and testing and a light model structure, are retained.
According to yet another aspect of the present disclosure, there is provided an electronic device including:
a memory storing execution instructions;
a processor executing execution instructions stored by the memory to cause the processor to perform the method of any of the above.
According to yet another aspect of the present disclosure, there is provided a readable storage medium having stored therein execution instructions for implementing the method of any one of the above when executed by a processor.
Fig. 6 shows an exemplary diagram of an apparatus employing a hardware implementation of a processing system. The apparatus may include corresponding means for performing each or several of the steps of the flowcharts described above. Thus, each step or several steps in the above-described flow charts may be performed by a respective module, and the apparatus may comprise one or more of these modules. The modules may be one or more hardware modules specifically configured to perform the respective steps, or implemented by a processor configured to perform the respective steps, or stored within a computer-readable medium for implementation by a processor, or by some combination.
The hardware architecture may be implemented using a bus architecture. The bus architecture may include any number of interconnecting buses and bridges depending on the specific application of the hardware and the overall design constraints. The bus 1100 couples various circuits including the one or more processors 1200, the memory 1300, and/or the hardware modules together. The bus 1100 may also connect various other circuits 1400, such as peripherals, voltage regulators, power management circuits, external antennas, and the like.
The bus 1100 may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one connection line is shown in the figure, but this does not mean that there is only one bus or only one type of bus.
Any process or method description in the flowcharts, or otherwise described herein, may be understood as representing a module, segment, or portion of code that includes one or more executable instructions for implementing specific logical functions or steps of the process. The scope of the preferred embodiments of the present disclosure also includes implementations in which functions are executed out of the order shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those skilled in the art. The processor performs the various methods and processes described above. For example, the method embodiments in the present disclosure may be implemented as a software program tangibly embodied in a machine-readable medium, such as a memory. In some embodiments, part or all of the software program may be loaded and/or installed via the memory and/or a communication interface. When the software program is loaded into the memory and executed by the processor, one or more steps of the methods described above may be performed. Alternatively, in other embodiments, the processor may be configured to perform one of the methods described above in any other suitable manner (e.g., by means of firmware).
The logic and/or steps represented in the flowcharts or otherwise described herein may be embodied in any readable storage medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions.
For the purposes of this description, a "readable storage medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection (an electronic device) having one or more wires, a portable computer diskette (a magnetic device), a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CD-ROM). The readable storage medium may even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, for instance via optical scanning of the paper or other medium, and then compiled, interpreted, or otherwise processed in a suitable manner if necessary, and stored in a memory.
It should be understood that portions of the present disclosure may be implemented in hardware, software, or a combination thereof. In the above embodiments, various steps or methods may be implemented in software stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps of the methods of the above embodiments may be implemented by a program instructing the relevant hardware; the program may be stored in a readable storage medium and, when executed, performs one of or a combination of the steps of the method embodiments.
In addition, each functional unit in the embodiments of the present disclosure may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a separate product, may also be stored in a readable storage medium. The storage medium may be a read-only memory, a magnetic or optical disk, or the like.
In the description of the present specification, reference to "one embodiment/implementation", "some embodiments/implementations", "an example", "a specific example", or "some examples" means that a particular feature, structure, material, or characteristic described in connection with the embodiment/implementation or example is included in at least one embodiment/implementation or example of the present application. In this specification, such schematic expressions do not necessarily refer to the same embodiment/implementation or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments/implementations or examples. In addition, those skilled in the art may combine the different embodiments/implementations or examples and the features of the different embodiments/implementations or examples described in this specification, provided that they do not conflict with each other.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present application, "plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
It will be understood by those skilled in the art that the foregoing embodiments are merely for clarity of illustration of the disclosure and are not intended to limit the scope of the disclosure. Other variations or modifications may occur to those skilled in the art, based on the foregoing disclosure, and are still within the scope of the present disclosure.

Claims (8)

1. An end-to-end text-image translation model training method is characterized by comprising the following steps:
constructing, for a given source language text, an image containing the source language text, and constructing a source language text corresponding in semantics to the given source language text;
preprocessing an image containing a source language text to obtain a sub-image containing the source language text, preprocessing the source language text corresponding to the image containing the source language text, and obtaining a preprocessed text character string;
encoding the subimage containing the source language text by an image encoder to obtain image characteristics, and encoding the preprocessed text character string by a text encoder to obtain text characteristics;
respectively coding the image features and the text features through an image sequence feature coder and a text sequence feature coder to obtain image sequence features corresponding to the image features and text sequence features corresponding to the text features;
calculating different loss values based on the image sequence characteristics and the text sequence characteristics;
constructing a loss function based on the different loss values; and
updating parameters of a training model when training by the training model based on the loss function;
wherein calculating different loss values based on the image sequence features and the text sequence features comprises:
calculating the contrast loss between images, the contrast loss between texts and the contrast loss between images and texts based on the image sequence characteristics and the text sequence characteristics;
wherein encoding the image features and the text features respectively by the image sequence feature encoder and the text sequence feature encoder to obtain the image sequence features corresponding to the image features and the text sequence features corresponding to the text features comprises:
judging whether the image sequence feature encoder and the text sequence feature encoder are the same sequence feature encoder;
if the image sequence feature encoder and the text sequence feature encoder are the same sequence feature encoder, processing the image features and the text features through feature transformation to enable hidden layer dimensions of the image features and hidden layer dimensions of the text features to be consistent; and
if the image sequence feature encoder and the text sequence feature encoder are different sequence feature encoders, the image features and the text features are respectively encoded through the image sequence feature encoder and the text sequence feature encoder, and feature transformation processing is performed on the encoded image features and the encoded text features so that hidden layer dimensions of the image features and hidden layer dimensions of the text features are kept consistent;
wherein the contrast loss between the image and the text is calculated based on the following method:
let $\mathbf{h}_i^{I}$ denote the image feature of the i-th text image containing source language text, $\mathbf{h}_i^{T}$ denote the text feature of the source language text contained in the i-th source language text image, and $\mathbf{h}_j^{I}$ ($j \neq i$) denote the image features of the other text images; the inter-modal loss between the image and the text is calculated as
$$\mathcal{L}_{\mathrm{inter}} = -\log \frac{\exp\left(d(\mathbf{h}_i^{T}, \mathbf{h}_i^{I})/\tau\right)}{\exp\left(d(\mathbf{h}_i^{T}, \mathbf{h}_i^{I})/\tau\right) + \sum_{j=1}^{K} \exp\left(d(\mathbf{h}_i^{T}, \mathbf{h}_j^{I})/\tau\right)}$$
where $K$ is the size of the data pool sampled for negative samples, $\tau$ is a temperature hyper-parameter, and $d(\cdot,\cdot)$ is a similarity function, which may be either cosine similarity or Euclidean-distance similarity.
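Purely as an illustration of an inter-modal contrast loss of this form (a minimal sketch, assuming PyTorch, cosine similarity as d, and in-batch negatives so that K equals the batch size minus one; all tensor names are illustrative and not taken from the patent):

import torch
import torch.nn.functional as F

def inter_modal_contrast_loss(image_feats, text_feats, tau=0.07):
    # image_feats, text_feats: (batch, dim) pooled sequence features of the text
    # images and of the corresponding source language texts; row i of each forms
    # a positive pair, and the other rows serve as the K negative samples.
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    logits = text_feats @ image_feats.t() / tau        # cosine similarities / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    # Cross-entropy over each row reproduces -log(exp(positive) / sum(exp(all))).
    return F.cross_entropy(logits, targets)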
2. The method for training an end-to-end text image translation model according to claim 1, wherein preprocessing an image containing a source language text to obtain a sub-image containing the source language text comprises:
carrying out size adjustment on an image containing a source language text by an image scaling method;
obtaining the position of an area where a source language text in the image is located by a text detection method, and carrying out image segmentation on the area where the source language text is located to obtain sub-images; and
and rearranging the texts in the sub-images according to a preset direction.
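As a non-limiting sketch of this kind of preprocessing (assuming OpenCV; detect_text_boxes is a hypothetical text-detection helper and the target width is an arbitrary illustrative value):

import cv2

def preprocess_text_image(image_path, detect_text_boxes, target_width=800):
    # Resize the full image, crop every detected source-language text region,
    # and rearrange vertical text regions into a preset horizontal direction.
    img = cv2.imread(image_path)
    scale = target_width / img.shape[1]
    img = cv2.resize(img, None, fx=scale, fy=scale)
    sub_images = []
    for (x, y, w, h) in detect_text_boxes(img):        # hypothetical detector
        crop = img[y:y + h, x:x + w]
        if h > w:                                      # vertical text line
            crop = cv2.rotate(crop, cv2.ROTATE_90_COUNTERCLOCKWISE)
        sub_images.append(crop)
    return sub_images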
3. The method for training an end-to-end text image translation model according to claim 1, wherein preprocessing the source language text corresponding to the image containing the source language text to obtain a preprocessed text string comprises:
standardizing punctuation marks contained in the source language text corresponding to the image containing the source language text;
segmenting the source language text corresponding to the image containing the source language text; and
judging, for each word obtained by the word segmentation processing, whether the word is an unknown word, and replacing the unknown word with a marker symbol if it is;
wherein unknown words are words that appear in the source language text but cannot be matched to any word in the standard vocabulary library.
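An illustrative sketch of such text preprocessing (not the patent's actual implementation; the punctuation table, the whitespace tokenizer, and the <unk> marker symbol are assumptions):

def preprocess_source_text(text, vocab, tokenize=str.split, unk_token="<unk>"):
    # Normalize full-width punctuation to its standard counterpart (illustrative table).
    punct_map = {"，": ",", "。": ".", "！": "!", "？": "?", "：": ":", "；": ";"}
    for src, tgt in punct_map.items():
        text = text.replace(src, tgt)
    # Segment the text into words, then replace words missing from the standard
    # vocabulary library (unknown words) with the marker symbol.
    tokens = [tok if tok in vocab else unk_token for tok in tokenize(text)]
    return " ".join(tokens)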
4. The method for training the end-to-end text-to-image translation model according to claim 1, wherein calculating different loss values based on the image sequence features and the text sequence features comprises:
decoding through a decoder based on the image sequence characteristics and the text sequence characteristics to obtain a corresponding decoded target language, and calculating end-to-end text image translation loss and end-to-end text translation loss based on the decoded target language;
wherein the end-to-end text image translation loss is calculated based on the target language obtained by decoding the image sequence features and a target language standard answer, and the end-to-end text translation loss is calculated based on the decoding result of the text sequence features and the target language standard answer.
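Both translation losses are ordinary sequence cross-entropy losses against the target language reference; a minimal sketch, assuming PyTorch and illustrative tensor shapes, might look like this:

import torch.nn.functional as F

def translation_losses(image_logits, text_logits, target_ids, pad_id=0):
    # image_logits: decoder outputs conditioned on the image sequence features,
    # text_logits: decoder outputs conditioned on the text sequence features,
    # both of shape (batch, length, vocab); target_ids: (batch, length) reference.
    loss_image_translation = F.cross_entropy(
        image_logits.transpose(1, 2), target_ids, ignore_index=pad_id)
    loss_text_translation = F.cross_entropy(
        text_logits.transpose(1, 2), target_ids, ignore_index=pad_id)
    return loss_image_translation, loss_text_translation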
5. The method for training an end-to-end text-image translation model according to claim 1, wherein constructing a loss function based on the different loss values comprises:
and constructing the loss function by a weighted summation method based on the different loss values.
6. An end-to-end text image translation model training device, comprising:
the preprocessing module is used for preprocessing an image containing a source language text to obtain a sub-image containing the source language text, preprocessing the source language text corresponding to the image containing the source language text and obtaining a preprocessed text character string;
the characteristic acquisition module is used for encoding the subimage containing the source language text through an image encoder to acquire image characteristics, and encoding the preprocessed text character string through a text encoder to acquire text characteristics;
the sequence feature coding module is used for coding the image features and the text features respectively through an image sequence feature coder and a text sequence feature coder to obtain image sequence features corresponding to the image features and text sequence features corresponding to the text features;
the loss calculation module is used for calculating different loss values based on the image sequence characteristics and the text sequence characteristics;
a loss function constructing module, which constructs a loss function based on the different loss values; and
the training module is used for updating parameters of the training model when training is carried out through the training model based on the loss function;
wherein calculating different loss values based on the image sequence features and the text sequence features comprises:
calculating the contrast loss between images, the contrast loss between texts and the contrast loss between images and texts based on the image sequence characteristics and the text sequence characteristics;
wherein encoding the image features and the text features respectively by the image sequence feature encoder and the text sequence feature encoder to obtain the image sequence features corresponding to the image features and the text sequence features corresponding to the text features comprises:
judging whether the image sequence feature encoder and the text sequence feature encoder are the same sequence feature encoder;
if the image sequence feature encoder and the text sequence feature encoder are the same sequence feature encoder, processing the image features and the text features through feature transformation to enable hidden layer dimensions of the image features and hidden layer dimensions of the text features to be consistent; and
if the image sequence feature encoder and the text sequence feature encoder are different sequence feature encoders, the image features and the text features are encoded respectively by the image sequence feature encoder and the text sequence feature encoder, and the encoded image features and the encoded text features are subjected to feature transformation processing, so that hidden layer dimensions of the image features and the hidden layer dimensions of the text features are kept consistent;
wherein the contrast loss between the image and the text is calculated based on the following method:
let $\mathbf{h}_i^{I}$ denote the image feature of the i-th text image containing source language text, $\mathbf{h}_i^{T}$ denote the text feature of the source language text contained in the i-th source language text image, and $\mathbf{h}_j^{I}$ ($j \neq i$) denote the image features of the other text images; the inter-modal loss between the image and the text is calculated as
$$\mathcal{L}_{\mathrm{inter}} = -\log \frac{\exp\left(d(\mathbf{h}_i^{T}, \mathbf{h}_i^{I})/\tau\right)}{\exp\left(d(\mathbf{h}_i^{T}, \mathbf{h}_i^{I})/\tau\right) + \sum_{j=1}^{K} \exp\left(d(\mathbf{h}_i^{T}, \mathbf{h}_j^{I})/\tau\right)}$$
where $K$ is the size of the data pool sampled for negative samples, $\tau$ is a temperature hyper-parameter, and $d(\cdot,\cdot)$ is a similarity function, which may be either cosine similarity or Euclidean-distance similarity.
7. An electronic device, comprising:
a memory storing execution instructions; and
a processor executing execution instructions stored by the memory to cause the processor to perform the end-to-end text image translation model training method of any of claims 1 to 5.
8. A readable storage medium, wherein the readable storage medium stores therein execution instructions, and the execution instructions are executed by a processor to implement the end-to-end text-image translation model training method according to any one of claims 1 to 5.
CN202210193873.3A 2022-01-29 2022-03-01 End-to-end text image translation model training method Active CN114626392B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210113371 2022-01-29
CN2022101133715 2022-01-29

Publications (2)

Publication Number Publication Date
CN114626392A CN114626392A (en) 2022-06-14
CN114626392B true CN114626392B (en) 2023-02-21

Family

ID=81900728

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210193873.3A Active CN114626392B (en) 2022-01-29 2022-03-01 End-to-end text image translation model training method

Country Status (1)

Country Link
CN (1) CN114626392B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116451808B (en) * 2023-04-23 2024-02-13 之江实验室 Model training method and device, storage medium and electronic equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112464993A (en) * 2020-11-05 2021-03-09 苏州浪潮智能科技有限公司 Multi-mode model training method, device, equipment and storage medium
CN112668346A (en) * 2020-12-24 2021-04-16 科大讯飞股份有限公司 Translation method, device, equipment and storage medium
CN113011202A (en) * 2021-03-23 2021-06-22 中国科学院自动化研究所 End-to-end image text translation method, system and device based on multi-task training

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109800294B (en) * 2019-01-08 2020-10-13 中国科学院自动化研究所 Autonomous evolution intelligent dialogue method, system and device based on physical environment game


Also Published As

Publication number Publication date
CN114626392A (en) 2022-06-14

Similar Documents

Publication Publication Date Title
CN113297975B (en) Table structure identification method and device, storage medium and electronic equipment
CN110188202B (en) Training method and device of semantic relation recognition model and terminal
KR20220050758A (en) Multi-directional scene text recognition method and system based on multidimensional attention mechanism
CN113343982B (en) Entity relation extraction method, device and equipment for multi-modal feature fusion
CN111382271B (en) Training method and device of text classification model, text classification method and device
CN113221879A (en) Text recognition and model training method, device, equipment and storage medium
CN112733768A (en) Natural scene text recognition method and device based on bidirectional characteristic language model
CN114626392B (en) End-to-end text image translation model training method
CN111985525A (en) Text recognition method based on multi-mode information fusion processing
CN114596566A (en) Text recognition method and related device
CN113657115A (en) Multi-modal Mongolian emotion analysis method based on ironic recognition and fine-grained feature fusion
CN116030018A (en) Incoming material qualification inspection system and method for door processing
CN112507111A (en) Model establishing method for generative automatic abstract generation and abstract generating method
CN112364166A (en) Method for establishing relation extraction model and relation extraction method
WO2021179751A1 (en) Image processing method and system
CN116384401A (en) Named entity recognition method based on prompt learning
CN116311322A (en) Document layout element detection method, device, storage medium and equipment
CN113283241B (en) Text recognition method and device, electronic equipment and computer readable storage medium
CN114021586A (en) Training method of interactive voice translation model based on neural network model
CN112164078B (en) RGB-D multi-scale semantic segmentation method based on encoder-decoder
CN114241279A (en) Image-text combined error correction method and device, storage medium and computer equipment
Kawano et al. TAG: Guidance-free Open-Vocabulary Semantic Segmentation
CN117408259B (en) Information extraction method, device, computer equipment and storage medium
CN114049648B (en) Engineering drawing text detection and recognition method, device and system
CN114021587A (en) Method for constructing speech machine translation model based on statement level probability distribution

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant