CN115082937A - End-to-end text recognition model training method, text recognition method and text recognition device - Google Patents

End-to-end text recognition model training method, text recognition method and text recognition device

Info

Publication number
CN115082937A
CN115082937A
Authority
CN
China
Prior art keywords
target
vector
line image
character
feature vector
Prior art date
Legal status
Pending
Application number
CN202210704167.0A
Other languages
Chinese (zh)
Inventor
张宇轩
林丽
黄灿
Current Assignee
Douyin Vision Beijing Co Ltd
Original Assignee
Douyin Vision Beijing Co Ltd
Priority date
Filing date
Publication date
Application filed by Douyin Vision Beijing Co Ltd filed Critical Douyin Vision Beijing Co Ltd
Priority to CN202210704167.0A priority Critical patent/CN115082937A/en
Publication of CN115082937A publication Critical patent/CN115082937A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00: Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10: Character recognition
    • G06V30/19: Recognition using electronic means
    • G06V30/191: Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/19147: Obtaining sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00: Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10: Character recognition
    • G06V30/18: Extraction of features or characteristics of the image

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Character Discrimination (AREA)

Abstract

The application discloses an end-to-end text recognition model training method and device. A target text line image is input into a feature extraction module to obtain a target input feature vector. The target input feature vector and a target character position vector are input into a feature encoder to obtain a first feature vector. A repeat operation is performed on the first feature vector to obtain a second feature vector. A target output feature vector is acquired based on the label corresponding to the target text line image. The second feature vector, the target output feature vector and the target character position vector are input into a feature decoder to obtain a prediction probability distribution result for the target text line image. A loss value is obtained from the label corresponding to the image and the prediction probability distribution result, and the model is trained based on the loss value. The target output feature vector is formed by splicing the real position vector and the real text content vector of each character in the image. The trained model can predict the character position and the text content of each character simultaneously, which improves the efficiency of text recognition.

Description

End-to-end text recognition model training method, text recognition method and text recognition device
Technical Field
The application relates to the technical field of image processing, in particular to an end-to-end text recognition model training method and device and an end-to-end text recognition method and device.
Background
Optical Character Recognition (OCR) technology includes text detection technology and text recognition technology. Text detection is used to locate character positions in an image, and text recognition is used to identify the text content in an image. Text recognition includes text line recognition. An end-to-end text recognition method realizes text detection and text recognition simultaneously through a single network structure.
At present, recognition of text line content and positioning of character positions can be realized simultaneously through an end-to-end Pix2seq method based on a Transformer model. However, the time complexity of the Pix2seq method is high, which makes it difficult to apply in real scenes.
Disclosure of Invention
In view of this, embodiments of the present application provide an end-to-end text recognition model training method and apparatus, and an end-to-end text recognition method and apparatus, which can reduce time complexity of text detection and text recognition, and improve efficiency of text detection and text recognition.
In order to solve the above problem, the technical solution provided by the embodiment of the present application is as follows:
in a first aspect, an embodiment of the present application provides an end-to-end text recognition model training method, where the method includes:
inputting a target text line image into a feature extraction module to obtain a target input feature vector;
acquiring a target character position vector corresponding to the target text line image, and inputting the target input feature vector and the target character position vector into a feature encoder to obtain a first feature vector;
acquiring a target output feature vector based on a label corresponding to the target text line image; the label corresponding to the target text line image comprises the real character position and the real text content of each character in the target text line image; the target output feature vector is formed by splicing the real position vector and the real text content vector corresponding to each character in the target text line image;
repeating the operation on the first feature vector to obtain a second feature vector; the dimension of the second feature vector is the same as the dimension of the target output feature vector;
inputting the second feature vector, the target output feature vector and the target character position vector into a feature decoder to obtain a prediction probability distribution result corresponding to the target text line image; the prediction probability distribution result comprises the prediction position probability distribution and the prediction text content probability distribution of each character in the target text line image;
obtaining a loss value according to the label corresponding to the target text line image and the prediction probability distribution result corresponding to the target text line image;
training the feature extraction module, the feature encoder and the feature decoder based on the loss value, repeatedly executing the step of inputting the target text line image into the feature extraction module, obtaining a target input feature vector and the subsequent steps until a preset condition is reached.
In a second aspect, an embodiment of the present application provides an end-to-end text recognition method, where the method includes:
acquiring a character position vector corresponding to an image to be recognized;
inputting the image to be recognized and the character position vector corresponding to the image to be recognized into an end-to-end text recognition model, and obtaining a probability distribution result of each character in the image to be recognized, which is output by the end-to-end text recognition model; the probability distribution result comprises the position probability distribution and the text content probability distribution of the characters;
acquiring a character detection result and a character recognition result of each character in the image to be recognized according to the probability distribution result of each character in the image to be recognized;
the end-to-end text recognition model is obtained by training according to any one of the end-to-end text recognition model training methods.
In a third aspect, an embodiment of the present application provides an end-to-end text recognition model training apparatus, where the apparatus includes:
the first acquisition unit is used for inputting the target text line image into the feature extraction module and acquiring a target input feature vector;
the second acquisition unit is used for acquiring a target character position vector corresponding to the target text line image, inputting the target input feature vector and the target character position vector into a feature encoder, and acquiring a first feature vector;
the third acquisition unit is used for acquiring a target output feature vector based on the label corresponding to the target text line image; the label corresponding to the target text line image comprises the real character position and the real text content of each character in the target text line image; the target output feature vector is formed by splicing the real position vector and the real text content vector corresponding to each character in the target text line image;
a fourth obtaining unit, configured to perform repeated operation on the first feature vector to obtain a second feature vector; the dimension of the second feature vector is the same as the dimension of the target output feature vector;
the input unit is used for inputting the second feature vector, the target output feature vector and the target character position vector into a feature decoder to obtain a prediction probability distribution result corresponding to the target text line image; the prediction probability distribution result comprises the prediction position probability distribution and the prediction text content probability distribution of each character in the target text line image;
a fifth obtaining unit, configured to obtain a loss value according to a label corresponding to the target text line image and a prediction probability distribution result corresponding to the target text line image;
and the training unit is used for training the feature extraction module, the feature encoder and the feature decoder based on the loss value, and repeatedly executing the step of inputting the target text line image into the feature extraction module to acquire a target input feature vector and the subsequent steps until a preset condition is reached.
In a fourth aspect, an embodiment of the present application provides an end-to-end text recognition apparatus, where the apparatus includes:
the first acquisition unit is used for acquiring a character position vector corresponding to an image to be recognized;
a second obtaining unit, configured to input the image to be recognized and the character position vector corresponding to the image to be recognized into an end-to-end text recognition model, and obtain a probability distribution result of each character in the image to be recognized, where the probability distribution result is output by the end-to-end text recognition model; the probability distribution result comprises the position probability distribution and the text content probability distribution of the characters;
a third obtaining unit, configured to obtain a character detection result and a character recognition result of each character in the image to be recognized according to a probability distribution result of each character in the image to be recognized;
the end-to-end text recognition model is obtained by training according to any one of the end-to-end text recognition model training methods.
In a fifth aspect, an embodiment of the present application provides an electronic device, including:
one or more processors;
a storage device having one or more programs stored thereon,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the end-to-end text recognition model training method as described above, or the end-to-end text recognition method as described above.
In a sixth aspect, an embodiment of the present application provides a computer-readable medium, on which a computer program is stored, where the program, when executed by a processor, implements the end-to-end text recognition model training method as described in any one of the above, or the end-to-end text recognition method as described above.
In a seventh aspect, an embodiment of the present application provides a computer program product, wherein the computer program product includes computer programs/instructions which, when executed by a processor, implement the end-to-end text recognition model training method described in any one of the above, or the end-to-end text recognition method described above.
Therefore, the embodiment of the application has the following beneficial effects:
the embodiment of the application provides an end-to-end text recognition model training method and device, wherein a target text line image is input into a feature extraction module to obtain a target input feature vector; acquiring a target output characteristic vector based on a label corresponding to the target text line image; and acquiring a target character position vector corresponding to the target text line image. And inputting the target input feature vector and the target character position vector into a feature encoder to obtain a first feature vector, and further performing repeated operation on the first feature vector to obtain a second feature vector. And inputting the second feature vector, the target output feature vector and the target character position vector into a feature decoder to obtain a prediction probability distribution result corresponding to the target text line image. Obtaining a loss value according to a label corresponding to the target text line image and a prediction probability distribution result corresponding to the target text line image, and training a feature extraction module, a feature encoder and a feature decoder based on the loss value; and repeatedly executing the training process until a preset condition is reached. The target output characteristic vector is formed by splicing a real position vector corresponding to each character in the target text line image and a real text content vector. The prediction probability distribution result output by the feature decoder comprises the prediction position probability distribution and the prediction text content probability distribution of each character in the target text line image. Therefore, the character position and the text content of each character in the text line can be simultaneously obtained in one decoding step length based on the trained end-to-end text recognition model, the complexity of text recognition is reduced, and the efficiency of text recognition is improved.
Drawings
Fig. 1 is a schematic diagram of an end-to-end Pix2seq method provided in an embodiment of the present application;
fig. 2 is a schematic diagram of a framework of an exemplary application scenario provided in an embodiment of the present application;
FIG. 3 is a flowchart of a method for training an end-to-end text recognition model according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of a target output feature vector according to an embodiment of the present disclosure;
FIG. 5 is a schematic structural diagram of an end-to-end text recognition model according to an embodiment of the present disclosure;
fig. 6 is a flowchart of an end-to-end text recognition method according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of an end-to-end text recognition model training apparatus according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of an end-to-end text recognition apparatus according to an embodiment of the present application;
fig. 9 is a schematic diagram of a basic structure of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present application more comprehensible, embodiments accompanying the drawings are described in detail below.
For the convenience of understanding and explaining the technical solutions provided in the embodiments of the present application, the background art related to the embodiments of the present application will be described first.
Optical Character Recognition (OCR) technology includes text detection technology and text recognition technology. Text detection is used to locate character positions in an image, and text recognition is used to identify the text content in an image. Text recognition includes text line recognition. An end-to-end text recognition method realizes text detection and text recognition simultaneously through a single network structure.
At present, recognition of text line content and positioning of character positions can be realized simultaneously through an end-to-end Pix2seq method based on a Transformer model. For example, the character "需" can be represented by the five values (0, 1, 32, 30, \u9700), where (0, 1, 32, 30) are the coordinate values of the upper left corner and the lower right corner of "需" in the text line image, and \u9700 indicates that the character is "需" (\u9700 is the Unicode code of "需"). The character "要" can be represented by the five values (32, 1, 63, 30, \u8981), where (32, 1, 63, 30) are the coordinate values of the upper left corner and the lower right corner of "要" in the text line image, and \u8981 indicates that the character is "要" (\u8981 is the Unicode code of "要").
The applicant's research has found that the time complexity of the Pix2seq method is high, which makes it difficult to apply in practical scenes. Specifically, referring to fig. 1, fig. 1 is a schematic diagram of an end-to-end Pix2seq method provided in an embodiment of the present application. As shown in fig. 1, the decoder side of the Transformer model includes a decoder, a Linear layer, and a Softmax layer. In the Pix2seq method, taking the character "需" as an example, for each decoding step of the decoder, one input of the decoder is only the vector corresponding to a single value of the character, for example the vector corresponding to "0". Thus, in the Pix2seq method, each step can only predict one value of a character, and 5 decoding steps are required to obtain the coordinates and text content of "需". Taking multiple characters such as "需" and "要" as an example, for each decoding step of the decoder, the input of the decoder is a vector obtained by splicing the vectors respectively corresponding to "0" and "32". When this vector is input into the decoder, only one value of each character can be predicted at a time, and 5 decoding steps are still needed to acquire the coordinates and text content of "需" and "要". Thus, the Pix2seq method has high time complexity and low efficiency.
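To make the five-value representation concrete, the following is a small sketch, not taken from the patent, of how a character and its box become five sequence values, and why a decoder that emits one value per step needs five steps per character; the token layout is an assumption based on the example above.

```python
# Minimal sketch (assumption, not the patent's code) of a Pix2seq-style token
# group: four coordinate values plus one content value per character.

def char_to_tokens(x1, y1, x2, y2, char):
    """Return the five-value token group for a single character."""
    return [x1, y1, x2, y2, ord(char)]  # ord() gives the Unicode code point

line = [(0, 1, 32, 30, "需"), (32, 1, 63, 30, "要")]
tokens = [t for c in line for t in char_to_tokens(*c)]
print(tokens)       # [0, 1, 32, 30, 38656, 32, 1, 63, 30, 35201]
print(len(tokens))  # 10 values: a decoder emitting one value per step still
                    # needs 5 steps to finish each character.
```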
Based on the above, the embodiment of the application provides an end-to-end text recognition model training method and device, wherein a target text line image is input into a feature extraction module to obtain a target input feature vector; a target output feature vector is acquired based on the label corresponding to the target text line image; and a target character position vector corresponding to the target text line image is acquired. The target input feature vector and the target character position vector are input into a feature encoder to obtain a first feature vector, and a repeat operation is then performed on the first feature vector to obtain a second feature vector. The second feature vector, the target output feature vector and the target character position vector are input into a feature decoder to obtain a prediction probability distribution result corresponding to the target text line image. A loss value is obtained according to the label corresponding to the target text line image and the prediction probability distribution result corresponding to the target text line image, and the feature extraction module, the feature encoder and the feature decoder are trained based on the loss value; the training process is repeatedly executed until a preset condition is reached. The target output feature vector is formed by splicing the real position vector and the real text content vector corresponding to each character in the target text line image. The prediction probability distribution result output by the feature decoder comprises the predicted position probability distribution and the predicted text content probability distribution of each character in the target text line image. Therefore, based on the trained end-to-end text recognition model, the character position and the text content of each character in a text line can be obtained simultaneously within one decoding step, which reduces the complexity of text recognition and improves its efficiency.
In order to facilitate understanding of the end-to-end text recognition model training method provided in the embodiment of the present application, the following description is made with reference to a scenario example shown in fig. 2. Referring to fig. 2, the drawing is a schematic diagram of a framework of an exemplary application scenario provided in an embodiment of the present application.
In practical application, the end-to-end text recognition model comprises a feature extraction module, a feature encoder and a feature decoder.
A target character position vector corresponding to the target text line image is acquired. The target text line image is input into the feature extraction module to obtain a target input feature vector. A target output feature vector is acquired based on the label corresponding to the target text line image. The label corresponding to the target text line image includes the real position and the real text content of each character in the target text line image. For example, the label value corresponding to one character in the target text line image is (0, 1, 32, 30, \u9700), where (0, 1, 32, 30) is the real position of the character and \u9700 is the real text content of the character.
And inputting the target input feature vector and the target character position vector into a feature encoder to obtain a first feature vector. And then, repeating the operation on the first feature vector to obtain a second feature vector.
And inputting the second feature vector, the target output feature vector and the target character position vector into a feature decoder to obtain a prediction probability distribution result corresponding to the target text line image. And obtaining a loss value according to the label corresponding to the target text line image and the prediction probability distribution result corresponding to the target text line image, and training a feature extraction module, a feature encoder and a feature decoder based on the loss value. And repeatedly executing the training process until a preset condition is reached, and acquiring a trained end-to-end text recognition model.
Those skilled in the art will appreciate that the frame diagram shown in fig. 2 is only one example in which embodiments of the present application may be implemented. The scope of applicability of the embodiments of the present application is not limited in any way by this framework.
For the convenience of understanding, an end-to-end text recognition model training method provided by the embodiments of the present application is described below with reference to the accompanying drawings.
Referring to fig. 3, which is a flowchart of an end-to-end text recognition model training method provided in an embodiment of the present application, as shown in fig. 3, the method may include S301 to S307:
s301: and inputting the target text line image into a feature extraction module to obtain a target input feature vector.
In one or more embodiments, the end-to-end text recognition model of the embodiments of the present application is implemented by a Transformer model. The end-to-end text recognition model includes at least a feature extraction module, a feature encoder, and a feature decoder. Based on the end-to-end text recognition model, the text detection and the text recognition processes can be simultaneously realized, namely the character position and the text content of each character in the text line image can be simultaneously obtained.
It is to be understood that text recognition includes single character recognition and text line recognition, and embodiments of the present application relate to text line recognition.
A target text line image is acquired; the target text line image is an image used for training the end-to-end text recognition model. Before the end-to-end text recognition model is trained, the label corresponding to the target text line image is already known. In the specific model training process, the end-to-end text recognition model is trained using the target text line image together with the label corresponding to the target text line image. The target text line image comprises at least one text line image.
By way of example, the target text line image is a text line image of the six characters "需要多少钱？" ("How much money is needed?"). The label corresponding to the target text line image is composed of the label value corresponding to each character in the target text line image. The label value corresponding to a character consists of the real character position and the real text content of the character, where the real position is represented by real coordinate values. For example, the label value corresponding to "需" is (0, 1, 32, 30, \u9700), the label value corresponding to "要" is (32, 1, 63, 30, \u8981), the label value corresponding to "多" is (65, 0, 93, 1, \u591A), the label value corresponding to "少" is (94, 0, 127, 30, \u5C11), the label value corresponding to "钱" is (129, 0, 155, 31, \u94B1), and the label value corresponding to "？" is (157, 0, 173, 30, \uFF1F).
The first four values in the label value of each character are the real coordinate values of the character in the target text line image. The first and second values are the coordinates of the upper left corner of the character in the target text line image, the third and fourth values are the coordinates of the lower right corner of the character in the target text line image, and the fifth value is the real text content of the character. For example, in (0, 1, 32, 30, \u9700), (0, 1) represents the upper left corner coordinates of the character "需" in the target text line image, and (32, 30) represents its lower right corner coordinates.
Based on this, the label corresponding to the target text line image includes the real character position and the real text content of each character in the target text line image.
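For reference, a hedged sketch of such a label in code form follows; the container format is an assumption, while the coordinate values and characters are taken from the example above.

```python
# Illustrative label for the target text line image "需要多少钱？":
# each entry is (x1, y1, x2, y2, character), i.e. the real character position
# (upper-left and lower-right corners) plus the real text content.
label = [
    (0,   1, 32,  30, "需"),
    (32,  1, 63,  30, "要"),
    (65,  0, 93,  1,  "多"),
    (94,  0, 127, 30, "少"),
    (129, 0, 155, 31, "钱"),
    (157, 0, 173, 30, "？"),
]
```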
When the method is specifically implemented, the target text line image is input into the feature extraction module, and a target input feature vector is obtained. The feature extraction module is used for extracting visual features of the target text line image and expressing the visual features in a vector form. In one or more embodiments, the feature extraction module may be implemented by a convolutional neural network.
In a possible implementation manner, an embodiment of the present application provides a specific implementation manner of obtaining a target input feature vector by using a target text line image input feature extraction module, including:
acquiring a target text line image, and performing scaling operation and/or filling operation on the target text line image to obtain a preprocessed target text line image;
and inputting the preprocessed target text line image into a feature extraction module to obtain a target input feature vector.
The preprocessing operations include scaling operations and/or padding operations. It is to be understood that when the target text line image is a plurality of text line images, the lengths may differ between the different text line images. The length of the plurality of text line images can be made the same by performing the padding operation on the short text line images. In order to make the target text line image conform to the input dimension of the feature extraction module, dimension scaling operation is also required to be performed on the target text line image. In this way, the dimension of the target text line image after preprocessing can satisfy the input dimension of the feature extraction module.
In practical applications, when the target text line image includes multiple text line images, the batch of text line images can be processed simultaneously. In this case, the dimension of the target input feature vector is [Batch_size, Length, hidden_dim], where Batch_size is the number of text line images processed in a batch, Length represents the length of the text sequence (for example, when the text line image is the six-character line "需要多少钱？", Length is 6), and hidden_dim represents the hidden neuron dimension, e.g., hidden_dim can be 128.
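The following is an illustrative sketch, not the patent's exact backbone: it scales and pads line images to a common size, then uses a small convolutional network with adaptive pooling so the output has the stated shape [Batch_size, Length, hidden_dim]; the layer sizes, target width, and pooling to the character count are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureExtractor(nn.Module):
    """Assumed CNN backbone producing [Batch_size, Length, hidden_dim]."""
    def __init__(self, hidden_dim=128, seq_len=6):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, hidden_dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.pool = nn.AdaptiveAvgPool2d((1, seq_len))   # collapse to [1, Length]

    def forward(self, images):                  # images: [B, 3, H, W]
        feats = self.pool(self.conv(images))    # [B, hidden_dim, 1, Length]
        return feats.squeeze(2).transpose(1, 2)  # [B, Length, hidden_dim]

def preprocess(img, height=32, width=256):
    """Scale a line image to a fixed height, then pad/crop to a common width."""
    new_w = min(width, int(img.shape[-1] * height / img.shape[-2]))
    img = F.interpolate(img[None], size=(height, new_w),
                        mode="bilinear", align_corners=False)[0]
    pad = width - img.shape[-1]
    return F.pad(img, (0, pad)) if pad > 0 else img[..., :width]

batch = torch.stack([preprocess(torch.rand(3, 40, 180)),
                     preprocess(torch.rand(3, 48, 300))])
print(FeatureExtractor()(batch).shape)           # torch.Size([2, 6, 128])
```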
S302: and acquiring a target character position vector corresponding to the target text line image, and inputting the target input feature vector and the target character position vector into a feature encoder to obtain a first feature vector.
When the end-to-end text recognition model is implemented through the Transformer model, since the Transformer model adopts global information and cannot directly adopt sequence information of characters in a text line, a target character position vector corresponding to an image of the target text line needs to be input into the feature encoder. The target character position vector is used to represent the position where the character appears in the text line, either an absolute position or a relative position.
In a possible implementation manner, an embodiment of the present application provides a specific implementation manner for obtaining a target character position vector corresponding to a target text line image, including:
and acquiring a target character position vector corresponding to the target text line image based on an absolute position coding algorithm or a relative position coding algorithm.
It is to be understood that, in addition, a specific position encoding algorithm may be determined according to practical situations, and the embodiment of the present application does not limit this.
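As one concrete possibility, a standard sinusoidal absolute position encoding could produce the target character position vector; the patent leaves the specific position coding algorithm open, so this particular choice is an assumption.

```python
import math
import torch

def sinusoidal_position_encoding(length, hidden_dim=128):
    """Return a [length, hidden_dim] absolute position encoding matrix."""
    position = torch.arange(length, dtype=torch.float32).unsqueeze(1)       # [L, 1]
    div_term = torch.exp(torch.arange(0, hidden_dim, 2, dtype=torch.float32)
                         * (-math.log(10000.0) / hidden_dim))               # [H/2]
    pe = torch.zeros(length, hidden_dim)
    pe[:, 0::2] = torch.sin(position * div_term)   # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)   # odd dimensions
    return pe

pos = sinusoidal_position_encoding(length=6)       # one vector per character position
print(pos.shape)                                   # torch.Size([6, 128])
```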
In one or more embodiments, after a target character position vector corresponding to a target text line image is obtained, a target input feature vector and the target character position vector are spliced, and then the spliced vector is input into a feature encoder. And inputting the spliced vectors into a feature encoder to obtain first feature vectors. The feature encoder is used for further extracting semantic features of the spliced vectors so as to learn the internal relation of the target text line images.
The dimension of the target character position vector is [Batch_size, Length, hidden_dim], so the target input feature vector and the target character position vector can be spliced. It will be appreciated that the dimension of the first feature vector is also [Batch_size, Length, hidden_dim].
In one or more embodiments, the number of feature encoders is at least one, which is not limited herein and can be set according to practical situations. When the feature encoder is plural, the plural feature encoders are connected in series.
In a possible implementation manner, the present application provides a specific implementation manner of inputting a target input feature vector and a target character position vector into a feature encoder to obtain a first feature vector, which is described in detail in C1-C2 below.
S303: acquiring a target output feature vector based on a label corresponding to the target text line image; the label corresponding to the target text line image comprises the real character position and the real text content of each character in the target text line image; the target output feature vector is formed by splicing the real position vector and the real text content vector corresponding to each character in the target text line image.
Since the feature decoder can only process numerical values, in order to enable the feature decoder to process the label corresponding to the target text line image, the target output feature vector needs to be obtained based on the label corresponding to the target text line image. The target output feature vector is used for representing a label corresponding to the target text line image.
The label corresponding to the target text line image comprises the real character position and the real text content of each character in the target text line image; correspondingly, the obtained target output feature vector is formed by splicing the real position vector and the real text content vector corresponding to each character in the target text line image.
In a possible implementation manner, an embodiment of the present application provides a specific implementation manner for obtaining a target output feature vector based on a tag corresponding to a target text line image, including:
a1: and constructing a target dictionary, converting the labels corresponding to the target text line images into corresponding numerical values in the target dictionary based on the target dictionary, and acquiring numerical value vectors corresponding to the labels.
In the embodiment of the present application, the target dictionary is a dictionary shared by the character positions and the text contents of characters, and is also referred to as a token dictionary, where one token is one marker. In the target dictionary, [0, num_bin] indicates the different coordinate values and is mainly used for the character prediction task. [num_bin+1, num_token-2] represents the different character text contents for the character recognition task. [num_token-2, num_token] includes the "Start" token, the "End" token and the "PAD" token. The "Start" token is a start marker indicating the start of model prediction. The "End" token is an end marker indicating the end of model prediction. The "PAD" token is the label of each padding character in the padding operation; for example, when a filled-in character is an empty character, the "PAD" token is the label of the empty character.
In one or more embodiments, the values of num_bin and num_token may be chosen based on the actual task. For in-line Chinese and English text recognition and character positioning tasks, num_bin can be taken as 512 and num_token as 11049, using a custom character set containing common Chinese and English characters. That is, [0, 512] represents the different coordinate values and [513, 11047] represents the different character text contents. For example, the dictionary value "0" represents the coordinate value 0, and the dictionary value "1" represents the coordinate value 1. The dictionary value "4999" represents the text content \u9700, i.e., the character "需".
After the target dictionary is built, the labels corresponding to the target text line images are converted into corresponding dictionary numerical values in the target dictionary based on the target dictionary, and then numerical value vectors corresponding to the labels are obtained.
For example, if the label value corresponding to the character "需" in the target text line image is (0, 1, 32, 30, \u9700), the values in the target dictionary corresponding to the five values in (0, 1, 32, 30, \u9700), i.e., 0, 1, 32, 30, 4999, can be looked up according to the target dictionary. The remaining characters are handled similarly; thus, the label corresponding to the target text line image can be converted into the corresponding values in the target dictionary based on the target dictionary. That is, the label value corresponding to each character in the target text line image is converted into dictionary numerical values in the target dictionary. After the corresponding dictionary numerical values are obtained, the vector formed by these dictionary numerical values is the numerical value vector corresponding to the label.
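A minimal sketch of such a shared token dictionary and the label-to-value conversion follows; the character-to-id assignment shown is a stand-in for illustration, not the patent's actual mapping.

```python
# Hedged sketch of the shared token dictionary (num_bin = 512): ids [0, 512]
# stand for coordinate values, ids from 513 upward stand for character content,
# and the last ids are the Start/End/PAD markers.
NUM_BIN = 512
charset = ["需", "要", "多", "少", "钱", "？"]          # stand-in for the real character set
char_to_id = {c: NUM_BIN + 1 + i for i, c in enumerate(charset)}
START_ID = NUM_BIN + 1 + len(charset)
END_ID = START_ID + 1
PAD_ID = END_ID + 1

def label_to_token_ids(label):
    """Convert (x1, y1, x2, y2, char) tuples into dictionary values."""
    ids = []
    for x1, y1, x2, y2, ch in label:
        ids.append([x1, y1, x2, y2, char_to_id[ch]])   # coordinates map to ids directly
    return ids

print(label_to_token_ids([(0, 1, 32, 30, "需"), (32, 1, 63, 30, "要")]))
# [[0, 1, 32, 30, 513], [32, 1, 63, 30, 514]]
```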
A2: and converting the numerical value vector corresponding to the label into a vector of a high-dimensional space to obtain a target output characteristic vector.
After the numerical vector corresponding to the label is obtained, the vector is mapped to a high-dimensional space, that is, the numerical vector corresponding to the label is converted into a vector of the high-dimensional space, so as to obtain a target output feature vector. For example, each 1-dimensional value in the value vector corresponding to the tag may be converted to a representation in the dimension of hidden _ dim, which may be 128-dimensional. For example, each dictionary value in 0,1,32,30,4999 is represented by a 128-dimensional vector.
In one or more embodiments, the numerical value vector corresponding to the label may be converted into a vector of the high-dimensional space by an embedding layer.
It can be understood that, since the label corresponding to the target text line image includes the real character position and the real text content of each character in the target text line image, the numerical values in the obtained numerical value vector also represent the real character position and the real text content corresponding to each character in the target text line image. Therefore, the target output feature vector obtained based on the numerical value vector is formed by splicing the real position vector and the real text content vector corresponding to each character in the target text line image.
The real position vector of a character is obtained from the dictionary values representing the real character position of the character; similarly, the real text content vector of the character is obtained from the dictionary value representing the real text content of the character.
The dimension of the target output feature vector is [Batch_size, Length, 5*hidden_dim]. In addition, the number of channels of the target output feature vector is 3. Length represents the length of the text sequence, and each row of the target output feature vector represents a character. 5*hidden_dim means that the number of columns of the target output feature vector is 5*hidden_dim, i.e., 128 × 5, and each hidden_dim block in a row is used to represent one dictionary value of the character.
Referring to fig. 4, fig. 4 is a schematic diagram of a target output feature vector according to an embodiment of the present disclosure. As shown in fig. 4, taking the characters "需" and "要" in the target text line image as an example, the label values corresponding to the two characters are both converted into numerical value vectors, and the two numerical value vectors are then converted into corresponding high-dimensional vectors. The high-dimensional vectors corresponding to the two characters together form the target output feature vector, whose dimensions are [2, 5*128]. The first row of the target output feature vector represents "需" and the second row represents "要". In the first row, the first hidden_dim block of elements represents the "0" of "需", the second block represents the "1" of "需", the third block represents the "32" of "需", the fourth block represents the "30" of "需", and the fifth block represents the "\u9700" of "需". That is, the first four blocks represent the real character position of "需", and the fifth block represents its real text content. The character "要" is similar and is not described further herein.
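A minimal sketch of this conversion follows, assuming an embedding layer maps each dictionary value to a hidden_dim-dimensional vector and the five vectors per character are concatenated; the use of nn.Embedding and the illustrative ids are assumptions consistent with the description above.

```python
import torch
import torch.nn as nn

batch_size, length, hidden_dim, num_token = 1, 2, 128, 11049
token_ids = torch.tensor([[[0, 1, 32, 30, 4999],      # "需"
                           [32, 1, 63, 30, 4998]]])   # "要" (illustrative id)
embedding = nn.Embedding(num_token, hidden_dim)       # assumed embedding layer

embedded = embedding(token_ids)                        # [1, 2, 5, 128]
target_output = embedded.reshape(batch_size, length, 5 * hidden_dim)
print(target_output.shape)                             # torch.Size([1, 2, 640])
```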
S304: repeating the operation on the first feature vector to obtain a second feature vector; the dimension of the second feature vector is the same as the dimension of the target output feature vector.
After the first feature vector is obtained, its dimension is [Batch_size, Length, hidden_dim], while the dimension of the target output feature vector is [Batch_size, Length, 5*hidden_dim]. In order to enable the first feature vector to be input into the feature decoder, a repeat operation is performed on the first feature vector to obtain a second feature vector, and the dimension of the obtained second feature vector is the same as that of the target output feature vector.
In a possible implementation manner, an embodiment of the present application provides a specific implementation manner for performing a repeat operation on a first feature vector to obtain a second feature vector, including:
and setting a repetition frequency parameter in the repeated operation function, inputting the first feature vector into the set repeated operation function, and acquiring a second feature vector.
In one or more embodiments, the repeat operation function is a Repeat function, which returns a new object obtained by repeating the original a specified number of times. The repetition frequency parameter is the count parameter of the Repeat function. For example, the count parameter may take the value 5, i.e., the first feature vector is repeated 5 times to obtain the second feature vector. The dimension of the second feature vector is then [Batch_size, Length, 5*hidden_dim].
It can be understood that in the embodiment of the present application, the character position information and the text content information about the target text line image input by the feature encoder may be shared, and the extracted first feature vector is a common feature, that is, the first feature vector may be used to train both the character position of the character and the text content of the character. In this way, the obtained second feature vector is considered to contain both the character position information of the target text line image and the text content information of the target text line image.
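For illustration, the repeat operation can be expressed in a single tensor call; using torch.Tensor.repeat here is an assumption, since the patent only requires some repeat function with a repetition count of 5.

```python
import torch

first = torch.randn(1, 6, 128)     # first feature vector [Batch_size, Length, hidden_dim]
second = first.repeat(1, 1, 5)     # tile the feature dimension 5 times
print(second.shape)                # torch.Size([1, 6, 640]) = [Batch_size, Length, 5*hidden_dim]
```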
S305: inputting the second feature vector, the target output feature vector and the target character position vector into a feature decoder to obtain a prediction probability distribution result corresponding to the target text line image; the predicted probability distribution result includes a predicted position probability distribution and a predicted text content probability distribution for each character in the target text line image.
The network structure of the feature decoder is an autoregressive structure. And after the second feature vector, the target output feature vector and the target character position vector are obtained, inputting the second feature vector, the target output feature vector and the target character position vector into a feature decoder to obtain a prediction probability distribution result corresponding to the target text line image.
The prediction probability distribution result is represented by vectors and includes the predicted position probability distribution and the predicted text content probability distribution of each character in the target text line image. For example, taking the character "需" in the target text line image, its prediction probability distribution includes a predicted position probability distribution for each of the four coordinate values representing the character position. If the dictionary values representing coordinate values in the target dictionary are [0, 512] (num_bin is 512), the predicted position probability distribution of the first coordinate value of "需" consists of a probability for each dictionary value in [0, 512]. Similarly, the predicted position probability distribution of the second coordinate value is also a probability for each dictionary value in [0, 512]. The third and fourth coordinate values are similar and are not described here again.
It can be understood that the dictionary value with the highest probability value is the finally obtained prediction dictionary value, and the prediction coordinate value can be obtained according to the prediction dictionary value. For example, if the probability of the dictionary value "0" is the highest in the probability distribution of the predicted position of the first coordinate value, the dictionary value "0" is the predicted dictionary value corresponding to the first coordinate value, and if the dictionary value "0" corresponds to the coordinate value 0, the first predicted coordinate value of the character is 0.
The predicted text content probability distribution of "需" is the probability distribution over the value representing the text content of the character. If the dictionary values representing text content in the target dictionary are [513, 11047] (num_token is 11049), the predicted text content probability distribution of "需" consists of a probability for each dictionary value in [513, 11047]. It will be appreciated that the dictionary value with the highest probability is the resulting predicted dictionary value. For example, if the probability corresponding to the dictionary value "4999" in [513, 11047] is the highest, then "4999" is the predicted dictionary value for the text content, which corresponds to "\u9700". Thus, the predicted text content of the character is "需".
In this way, the character position recognition and the text content recognition of characters are unified into one classification problem. Under the same network mechanism, the OCR information of the target text line image, including the character position and the text content of each character, can be predicted end to end simultaneously. Compared with a pure recognition task, the corresponding character position is additionally output for each recognized character without increasing the decoding length, so the output information is richer.
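The following sketch, with an assumed tensor layout, shows how such a per-character, five-slot distribution could be read off to give the predicted coordinates and the predicted character id in a single decoding step.

```python
import torch

num_bin, num_token = 512, 11049
logits = torch.randn(1, 6, 5, num_token)             # [B, Length, 5, num_token] (assumed layout)

coord_logits = logits[..., :4, : num_bin + 1]         # first four slots, coordinate ids [0, 512]
content_logits = logits[..., 4, num_bin + 1 : num_token - 2]  # fifth slot, content ids

pred_coords = coord_logits.argmax(dim=-1)              # [B, Length, 4] predicted coordinate values
pred_chars = content_logits.argmax(dim=-1) + num_bin + 1   # [B, Length] predicted dictionary values
print(pred_coords.shape, pred_chars.shape)              # torch.Size([1, 6, 4]) torch.Size([1, 6])
```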
In one or more embodiments, the number of feature decoders is at least one, which is not limited herein and may be set according to practical situations. When the feature decoder is plural, the plural feature decoders are connected in series.
In a possible implementation manner, the embodiment of the present application provides a specific implementation manner that the second feature vector, the target output feature vector, and the target character position vector are input into a feature decoder to obtain a result of a prediction probability distribution corresponding to a target text line image, which is described in detail in D1-D4 below.
S306: and obtaining a loss value according to the label corresponding to the target text line image and the prediction probability distribution result corresponding to the target text line image.
After the prediction probability distribution result corresponding to the target text line image is obtained, a loss value can be obtained based on the label corresponding to the target text line image and the prediction probability distribution result corresponding to the target text line image, so as to train an end-to-end text recognition model through the loss value.
In a possible implementation manner, an embodiment of the present application provides a specific implementation manner for obtaining a loss value according to a label corresponding to a target text line image and a prediction probability distribution result corresponding to the target text line image, including:
b1: and acquiring a real probability distribution result of the target text line image based on the label corresponding to the target text line image.
And mapping the label corresponding to the target text line image into a real probability distribution result. Specifically, the label corresponding to the target text line image is also expressed by a vector to obtain a real probability distribution result of the target text line image. The dimension of the real probability distribution result of the target text line image is the same as the dimension of the prediction probability distribution result corresponding to the target text line image.
Specifically, the probability of the dictionary numerical value corresponding to each of the real character position and the real text content in the real probability distribution result of the target text line image is set to 1, and the remaining probabilities are set to 0, so that the real probability distribution result of the target text line image is obtained.
B2: and acquiring cross entropy loss based on the real probability distribution result of the target text line image and the prediction probability distribution result corresponding to the target text line image.
And acquiring cross entropy loss based on the real probability distribution result of the target text line image and the prediction probability distribution result corresponding to the target text line image. After cross entropy loss is obtained, an end-to-end text recognition model is trained based on the cross entropy loss.
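A minimal sketch of the loss computation under the usual cross-entropy formulation follows; treating the label dictionary values as one-hot targets and ignoring PAD positions are assumptions consistent with B1-B2.

```python
import torch
import torch.nn.functional as F

num_token, pad_id = 11049, 11048
logits = torch.randn(1, 6, 5, num_token)               # predicted distributions (pre-softmax)
targets = torch.randint(0, 11047, (1, 6, 5))            # true dictionary values from the label

loss = F.cross_entropy(logits.reshape(-1, num_token),   # [B*Length*5, num_token]
                       targets.reshape(-1),             # [B*Length*5]
                       ignore_index=pad_id)             # skip padded positions
print(loss.item())
```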
S307: training the feature extraction module, the feature encoder and the feature decoder based on the loss value, and repeatedly executing the step of inputting the target text line image into the feature extraction module to obtain a target input feature vector and the subsequent steps until a preset condition is reached.
The end-to-end text recognition model is trained based on the loss value; specifically, the feature extraction module, the feature encoder and the feature decoder are trained through the loss value. During the training of the end-to-end text recognition model, whether a preset condition is reached is judged. The preset condition is the model training end condition; if the preset condition is reached, training of the end-to-end text recognition model is stopped, and the trained end-to-end text recognition model and the trained model parameters are saved.
As an alternative example, the preset condition is that the loss value reaches a preset threshold value. As another alternative example, the preset condition is that the number of training times reaches a preset number. It is understood that the preset threshold and the preset number can be set according to practical situations, and are not limited herein.
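A hedged sketch of such a training loop follows; the model interface, optimizer, and data source are placeholders, and the two stop conditions mirror the alternatives above.

```python
import torch

def train(model, data_loader, max_steps=10000, loss_threshold=0.01):
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    for step, (images, labels) in enumerate(data_loader):
        loss = model.compute_loss(images, labels)   # placeholder: forward pass + cross entropy
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # preset condition: loss reaches a threshold or training count reaches a limit
        if loss.item() < loss_threshold or step + 1 >= max_steps:
            break
    torch.save(model.state_dict(), "end_to_end_text_recognition.pt")
```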
Based on the contents of S301 to S307, the embodiment of the present application provides an end-to-end text recognition model training method, where a target text line image is input into a feature extraction module to obtain a target input feature vector; a target output feature vector is acquired based on the label corresponding to the target text line image; and a target character position vector corresponding to the target text line image is acquired. The target input feature vector and the target character position vector are input into a feature encoder to obtain a first feature vector, and a repeat operation is then performed on the first feature vector to obtain a second feature vector. The second feature vector, the target output feature vector and the target character position vector are input into a feature decoder to obtain a prediction probability distribution result corresponding to the target text line image. A loss value is obtained according to the label corresponding to the target text line image and the prediction probability distribution result corresponding to the target text line image, and the feature extraction module, the feature encoder and the feature decoder are trained based on the loss value; the training process is repeatedly executed until a preset condition is reached. The target output feature vector is formed by splicing the real position vector and the real text content vector corresponding to each character in the target text line image. The prediction probability distribution result output by the feature decoder comprises the predicted position probability distribution and the predicted text content probability distribution of each character in the target text line image. Therefore, based on the trained end-to-end text recognition model, the position and the content of each character in a text line can be obtained simultaneously within one decoding step, which reduces the complexity of text recognition and improves its efficiency.
To facilitate understanding of the end-to-end text recognition model provided in the embodiment of the present application, referring to fig. 5, fig. 5 is a schematic structural diagram of an end-to-end text recognition model provided in the embodiment of the present application.
In one or more embodiments, when the end-to-end text recognition model is implemented by a Transformer model, the feature encoder may include a first multi-headed attention module and a first feed-forward network module, as shown in FIG. 5. Wherein the first Multi-Head Attention module is composed of a Multi-Head Attention layer and an Add & Norm layer. The first feedforward network module consists of a Feed Forward layer and an Add & Norm layer.
Based on this, an embodiment of the present application provides a specific implementation manner in which the target input feature vector and the target character position vector are input into the feature encoder in S302 to obtain the first feature vector, including:
c1: and splicing the target input feature vector and the target character position vector, inputting the spliced vector into the first multi-head attention module, and acquiring an output vector of the first multi-head attention module.
As an alternative example, the stitched vector is input into the Multi-Head Attention layer of the first Multi-Head Attention module. And then, the vector output by the Multi-Head Attention layer and the spliced vector are input into an Add & Norm layer of the first Multi-Head Attention module together, so that the Add & Norm layer sums and normalizes the input vector to obtain the output vector of the first Multi-Head Attention module.
C2: and inputting the output vector of the first multi-head attention module into the first feedforward network module to obtain a first feature vector output by the first feedforward network module.
As an alternative example, the output vector of the first multi-headed attention module is input to the Feed Forward layer of the first Feed Forward network module. And then, the vector output by the Feed Forward layer and the output vector of the first multi-head attention module are input into an Add & Norm layer of the first feedforward network module together, so that the Add & Norm layer sums and normalizes the input vectors, and a first feature vector output by the first feedforward network module is obtained.
Based on C1-C2, the first feature vector can be obtained by inputting the target input feature vector and the target character position vector into the feature encoder.
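For ease of understanding only, the following is a minimal PyTorch sketch of an encoder block of the kind described in C1-C2. The class name, layer sizes and the additive combination of the feature and position vectors are assumptions introduced here for illustration (the application describes splicing the two vectors); this is a sketch, not a definitive implementation of the application.

```python
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Illustrative feature-encoder block: multi-head attention with Add & Norm (C1),
    then a feed-forward network with Add & Norm (C2)."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x, pos):
        # C1: combine the target input feature vector with the character position
        # vector (addition here for simplicity), apply multi-head attention,
        # then sum with the residual and normalize (Add & Norm).
        h = x + pos
        attn_out, _ = self.attn(h, h, h)
        h = self.norm1(h + attn_out)
        # C2: feed-forward network, then Add & Norm, giving the first feature vector.
        return self.norm2(h + self.ff(h))
```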
In one or more implementations, when the end-to-end text recognition model is implemented via a Transformer model, the feature decoder may include a second multi-headed attention module, a third multi-headed attention module, and a second feed-forward network module, the end-to-end text recognition model further including a regression module. As an alternative example, the second Multi-headed Attention module consists of a Masked Multi-Head Attention layer and an Add & Norm layer. The third Multi-headed Attention module consists of a Multi-Head Attention layer and an Add & Norm layer. The second feedforward network module consists of a Feed Forward layer and an Add & Norm layer. The regression module consists of a Linear layer and a Softmax layer.
Based on this, in a possible implementation manner, the embodiment of the present application provides a specific implementation of inputting the second feature vector, the target output feature vector and the target character position vector into the feature decoder in S305 to obtain the prediction probability distribution result corresponding to the target text line image, including D1-D4:
d1: and splicing the target output feature vector and the target character position vector, inputting the spliced vector into the second multi-head attention module, and acquiring a third feature vector output by the second multi-head attention module.
As an alternative example, the target output feature vector and the target character position vector are spliced. The spliced vector is input into the Masked Multi-Head Attention layer of the second multi-head attention module, and the vector output by the Masked Multi-Head Attention layer and the spliced vector are then input together into the Add & Norm layer of the second multi-head attention module to obtain the third feature vector output by the second multi-head attention module.
D2: and inputting the second feature vector and the third feature vector into a third multi-head attention module to obtain a fourth feature vector output by the third multi-head attention module.
As an alternative example, after obtaining the third feature vector, the second feature vector and the third feature vector are input into a Multi-Head Attention layer in a third Multi-headed Attention module. And then, the vector output by the Multi-Head Attention layer and the third feature vector are input into an Add & Norm layer of a third Multi-Head Attention module together to obtain a fourth feature vector output by the third Multi-Head Attention module.
D3: and inputting the fourth feature vector into the second feedforward network module to obtain a fifth feature vector output by the second feedforward network module.
As an alternative example, after the fourth feature vector is obtained, the fourth feature vector is input into the Feed Forward layer of the second Feed Forward network module. And then, inputting the vector output by the Feed Forward layer and the fourth feature vector into an Add & Norm layer of a second feedforward network module together to obtain a fifth feature vector output by the second feedforward network module.
D4: and inputting the fifth feature vector into a regression module, and acquiring a prediction probability distribution result corresponding to the target text line image output by the regression module.
As an optional example, the fifth feature vector is input into the Linear layer of the regression module, and the vector output by the Linear layer is then input into the Softmax layer of the regression module, so as to obtain the prediction probability distribution result corresponding to the target text line image output by the regression module.
It is understood that the second multi-headed attention module is used for learning the internal relationship of the tags corresponding to the target text line image, and the third multi-headed attention module is used for learning the relationship between the target text line image and the tags corresponding to the target text line image.
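Again for illustration only, a minimal PyTorch sketch of a decoder block along the lines of D1-D4 is given below. The class and parameter names are assumptions, the position vector is added rather than spliced for brevity, and the Linear + Softmax regression module is folded into the same class; this is a sketch under those assumptions, not the implementation of the application.

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """Illustrative feature-decoder block: masked self-attention (D1),
    cross-attention with the encoder output (D2), feed-forward (D3),
    and a Linear + Softmax regression head (D4)."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, vocab_size=6000):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                nn.Linear(d_ff, d_model))
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tgt, pos, memory, tgt_mask=None):
        # D1: masked multi-head attention over target output + position, Add & Norm.
        q = tgt + pos
        a, _ = self.self_attn(q, q, q, attn_mask=tgt_mask)
        x = self.norm1(q + a)
        # D2: cross-attention between the third feature vector and the second
        # (encoder) feature vector, Add & Norm.
        a, _ = self.cross_attn(x, memory, memory)
        x = self.norm2(x + a)
        # D3: feed-forward network, Add & Norm.
        x = self.norm3(x + self.ff(x))
        # D4: Linear layer followed by Softmax gives the prediction probability
        # distribution over character positions and text content.
        return torch.softmax(self.head(x), dim=-1)
```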
Based on the end-to-end text recognition model training method provided by the embodiment of the method, the embodiment of the application also provides an end-to-end text recognition method. In order to facilitate understanding of the present application, an end-to-end text recognition method provided by the embodiments of the present application is described below with reference to the drawings.
Referring to fig. 6, which is a flowchart of an end-to-end text recognition method provided in an embodiment of the present application, as shown in fig. 6, the method may include S601-S603:
s601: and acquiring a character position vector corresponding to the image to be recognized.
In one or more embodiments, the image to be recognized is a text line image. As an alternative example, the character position vector corresponding to the image to be recognized may be obtained by absolute position coding or relative position coding.
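As one possible illustration of absolute position coding, a sinusoidal encoding is sketched below; the specific formula is an assumption made here for illustration, and the application equally allows relative position coding.

```python
import torch

def sinusoidal_position_encoding(seq_len, d_model):
    """Build an absolute character position vector with sinusoidal encoding
    (illustrative only)."""
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)   # (seq_len, 1)
    dim = torch.arange(0, d_model, 2, dtype=torch.float32)          # even dimensions
    angle = pos / torch.pow(10000.0, dim / d_model)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angle)
    pe[:, 1::2] = torch.cos(angle)
    return pe
```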
S602: inputting an image to be recognized and a character position vector corresponding to the image to be recognized into an end-to-end text recognition model, and obtaining a probability distribution result of each character in the image to be recognized, which is output by the end-to-end text recognition model; the probability distribution result includes a position probability distribution of the character and a text content probability distribution.
In a specific implementation, the image to be recognized is input into the feature extraction module to obtain an input feature vector. The input feature vector is input into the feature encoder for further semantic feature extraction, and the feature vector output by the feature encoder is obtained. A repeat operation is then performed on this feature vector to obtain a target feature vector that meets the dimension requirement.
Based on the specific network structure of the end-to-end text recognition model shown in fig. 5, the high-dimensional vector corresponding to the "Start" token is input into the second multi-head attention module of the feature decoder, and the vector output by the second multi-head attention module and the target feature vector are then input together into the third multi-head attention module. After the subsequent operations are executed, the probability distribution result of the first character in the image to be recognized is finally obtained. The probability distribution result includes the position probability distribution and the text content probability distribution of the first character.
The prediction dictionary values of the first character are obtained according to the probability distribution result of the first character. These include four prediction dictionary values representing the character position of the first character and one prediction dictionary value representing its text content. The "Start" token and the prediction dictionary values of the first character are then spliced, and the spliced vector is mapped to the high-dimensional space and input into the feature decoder again, so as to obtain the prediction dictionary values of the second character in the image to be recognized. By analogy, the prediction dictionary values corresponding to each character in the image to be recognized can finally be obtained.
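A hedged sketch of this autoregressive loop is shown below. The `model.decode` interface, the tensor shapes and the greedy argmax selection are all assumptions made for illustration; the description above only requires that each decoding step yield five dictionary values per character.

```python
import torch

def greedy_decode(model, memory, start_id, end_id, max_chars=50):
    """Illustrative greedy decoding: each step is assumed to return probability
    distributions for the five tokens of one character (four position values
    plus one text-content value), which are fed back into the decoder."""
    seq = torch.tensor([[start_id]])              # begin with the "Start" token
    characters = []
    for _ in range(max_chars):
        probs = model.decode(seq, memory)         # assumed shape (1, 5, vocab_size)
        step = probs.argmax(dim=-1)               # five prediction dictionary values
        if end_id in step[0].tolist():
            break
        characters.append(step[0].tolist())       # [pos1, pos2, pos3, pos4, content]
        seq = torch.cat([seq, step], dim=1)       # splice and decode the next character
    return characters
```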
S603: and acquiring a character detection result and a character recognition result of each character in the image to be recognized according to the probability distribution result of each character in the image to be recognized.
In specific implementation, according to the probability distribution result of each character in the image to be recognized, the prediction dictionary numerical value of each character can be obtained. And then, according to the corresponding relation between the dictionary numerical value and the coordinate value and the corresponding relation between the dictionary numerical value and the text content, the character detection result and the character recognition result of each character in the image to be recognized can be obtained.
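The mapping step itself can be as simple as a table lookup. The sketch below is purely hypothetical: the concrete dictionary layout (how coordinate values and text content are arranged within the dictionary) is a design choice not fixed by the description here, so the offset and lookup table are assumptions.

```python
def decode_character(dict_values, coord_offset, id_to_char):
    """Hypothetical mapping from one character's five prediction dictionary
    values to a detection box (character detection result) and a text
    character (character recognition result)."""
    *coord_ids, content_id = dict_values
    box = [v - coord_offset for v in coord_ids]   # dictionary value -> coordinate value
    text = id_to_char[content_id]                 # dictionary value -> text content
    return box, text

# Usage with assumed values: coordinate dictionary values start at an offset of 1000.
box, text = decode_character([1010, 1020, 1110, 1150, 7],
                             coord_offset=1000, id_to_char={7: "A"})
```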
As an alternative example, as shown in fig. 5, the character detection result and the character recognition result of each character may also be visualized to obtain the text content and the position of each character. The position of each character can be represented by the bounding box of the character shown in fig. 5.
It is understood that the end-to-end text recognition model is trained according to the end-to-end text recognition model training method of any one of the above embodiments. For specific technical details, reference may be made to the above-described embodiments, which are not described herein again.
Based on the contents of the above S601-S603, the embodiment of the present application provides an end-to-end text recognition method, which is implemented based on the end-to-end text recognition model trained in the above embodiment. The method comprises the steps of firstly obtaining a character position vector corresponding to an image to be recognized, then inputting the image to be recognized and the character position vector corresponding to the image to be recognized into an end-to-end text recognition model, and obtaining a probability distribution result of each character in the image to be recognized, wherein the probability distribution result is output by the end-to-end text recognition model. Wherein the probability distribution result comprises the position probability distribution of the character and the text content probability distribution. Furthermore, according to the probability distribution result of each character in the image to be recognized, the character detection result and the character recognition result of each character in the image to be recognized can be obtained. Therefore, the position and the content of each character in the text line can be simultaneously obtained in one decoding step length based on the trained end-to-end text recognition model, the complexity of text recognition is reduced, and the efficiency of text recognition is improved.
In order to facilitate understanding of how the end-to-end text recognition model provided in the embodiment of the present application reduces the complexity of text recognition and improves its efficiency, the computation amount of the Pix2seq method and that of the end-to-end text recognition model provided in the embodiment of the present application are compared below.
Before analyzing the computational complexity of the Pix2seq method, the computational complexity of the conventional Transformer structure is analyzed. A traditional Transformer consists of an Embedding layer and several Transformer layers. The Embedding layer involves relatively little computation, so the analysis here focuses on the Transformer layers. Ignoring residual connections, Layer Normalization and other layers with a small amount of computation, each Transformer layer mainly comprises a Multi-Head Attention layer and a Feed Forward Network layer.
Let n be the length of the input embedding sequence, d be head_size, and h be the number of heads. For example, d may be 64 and h may be 8, so that D = d × h = 64 × 8 = 512, i.e., hidden_size. Theoretical analysis shows that the computation amount of one Encoder layer is 12nD² + 2n²D.
Let q be the length of the currently decoded output embedding sequence. Theoretical analysis shows that the computation amount of one Decoder layer is 14qD² + 2q²D + 2nD² + 2nqD.
Let the maximum decoding length be L, with N₁ Encoder layers and N₂ Decoder layers in total. Taking N₁ = 6, N₂ = 6, D = 512, L = 42 and n = 128, the computation amount of the traditional Transformer network during forward inference, calculated from the above formulas, is 0.91 GFLOPs (1 GFLOPs = 10⁹ FLOPs, i.e., one billion floating-point operations).
In the Pix2seq method, the four coordinate values of each character share the same dictionary with the text content. As shown in fig. 1, the number of decoding steps is five times that of a pure recognition task. In order to obtain the character position of a single character while recognizing its text content, the maximum decoding length of the Pix2seq method therefore changes from L to 5L. The corresponding computation amount increases from 0.91 GFLOPs to 35.8 GFLOPs, increasing the time complexity by a factor of about 40.
As shown in fig. 4, in the end-to-end text recognition method provided in the embodiment of the present application, each decoding step directly predicts 5 tokens, i.e., the character position and the text content of a single character, such as the upper-left corner coordinates, the lower-left corner coordinates and the text content of each character, whereas the Pix2seq method requires 5 decoding steps to obtain the character position and text content of a single character. Compared with a pure recognition task, the computation amount of the end-to-end text recognition method provided in the embodiment of the present application increases from 0.91 GFLOPs to 4.33 GFLOPs, i.e., the time complexity increases by a factor of about 5. It can be seen that, compared with the Pix2seq method, the time complexity of the end-to-end text recognition method provided in the embodiment of the present application is reduced.
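The per-layer expressions above can be evaluated directly, as in the small sketch below. How the per-layer costs are accumulated over layers and decoding steps to produce the quoted totals is not spelled out in the text, so only the per-layer formulas are reproduced; the example values of D, n and q are taken from the analysis above.

```python
def encoder_layer_flops(n, D):
    """Computation amount of one Encoder layer: 12nD^2 + 2n^2D."""
    return 12 * n * D ** 2 + 2 * n ** 2 * D

def decoder_layer_flops(q, n, D):
    """Computation amount of one Decoder layer: 14qD^2 + 2q^2D + 2nD^2 + 2nqD."""
    return 14 * q * D ** 2 + 2 * q ** 2 * D + 2 * n * D ** 2 + 2 * n * q * D

print(encoder_layer_flops(n=128, D=512) / 1e9, "GFLOPs per Encoder layer")
print(decoder_layer_flops(q=42, n=128, D=512) / 1e9, "GFLOPs per Decoder layer")
```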
Based on the end-to-end text recognition model training method provided by the embodiment of the method, the embodiment of the application also provides an end-to-end text recognition model training device, and the end-to-end text recognition model training device is explained by combining the attached drawings.
Referring to fig. 7, this figure is a schematic structural diagram of an end-to-end text recognition model training apparatus according to an embodiment of the present application. As shown in fig. 7, the end-to-end text recognition model training apparatus includes:
a first obtaining unit 701, configured to input a target text line image into a feature extraction module, and obtain a target input feature vector;
a second obtaining unit 702, configured to obtain a target character position vector corresponding to the target text line image, and input the target input feature vector and the target character position vector into a feature encoder, so as to obtain a first feature vector;
a third obtaining unit 703, configured to obtain a target output feature vector based on a tag corresponding to the target text line image; the label corresponding to the target text line image comprises a real character position and real text content of each character in the target text line image; the target output characteristic vector is formed by splicing a real position vector corresponding to each character in the target text line image and a real text content vector;
a fourth obtaining unit 704, configured to perform a repeat operation on the first feature vector to obtain a second feature vector; the dimension of the second feature vector is the same as the dimension of the target output feature vector;
an input unit 705, configured to input the second feature vector, the target output feature vector, and the target character position vector into a feature decoder, and obtain a prediction probability distribution result corresponding to the target text line image; the prediction probability distribution result comprises the prediction position probability distribution and the prediction text content probability distribution of each character in the target text line image;
a fifth obtaining unit 706, configured to obtain a loss value according to a label corresponding to the target text line image and a prediction probability distribution result corresponding to the target text line image;
a training unit 707, configured to train the feature extraction module, the feature encoder and the feature decoder based on the loss value, and to repeatedly execute the step of inputting the target text line image into the feature extraction module to obtain a target input feature vector and the subsequent steps, until a preset condition is reached.
In a possible implementation manner, the third obtaining unit 703 includes:
the constructing subunit is used for constructing a target dictionary, converting the labels corresponding to the target text line images into corresponding dictionary numerical values in the target dictionary based on the target dictionary, and acquiring numerical vectors corresponding to the labels;
and the conversion subunit is used for converting the numerical vectors corresponding to the labels into vectors of a high-dimensional space to obtain target output characteristic vectors.
In a possible implementation manner, the fourth obtaining unit 704 is specifically configured to:
and setting a repetition frequency parameter in a repeated operation function, inputting the first feature vector into the set repeated operation function, and acquiring a second feature vector.
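As a purely illustrative example of such a repeat-operation function, torch.Tensor.repeat with a configurable repetition-count parameter could be used; the shapes and the factor of 5 below are assumptions made for illustration, not values specified by the application.

```python
import torch

first_feature = torch.randn(1, 32, 512)        # (batch, length, hidden), illustrative
repeats = 5                                    # repetition-count parameter
second_feature = first_feature.repeat(1, repeats, 1)
print(second_feature.shape)                    # torch.Size([1, 160, 512])
```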
In one possible implementation, the feature encoder includes a first multi-headed attention module and a first feed-forward network module; the second obtaining unit 702 includes:
the first splicing subunit is configured to splice the target input feature vector and the target character position vector, input the spliced vector into the first multi-head attention module, and acquire an output vector of the first multi-head attention module;
and the first input subunit is used for inputting the output vector of the first multi-head attention module into the first feedforward network module and acquiring a first feature vector output by the first feedforward network module.
In one possible implementation, the feature decoder includes a second multi-headed attention module, a third multi-headed attention module, and a second feed-forward network module, and the end-to-end text recognition model further includes a regression module; the input unit 705 includes:
the second splicing subunit is configured to splice the target output feature vector and the target character position vector, input the spliced vectors into the second multi-head attention module, and acquire a third feature vector output by the second multi-head attention module;
a second input subunit, configured to input the second feature vector and the third feature vector into the third multi-head attention module, and obtain a fourth feature vector output by the third multi-head attention module;
a third input subunit, configured to input the fourth feature vector into the second feedforward network module, and obtain a fifth feature vector output by the second feedforward network module;
and the fourth input subunit is configured to input the fifth feature vector into the regression module, and obtain a prediction probability distribution result corresponding to the target text line image output by the regression module.
In a possible implementation manner, the fifth obtaining unit 706 includes:
the first acquiring subunit is configured to acquire a true probability distribution result of the target text line image based on a label corresponding to the target text line image;
and the second obtaining subunit is configured to obtain the cross entropy loss based on the actual probability distribution result of the target text line image and the prediction probability distribution result corresponding to the target text line image.
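A minimal sketch of such a cross entropy loss is given below; the shapes are illustrative, and torch.nn.functional.cross_entropy is applied to raw scores (it applies the softmax internally), which is a common practical substitute for comparing the two probability distributions directly.

```python
import torch
import torch.nn.functional as F

pred_scores = torch.randn(1, 10, 6000, requires_grad=True)  # (batch, tokens, dictionary size)
target_ids = torch.randint(0, 6000, (1, 10))                # dictionary values from the label
loss = F.cross_entropy(pred_scores.view(-1, 6000), target_ids.view(-1))
loss.backward()   # gradients of the loss are used to train the three modules
```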
In a possible implementation manner, the first obtaining unit 701 includes:
the third acquisition subunit is used for acquiring a target text line image, and performing scaling operation and/or filling operation on the target text line image to acquire a preprocessed target text line image;
and the fourth acquisition subunit is used for inputting the preprocessed target text line image into the feature extraction module and acquiring a target input feature vector.
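For illustration, a hypothetical preprocessing routine combining a scaling operation and a filling operation is sketched below; the target height and width are assumptions, not values given in the application.

```python
import torch
import torch.nn.functional as F

def preprocess_line_image(img, target_h=32, target_w=512):
    """Scale a text line image tensor (C, H, W) to a fixed height, then
    right-pad its width with zeros (scaling + filling operations)."""
    c, h, w = img.shape
    new_w = min(target_w, max(1, round(w * target_h / h)))
    img = F.interpolate(img.unsqueeze(0), size=(target_h, new_w),
                        mode="bilinear", align_corners=False).squeeze(0)
    padded = torch.zeros(c, target_h, target_w)
    padded[:, :, :new_w] = img
    return padded
```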
Based on the end-to-end text recognition method provided by the above method embodiment, the embodiment of the present application further provides an end-to-end text recognition apparatus, and the end-to-end text recognition apparatus will be described with reference to the accompanying drawings.
Referring to fig. 8, the figure is a schematic structural diagram of an end-to-end text recognition apparatus provided in an embodiment of the present application. As shown in fig. 8, the end-to-end text recognition apparatus includes:
a first obtaining unit 801, configured to obtain a character position vector corresponding to an image to be recognized;
a second obtaining unit 802, configured to input the image to be recognized and the character position vector corresponding to the image to be recognized into an end-to-end text recognition model, and obtain a probability distribution result of each character in the image to be recognized, which is output by the end-to-end text recognition model; the probability distribution result comprises the position probability distribution and the text content probability distribution of the characters;
a third obtaining unit 803, configured to obtain a character detection result and a character recognition result of each character in the image to be recognized according to a probability distribution result of each character in the image to be recognized;
the end-to-end text recognition model is obtained by training according to any one of the end-to-end text recognition model training methods.
Based on the end-to-end text recognition model training method and the end-to-end text recognition method provided by the embodiment of the method, the application also provides electronic equipment, which comprises the following steps: one or more processors; a storage device having one or more programs stored thereon, which when executed by the one or more processors, cause the one or more processors to implement the end-to-end text recognition model training method of any of the embodiments described above, or the end-to-end text recognition method of any of the embodiments described above.
Referring now to FIG. 9, shown is a schematic diagram of an electronic device 1300 suitable for use in implementing embodiments of the present application. The terminal device in the embodiment of the present application may include, but is not limited to, a mobile terminal such as a mobile phone, a notebook computer, a Digital broadcast receiver, a PDA (Personal Digital Assistant), a PAD (Portable android device), a PMP (Portable multimedia Player), a car terminal (e.g., car navigation terminal), and the like, and a fixed terminal such as a Digital TV (television), a desktop computer, and the like. The electronic device shown in fig. 9 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.
As shown in fig. 9, electronic device 1300 may include a processing means (e.g., central processing unit, graphics processor, etc.) 1301 that may perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 1302 or a program loaded from a storage device 1306 into a Random Access Memory (RAM) 1303. Various programs and data necessary for the operation of the electronic apparatus 1300 are also stored in the RAM 1303. The processing device 1301, the ROM 1302 and the RAM 1303 are connected to each other via a bus 1304. An input/output (I/O) interface 1305 is also connected to the bus 1304.
Generally, the following devices may be connected to the I/O interface 1305: input devices 1306 including, for example, touch screens, touch pads, keyboards, mice, cameras, microphones, accelerometers, gyroscopes, and the like; an output device 1307 including, for example, a Liquid Crystal Display (LCD), speaker, vibrator, etc.; storage devices 1306 including, for example, magnetic tape, hard disk, etc.; and a communication device 1309. The communications device 1309 may allow the electronic device 1300 to communicate wirelessly or by wire with other devices to exchange data. While fig. 9 illustrates an electronic device 1300 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may be alternatively implemented or provided.
In particular, according to embodiments of the present application, the processes described above with reference to the flow diagrams may be implemented as computer software programs. For example, embodiments of the present application include a computer program product comprising a computer program carried on a non-transitory computer readable medium, the computer program containing program code for performing the method illustrated by the flow chart. In such embodiments, the computer program may be downloaded and installed from a network via the communication device 1309, or installed from the storage device 1306, or installed from the ROM 1302. The computer program, when executed by the processing apparatus 1301, performs the above-described functions defined in the methods of the embodiments of the present application.
The electronic device provided by the embodiment of the present application and the end-to-end text recognition model training method and the end-to-end text recognition method provided by the above embodiment belong to the same inventive concept, and technical details that are not described in detail in the present embodiment can be referred to the above embodiment, and the present embodiment and the above embodiment have the same beneficial effects.
Based on the end-to-end text recognition model training method and the end-to-end text recognition method provided by the above method embodiments, embodiments of the present application provide a computer readable medium, on which a computer program is stored, where the program is executed by a processor to implement the end-to-end text recognition model training method according to any of the above embodiments, or the end-to-end text recognition method according to any of the above embodiments.
It should be noted that the computer readable medium mentioned above in the present application may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
In some embodiments, the clients, servers may communicate using any currently known or future developed network Protocol, such as HTTP (HyperText Transfer Protocol), and may interconnect with any form or medium of digital data communication (e.g., a communications network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), the Internet (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed network.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to perform the end-to-end text recognition model training method or the end-to-end text recognition method.
Computer program code for carrying out operations for aspects of the present application may be written in any combination of one or more programming languages, including but not limited to an object oriented programming language such as Java, Smalltalk, C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present application may be implemented by software or hardware. Where the name of a unit/module does not in some cases constitute a limitation on the unit itself, for example, a voice data collection module may also be described as a "data collection module".
The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
In the context of this application, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
According to one or more embodiments of the present application, [ example one ] there is provided an end-to-end text recognition model training method, the method comprising:
inputting a target text line image into a feature extraction module to obtain a target input feature vector;
acquiring a target character position vector corresponding to the target text line image, and inputting the target input feature vector and the target character position vector into a feature encoder to obtain a first feature vector;
acquiring a target output characteristic vector based on a label corresponding to the target text line image; the label corresponding to the target text line image comprises a real character position and real text content of each character in the target text line image; the target output characteristic vector is formed by splicing a real position vector corresponding to each character in the target text line image and a real text content vector;
repeating the operation on the first characteristic vector to obtain a second characteristic vector; the dimension of the second feature vector is the same as the dimension of the target output feature vector;
inputting the second feature vector, the target output feature vector and the target character position vector into a feature decoder to obtain a prediction probability distribution result corresponding to the target text line image; the prediction probability distribution result comprises the prediction position probability distribution and the prediction text content probability distribution of each character in the target text line image;
obtaining a loss value according to the label corresponding to the target text line image and the prediction probability distribution result corresponding to the target text line image;
training the feature extraction module, the feature encoder and the feature decoder based on the loss value, repeatedly executing the step of inputting the target text line image into the feature extraction module, obtaining a target input feature vector and the subsequent steps until a preset condition is reached.
According to one or more embodiments of the present application, in example two, there is provided an end-to-end text recognition model training method, where obtaining a target output feature vector based on a label corresponding to the target text line image includes:
constructing a target dictionary, converting the labels corresponding to the target text line images into corresponding dictionary numerical values in the target dictionary based on the target dictionary, and acquiring numerical value vectors corresponding to the labels;
and converting the numerical value vector corresponding to the label into a vector of a high-dimensional space to obtain a target output characteristic vector.
According to one or more embodiments of the present application, in example three, there is provided an end-to-end text recognition model training method, where repeating the operation on the first feature vector to obtain a second feature vector includes:
and setting a repetition frequency parameter in a repeated operation function, inputting the first feature vector into the set repeated operation function, and acquiring a second feature vector.
According to one or more embodiments of the present application, [ example four ] there is provided an end-to-end text recognition model training method, the feature encoder comprising a first multi-headed attention module and a first feed-forward network module; the inputting the target input feature vector and the target character position vector into a feature encoder to obtain a first feature vector includes:
splicing the target input feature vector and the target character position vector, inputting the spliced vector into the first multi-head attention module, and acquiring an output vector of the first multi-head attention module;
and inputting the output vector of the first multi-head attention module into the first feedforward network module to obtain a first feature vector output by the first feedforward network module.
According to one or more embodiments of the present application, [ example five ] there is provided an end-to-end text recognition model training method, the feature decoder comprising a second multi-headed attention module, a third multi-headed attention module, and a second feed-forward network module, the end-to-end text recognition model further comprising a regression module; the inputting the second feature vector, the target output feature vector and the target character position vector into a feature decoder to obtain a prediction probability distribution result corresponding to the target text line image includes:
splicing the target output feature vector and the target character position vector, and inputting the spliced vector into the second multi-head attention module to obtain a third feature vector output by the second multi-head attention module;
inputting the second feature vector and the third feature vector into the third multi-head attention module to obtain a fourth feature vector output by the third multi-head attention module;
inputting the fourth feature vector into the second feedforward network module to obtain a fifth feature vector output by the second feedforward network module;
and inputting the fifth feature vector into the regression module, and acquiring a prediction probability distribution result corresponding to the target text line image output by the regression module.
According to one or more embodiments of the present application, [ example six ] an end-to-end text recognition model training method is provided, where the obtaining a loss value according to a label corresponding to the target text line image and a prediction probability distribution result corresponding to the target text line image includes:
acquiring a real probability distribution result of the target text line image based on the label corresponding to the target text line image;
and acquiring cross entropy loss based on the real probability distribution result of the target text line image and the prediction probability distribution result corresponding to the target text line image.
According to one or more embodiments of the present application, in an example seven, there is provided an end-to-end text recognition model training method, where the step of inputting a target text line image into a feature extraction module to obtain a target input feature vector includes:
acquiring a target text line image, and performing scaling operation and/or filling operation on the target text line image to obtain a preprocessed target text line image;
and inputting the preprocessed target text line image into a feature extraction module to obtain a target input feature vector.
According to one or more embodiments of the present application, [ example eight ] there is provided an end-to-end text recognition method, the method comprising:
acquiring a character position vector corresponding to an image to be recognized;
inputting the image to be recognized and the character position vector corresponding to the image to be recognized into an end-to-end text recognition model, and obtaining a probability distribution result of each character in the image to be recognized, which is output by the end-to-end text recognition model; the probability distribution result comprises the position probability distribution and the text content probability distribution of the characters;
acquiring a character detection result and a character recognition result of each character in the image to be recognized according to the probability distribution result of each character in the image to be recognized;
the end-to-end text recognition model is obtained by training according to any one of the end-to-end text recognition model training methods.
According to one or more embodiments of the present application, [ example nine ] there is provided an end-to-end text recognition model training apparatus, the apparatus comprising:
the first acquisition unit is used for inputting the target text line image into the feature extraction module and acquiring a target input feature vector;
the second acquisition unit is used for acquiring a target character position vector corresponding to the target text line image, inputting the target input feature vector and the target character position vector into a feature encoder, and acquiring a first feature vector;
the third acquisition unit is used for acquiring a target output characteristic vector based on the label corresponding to the target text line image; the label corresponding to the target text line image comprises a real character position and real text content of each character in the target text line image; the target output characteristic vector is formed by splicing a real position vector corresponding to each character in the target text line image and a real text content vector;
a fourth obtaining unit, configured to perform repeated operation on the first feature vector to obtain a second feature vector; the dimension of the second feature vector is the same as the dimension of the target output feature vector;
the input unit is used for inputting the second feature vector, the target output feature vector and the target character position vector into a feature decoder to obtain a prediction probability distribution result corresponding to the target text line image; the prediction probability distribution result comprises the prediction position probability distribution and the prediction text content probability distribution of each character in the target text line image;
a fifth obtaining unit, configured to obtain a loss value according to a label corresponding to the target text line image and a prediction probability distribution result corresponding to the target text line image;
and the training unit is used for training the feature extraction module, the feature encoder and the feature decoder based on the loss value, and repeatedly executing the step of inputting the target text line image into the feature extraction module to obtain a target input feature vector and the subsequent steps, until a preset condition is reached.
According to one or more embodiments of the present application, there is provided [ example ten ] an end-to-end text recognition model training apparatus, the third obtaining unit including:
the constructing subunit is used for constructing a target dictionary, converting the labels corresponding to the target text line images into corresponding dictionary numerical values in the target dictionary based on the target dictionary, and acquiring numerical vectors corresponding to the labels;
and the conversion subunit is used for converting the numerical vectors corresponding to the labels into vectors of a high-dimensional space to obtain target output characteristic vectors.
According to one or more embodiments of the present application, in example eleven, there is provided an end-to-end text recognition model training apparatus, where the fourth obtaining unit is specifically configured to:
and setting a repetition frequency parameter in a repeated operation function, inputting the first feature vector into the set repeated operation function, and acquiring a second feature vector.
According to one or more embodiments of the present application, [ example twelve ] there is provided an end-to-end text recognition model training apparatus, the feature encoder comprising a first multi-head attention module and a first feed-forward network module; the second acquisition unit includes:
the first splicing subunit is configured to splice the target input feature vector and the target character position vector, input the spliced vector into the first multi-head attention module, and acquire an output vector of the first multi-head attention module;
and the first input subunit is used for inputting the output vector of the first multi-head attention module into the first feedforward network module and acquiring a first feature vector output by the first feedforward network module.
According to one or more embodiments of the present application, [ example thirteen ] there is provided an end-to-end text recognition model training apparatus, the feature decoder comprising a second multi-headed attention module, a third multi-headed attention module, and a second feed-forward network module, the end-to-end text recognition model further comprising a regression module; the input unit includes:
the second splicing subunit is configured to splice the target output feature vector and the target character position vector, input the spliced vector into the second multi-head attention module, and acquire a third feature vector output by the second multi-head attention module;
a second input subunit, configured to input the second feature vector and the third feature vector into the third multi-head attention module, and obtain a fourth feature vector output by the third multi-head attention module;
a third input subunit, configured to input the fourth feature vector into the second feedforward network module, and obtain a fifth feature vector output by the second feedforward network module;
and the fourth input subunit is configured to input the fifth feature vector into the regression module, and obtain a prediction probability distribution result corresponding to the target text line image output by the regression module.
According to one or more embodiments of the present application, in [ example fourteen ] there is provided an end-to-end text recognition model training apparatus, the fifth obtaining unit including:
the first obtaining subunit is configured to obtain a true probability distribution result of the target text line image based on a label corresponding to the target text line image;
and the second obtaining subunit is configured to obtain the cross entropy loss based on the actual probability distribution result of the target text line image and the prediction probability distribution result corresponding to the target text line image.
According to one or more embodiments of the present application, [ example fifteen ] there is provided an end-to-end text recognition model training apparatus, the first obtaining unit including:
the third acquisition subunit is used for acquiring a target text line image, and performing scaling operation and/or filling operation on the target text line image to acquire a preprocessed target text line image;
and the fourth acquisition subunit is used for inputting the preprocessed target text line image into the feature extraction module and acquiring a target input feature vector.
According to one or more embodiments of the present application, [ example sixteen ] there is provided an end-to-end text recognition apparatus, the apparatus comprising:
the first acquisition unit is used for acquiring a character position vector corresponding to an image to be recognized;
a second obtaining unit, configured to input the image to be recognized and the character position vector corresponding to the image to be recognized into an end-to-end text recognition model, and obtain a probability distribution result of each character in the image to be recognized, where the probability distribution result is output by the end-to-end text recognition model; the probability distribution result comprises the position probability distribution and the text content probability distribution of the characters;
a third obtaining unit, configured to obtain a character detection result and a character recognition result of each character in the image to be recognized according to a probability distribution result of each character in the image to be recognized;
the end-to-end text recognition model is obtained by training according to any one of the end-to-end text recognition model training methods.
According to one or more embodiments of the present application, [ example seventeen ] there is provided an electronic device comprising:
one or more processors;
a storage device having one or more programs stored thereon,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the end-to-end text recognition model training method as described above, or the end-to-end text recognition method as described above.
According to one or more embodiments of the present application, example eighteen provides a computer readable medium, on which a computer program is stored, wherein the program, when executed by a processor, implements the end-to-end text recognition model training method as described in any of the above, or the end-to-end text recognition method as described above.
According to one or more embodiments of the present application, an example nineteenth provides a computer program product comprising computer programs/instructions which, when executed by a processor, implement the end-to-end text recognition model training method as described in any of the above, or the end-to-end text recognition method as described above.
It should be noted that, in the present specification, the embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments may be referred to each other. For the system or the device disclosed by the embodiment, the description is simple because the system or the device corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the method part for description.
It should be understood that in the present application, "at least one" means one or more, "a plurality" means two or more. "and/or" for describing an association relationship of associated objects, indicating that there may be three relationships, e.g., "a and/or B" may indicate: only A, only B and both A and B are present, wherein A and B may be singular or plural. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. "at least one of the following" or similar expressions refer to any combination of these items, including any combination of single item(s) or plural items. For example, at least one (one) of a, b, or c, may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", wherein a, b, c may be single or plural.
It is further noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (13)

1. A method for training an end-to-end text recognition model, the method comprising:
inputting a target text line image into a feature extraction module to obtain a target input feature vector;
acquiring a target character position vector corresponding to the target text line image, and inputting the target input feature vector and the target character position vector into a feature encoder to obtain a first feature vector;
acquiring a target output characteristic vector based on a label corresponding to the target text line image; the label corresponding to the target text line image comprises the real character position and the real text content of each character in the target text line image; the target output characteristic vector is formed by splicing a real position vector corresponding to each character in the target text line image and a real text content vector;
repeating the operation on the first characteristic vector to obtain a second characteristic vector; the dimension of the second feature vector is the same as the dimension of the target output feature vector;
inputting the second feature vector, the target output feature vector and the target character position vector into a feature decoder to obtain a prediction probability distribution result corresponding to the target text line image; the prediction probability distribution result comprises the prediction position probability distribution and the prediction text content probability distribution of each character in the target text line image;
obtaining a loss value according to the label corresponding to the target text line image and the prediction probability distribution result corresponding to the target text line image;
training the feature extraction module, the feature encoder and the feature decoder based on the loss value, repeatedly executing the step of inputting the target text line image into the feature extraction module, obtaining a target input feature vector and the subsequent steps until a preset condition is reached.
2. The method of claim 1, wherein obtaining a target output feature vector based on a tag corresponding to the target text line image comprises:
constructing a target dictionary, converting the labels corresponding to the target text line images into corresponding dictionary numerical values in the target dictionary based on the target dictionary, and acquiring numerical value vectors corresponding to the labels;
and converting the numerical value vector corresponding to the label into a vector of a high-dimensional space to obtain a target output characteristic vector.
3. The method of claim 1, wherein the repeating the operation on the first characteristic vector to obtain a second characteristic vector comprises:
and setting a repetition frequency parameter in a repeated operation function, inputting the first feature vector into the set repeated operation function, and acquiring a second feature vector.
4. The method of claim 1, wherein the feature encoder comprises a first multi-headed attention module and a first feed-forward network module; the inputting the target input feature vector and the target character position vector into a feature encoder to obtain a first feature vector includes:
splicing the target input feature vector and the target character position vector, inputting the spliced vector into the first multi-head attention module, and acquiring an output vector of the first multi-head attention module;
and inputting the output vector of the first multi-head attention module into the first feedforward network module to obtain a first feature vector output by the first feedforward network module.
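
One possible reading of the claim-4 encoder, sketched with standard PyTorch building blocks; the splice is implemented here as channel-wise concatenation, and all dimensions are assumptions made only so the example type-checks.

```python
import torch
import torch.nn as nn

class FeatureEncoder(nn.Module):
    """Illustrative encoder: one multi-head attention module followed by one feedforward module."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        # Attention operates on the spliced (concatenated) vector, hence 2 * dim channels.
        self.attn = nn.MultiheadAttention(2 * dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(2 * dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, 2 * dim))

    def forward(self, input_feat, char_pos_vec):
        # Splice the target input feature vector with the target character position vector.
        x = torch.cat([input_feat, char_pos_vec], dim=-1)
        attn_out, _ = self.attn(x, x, x)       # first multi-head attention module
        return self.ffn(attn_out)              # first feedforward network module -> first feature vector
```
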
5. The method of claim 1, wherein the feature decoder comprises a second multi-head attention module, a third multi-head attention module, and a second feedforward network module, the end-to-end text recognition model further comprising a regression module; the inputting the second feature vector, the target output feature vector and the target character position vector into a feature decoder to obtain a prediction probability distribution result corresponding to the target text line image includes:
splicing the target output feature vector and the target character position vector, inputting the spliced vector into the second multi-head attention module, and acquiring a third feature vector output by the second multi-head attention module;
inputting the second feature vector and the third feature vector into the third multi-head attention module to obtain a fourth feature vector output by the third multi-head attention module;
inputting the fourth feature vector into the second feedforward network module to obtain a fifth feature vector output by the second feedforward network module;
and inputting the fifth feature vector into the regression module, and acquiring a prediction probability distribution result corresponding to the target text line image output by the regression module.
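
Correspondingly, a sketch of the claim-5 decoder plus the regression module; the channel widths (the spliced query is assumed to have the same width as the second feature vector) and the two output head sizes are illustrative assumptions, not values taken from the claim.

```python
import torch
import torch.nn as nn

class FeatureDecoder(nn.Module):
    """Illustrative decoder: two multi-head attention modules, a feedforward module and a regression module."""
    def __init__(self, dim=512, heads=8, num_positions=128, vocab_size=6000):
        super().__init__()
        self.self_attn  = nn.MultiheadAttention(dim, heads, batch_first=True)  # second multi-head attention module
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)  # third multi-head attention module
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))  # second feedforward module
        self.pos_head  = nn.Linear(dim, num_positions)   # regression module: predicted position distribution
        self.text_head = nn.Linear(dim, vocab_size)      # regression module: predicted text-content distribution

    def forward(self, second_feat, target_out_feat, char_pos_vec):
        # Splice the target output feature vector with the character position vector,
        # then apply the second multi-head attention module -> third feature vector.
        q = torch.cat([target_out_feat, char_pos_vec], dim=-1)
        third_feat, _ = self.self_attn(q, q, q)
        # Second + third feature vectors into the third attention module -> fourth feature vector.
        fourth_feat, _ = self.cross_attn(third_feat, second_feat, second_feat)
        # Fourth feature vector through the feedforward module -> fifth feature vector.
        fifth_feat = self.ffn(fourth_feat)
        # Fifth feature vector into the regression module -> prediction distributions (as logits).
        return self.pos_head(fifth_feat), self.text_head(fifth_feat)
```
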
6. The method according to any one of claims 1 to 5, wherein the obtaining a loss value according to the label corresponding to the target text line image and the prediction probability distribution result corresponding to the target text line image comprises:
acquiring a real probability distribution result of the target text line image based on the label corresponding to the target text line image;
and acquiring a cross-entropy loss as the loss value, based on the real probability distribution result of the target text line image and the prediction probability distribution result corresponding to the target text line image.
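
The cross-entropy of claim 6 is the standard classification loss, taken once over the position distribution and once over the text-content distribution; the shapes below are illustrative.

```python
import torch
import torch.nn.functional as F

# Illustrative shapes: 2 text lines, 10 character slots, 128 position classes, 6000-character vocabulary.
pos_logits   = torch.randn(2, 10, 128)
text_logits  = torch.randn(2, 10, 6000)
pos_targets  = torch.randint(0, 128, (2, 10))    # real character positions from the label
text_targets = torch.randint(0, 6000, (2, 10))   # real text content from the label

# Cross-entropy between the real (label) distribution and the predicted distribution.
loss = (F.cross_entropy(pos_logits.transpose(1, 2), pos_targets)
        + F.cross_entropy(text_logits.transpose(1, 2), text_targets))
```
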
7. The method according to any one of claims 1 to 5, wherein the inputting the target text line image into the feature extraction module to obtain the target input feature vector comprises:
acquiring a target text line image, and performing a scaling operation and/or a filling operation on the target text line image to obtain a preprocessed target text line image;
and inputting the preprocessed target text line image into the feature extraction module to obtain a target input feature vector.
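
A common way to implement claim 7's scaling and filling (padding) operations; the target height, maximum width and fill colour are illustrative choices, not values given by the claim.

```python
from PIL import Image

def preprocess(img, target_h=32, max_w=512, fill=(255, 255, 255)):
    """Scale a text line image to a fixed height, then pad it to a fixed width."""
    w, h = img.size
    new_w = max(1, min(max_w, round(w * target_h / h)))
    img = img.resize((new_w, target_h))                   # scaling operation
    canvas = Image.new("RGB", (max_w, target_h), fill)    # filling operation
    canvas.paste(img, (0, 0))
    return canvas
```
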
8. A method for end-to-end text recognition, the method comprising:
acquiring a character position vector corresponding to an image to be recognized;
inputting the image to be recognized and the character position vector corresponding to the image to be recognized into an end-to-end text recognition model, and obtaining a probability distribution result of each character in the image to be recognized, which is output by the end-to-end text recognition model; the probability distribution result comprises the position probability distribution and the text content probability distribution of the characters;
acquiring a character detection result and a character recognition result of each character in the image to be recognized according to the probability distribution result of each character in the image to be recognized;
wherein the end-to-end text recognition model is trained according to the end-to-end text recognition model training method of any one of claims 1 to 7.
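
A minimal sketch of the claim-8 inference path, assuming the trained model returns position and text-content logits per character slot; taking the argmax of each distribution yields the detection and recognition results. The model interface is an illustrative assumption.

```python
import torch

@torch.no_grad()
def recognize(image, char_pos_vec, model):
    """Run the trained end-to-end model on an image to be recognized (illustrative)."""
    pos_logits, text_logits = model(image, char_pos_vec)
    pos_prob  = pos_logits.softmax(dim=-1)    # position probability distribution per character
    text_prob = text_logits.softmax(dim=-1)   # text-content probability distribution per character
    # Character detection result and character recognition result for each character slot.
    return pos_prob.argmax(dim=-1), text_prob.argmax(dim=-1)
```
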
9. An end-to-end text recognition model training apparatus, the apparatus comprising:
a first acquisition unit, used for inputting a target text line image into a feature extraction module and acquiring a target input feature vector;
a second acquisition unit, used for acquiring a target character position vector corresponding to the target text line image, inputting the target input feature vector and the target character position vector into a feature encoder, and acquiring a first feature vector;
a third acquisition unit, used for acquiring a target output feature vector based on a label corresponding to the target text line image; the label corresponding to the target text line image comprises a real character position and real text content of each character in the target text line image; the target output feature vector is formed by splicing the real position vector and the real text content vector corresponding to each character in the target text line image;
a fourth obtaining unit, configured to perform repeated operation on the first feature vector to obtain a second feature vector; the dimension of the second feature vector is the same as the dimension of the target output feature vector;
an input unit, used for inputting the second feature vector, the target output feature vector and the target character position vector into a feature decoder to obtain a prediction probability distribution result corresponding to the target text line image; the prediction probability distribution result comprises the prediction position probability distribution and the prediction text content probability distribution of each character in the target text line image;
a fifth obtaining unit, configured to obtain a loss value according to a label corresponding to the target text line image and a prediction probability distribution result corresponding to the target text line image;
and a training unit, used for training the feature extraction module, the feature encoder and the feature decoder based on the loss value, and repeatedly executing the step of inputting the target text line image into the feature extraction module to obtain a target input feature vector, and the subsequent steps, until a preset condition is reached.
10. An end-to-end text recognition apparatus, the apparatus comprising:
a first acquisition unit, used for acquiring a character position vector corresponding to an image to be recognized;
a second obtaining unit, configured to input the image to be recognized and a character position vector corresponding to the image to be recognized into an end-to-end text recognition model, and obtain a probability distribution result of each character in the image to be recognized, where the probability distribution result is output by the end-to-end text recognition model; the probability distribution result comprises the position probability distribution and the text content probability distribution of the characters;
a third obtaining unit, configured to obtain a character detection result and a character recognition result of each character in the image to be recognized according to a probability distribution result of each character in the image to be recognized;
wherein the end-to-end text recognition model is trained according to the end-to-end text recognition model training method of any one of claims 1 to 7.
11. An electronic device, comprising:
one or more processors;
a storage device having one or more programs stored thereon,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the end-to-end text recognition model training method of any one of claims 1 to 7, or the end-to-end text recognition method of claim 8.
12. A computer-readable medium, on which a computer program is stored, wherein the program, when executed by a processor, implements the end-to-end text recognition model training method of any one of claims 1-7, or the end-to-end text recognition method of claim 8.
13. A computer program product, characterized in that the computer program product comprises a computer program or instructions which, when executed by a processor, implement the end-to-end text recognition model training method of any one of claims 1 to 7, or the end-to-end text recognition method of claim 8.
CN202210704167.0A 2022-06-21 2022-06-21 End-to-end text recognition model training method, text recognition method and text recognition device Pending CN115082937A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210704167.0A CN115082937A (en) 2022-06-21 2022-06-21 End-to-end text recognition model training method, text recognition method and text recognition device

Publications (1)

Publication Number Publication Date
CN115082937A 2022-09-20

Family

ID=83253254

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210704167.0A Pending CN115082937A (en) 2022-06-21 2022-06-21 End-to-end text recognition model training method, text recognition method and text recognition device

Country Status (1)

Country Link
CN (1) CN115082937A (en)

Similar Documents

Publication Publication Date Title
WO2023138314A1 (en) Object attribute recognition method and apparatus, readable storage medium, and electronic device
WO2023159763A1 (en) Model training method and apparatus, text summary generating method and apparatus, and device
CN110659639B (en) Chinese character recognition method and device, computer readable medium and electronic equipment
CN116978011B (en) Image semantic communication method and system for intelligent target recognition
CN114510939A (en) Entity relationship extraction method and device, electronic equipment and storage medium
CN116050496A (en) Determination method and device, medium and equipment of picture description information generation model
CN114612921B (en) Form recognition method and device, electronic equipment and computer readable medium
CN113377914A (en) Recommended text generation method and device, electronic equipment and computer readable medium
CN111898338B (en) Text generation method and device and electronic equipment
CN111368551A (en) Method and device for determining event subject
CN114445813A (en) Character recognition method, device, equipment and medium
CN115578570A (en) Image processing method, device, readable medium and electronic equipment
CN114463769A (en) Form recognition method and device, readable medium and electronic equipment
CN113111971A (en) Intelligent processing method and device for classification model, electronic equipment and medium
CN114067327A (en) Text recognition method and device, readable medium and electronic equipment
CN117874234A (en) Text classification method and device based on semantics, computer equipment and storage medium
CN111475635B (en) Semantic completion method and device and electronic equipment
CN110674813B (en) Chinese character recognition method and device, computer readable medium and electronic equipment
CN116416645A (en) Attribute and image cross-mode pedestrian re-identification method and device based on dual-branch Transformer network
CN113986958B (en) Text information conversion method and device, readable medium and electronic equipment
CN115984868A (en) Text processing method, device, medium and equipment
CN115082937A (en) End-to-end text recognition model training method, text recognition method and text recognition device
CN113283241B (en) Text recognition method and device, electronic equipment and computer readable storage medium
CN114495081A (en) Text recognition method and device, readable medium and electronic equipment
CN114429629A (en) Image processing method and device, readable storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination