CN114821597A - Text recognition method and device, storage medium and electronic equipment - Google Patents

Text recognition method and device, storage medium and electronic equipment

Info

Publication number
CN114821597A
CN114821597A
Authority
CN
China
Prior art keywords
text
model
sample
image
target
Prior art date
Legal status
Pending
Application number
CN202210476246.0A
Other languages
Chinese (zh)
Inventor
尹成浩
Current Assignee
Beijing Zhitong Oriental Software Technology Co., Ltd.
Original Assignee
Beijing Zhitong Oriental Software Technology Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Beijing Zhitong Oriental Software Technology Co., Ltd.
Priority to CN202210476246.0A
Publication of CN114821597A
Legal status: Pending


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/24 - Classification techniques
    • G06F 18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/044 - Recurrent networks, e.g. Hopfield networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Character Discrimination (AREA)

Abstract

The present disclosure relates to a text recognition method and apparatus, a storage medium, and an electronic device, in the technical field of image processing. The method includes: acquiring a text image to be recognized; and using the text image as the input of a pre-trained text recognition model to obtain a target recognition text, corresponding to the text image, output by the text recognition model. The text recognition model is obtained by training a preset training model through a first target loss function and a second target loss function. The first target loss function is obtained according to a target sample image corresponding to a sample image and a first sample recognition text corresponding to the target sample image; the second target loss function is obtained according to the sample image, a second sample recognition text corresponding to the sample image, and a weight coefficient of each character in the second sample recognition text. The target sample image is an image obtained by processing the sample image according to one or more preset processing modes.

Description

Text recognition method and device, storage medium and electronic equipment
Technical Field
The present disclosure relates to the field of image processing technologies, and in particular, to a method and an apparatus for text recognition, a storage medium, and an electronic device.
Background
In some application scenarios, the text image to be recognized often contains printed text and/or handwritten text. Printed characters follow standard font specifications, and the number of font types is limited. Handwritten text, by contrast, varies with each writer's habits: ligatures, deformation, corrections, and other irregularities mean that the same character can be written very differently by different people.
Text recognition methods in the related art therefore recognize printed text well but recognize handwritten text poorly.
Disclosure of Invention
To overcome the problems in the related art, the present disclosure provides a method, an apparatus, a storage medium, and an electronic device for text recognition.
In a first aspect, the present disclosure provides a text recognition method, the method comprising: acquiring a text image to be recognized; and using the text image as the input of a pre-trained text recognition model to obtain a target recognition text corresponding to the text image output by the text recognition model; wherein the text recognition model is obtained by training a preset training model through a first target loss function and a second target loss function; the first target loss function is obtained according to a target sample image corresponding to a sample image and a first sample recognition text corresponding to the target sample image; the second target loss function is obtained according to the sample image, a second sample recognition text corresponding to the sample image, and a weight coefficient of each character in the second sample recognition text; the target sample image is an image obtained by processing the sample image according to one or more preset processing modes, and different preset processing modes correspond to different target sample images.
Optionally, the text recognition model comprises an image preprocessing model, a feature extraction model, a sequence model, and a decoding model, wherein the output end of the image preprocessing model is coupled with the input end of the feature extraction model, the output end of the feature extraction model is coupled with the input end of the sequence model, and the output end of the sequence model is coupled with the input end of the decoding model; the step of using the text image as an input of a pre-trained text recognition model to obtain a target recognition text corresponding to the text image output by the text recognition model includes: inputting the text image into the image preprocessing model, carrying out grayscale processing on the text image through the image preprocessing model to obtain a grayscale image with a fixed size corresponding to the text image, and converting the grayscale image into a grayscale image matrix to obtain the grayscale image matrix output by the image preprocessing model; inputting the grayscale image matrix into the feature extraction model, and performing down-sampling processing on the grayscale image matrix through the feature extraction model to obtain a vector sequence output by the feature extraction model; inputting the vector sequence into the sequence model to obtain sequence features output by the sequence model; and inputting the sequence features into the decoding model to obtain a target recognition text corresponding to the text image output by the decoding model.
Optionally, the text recognition model is trained by: acquiring a sample image and a second sample recognition text corresponding to the sample image; respectively processing the sample image according to one or more preset processing modes to obtain the target sample images; obtaining the first target loss function according to the target sample image and a first sample recognition text corresponding to the target sample image; training the preset training model through the first target loss function to obtain a to-be-determined recognition model; obtaining the second target loss function according to the sample image, the second sample recognition text corresponding to the sample image, and the weight coefficient of each character in the second sample recognition text; and training the to-be-determined recognition model through the second target loss function to obtain the text recognition model.
Optionally, the obtaining the first target loss function according to the target sample image and the first sample recognition text corresponding to the target sample image includes: determining, according to the first sample recognition text, a first probability value that each character in the first sample text is correctly recognized, where the first sample text is obtained after the target sample image is input into the preset training model; determining, according to the first probability value, a second probability value that the target sample image is correctly recognized; determining, according to the second probability value, a first target probability value that the preset training model recognizes correctly; and determining the first target loss function according to the first target probability value.
Optionally, the training the preset training model through the first target loss function to obtain the to-be-determined recognition model includes: circularly executing a first model training step until the trained preset training model is determined to meet a first preset convergence condition according to the first target loss function, and taking the trained preset training model as the to-be-determined recognition model; the first model training step comprises: inputting a plurality of target sample images into the preset training model to obtain the first sample text corresponding to each target sample image output by the preset training model; determining a first loss value of the first sample text and the first sample recognition text according to the first target loss function, wherein the first loss value is used to characterize the degree of difference between the first sample text and the first sample recognition text; and under the condition that it is determined according to the first loss value that the trained preset training model does not meet the first preset convergence condition, updating the parameters of the preset training model according to the first loss value to obtain the trained preset training model, and taking the trained preset training model as a new preset training model.
Optionally, the obtaining the second target loss function according to the sample image, the second sample recognition text corresponding to the sample image, and the weight coefficient of each character in the second sample recognition text includes: determining, according to the second sample recognition text, a third probability value that each character in the second sample text is correctly recognized, where the second sample text is obtained after the sample image is input into the to-be-determined recognition model; determining, according to the third probability value, a fourth probability value that the sample image is correctly recognized; acquiring the weight coefficient of each character in the second sample recognition text; determining, according to the fourth probability value and the weight coefficient, a second target probability value that the to-be-determined recognition model recognizes correctly; and determining the second target loss function according to the second target probability value.
Optionally, the obtaining a weight coefficient of each character in the second sample recognition text includes: acquiring the frequency value of each character in the second sample recognition text; obtaining the highest frequency value and the lowest frequency value among the frequency values; and determining the weight coefficient according to the frequency value, the highest frequency value, and the lowest frequency value.
Optionally, the training the to-be-determined recognition model through the second target loss function to obtain the text recognition model includes: circularly executing a second model training step until the trained to-be-determined recognition model is determined to meet a second preset convergence condition according to the second target loss function, and taking the trained to-be-determined recognition model as the text recognition model; the second model training step comprises: inputting a plurality of sample images into the to-be-determined recognition model to obtain the second sample text corresponding to each sample image output by the to-be-determined recognition model; determining a second loss value of the second sample text and the second sample recognition text according to the second target loss function, wherein the second loss value is used to characterize the degree of difference between the second sample text and the second sample recognition text; and under the condition that it is determined according to the second loss value that the trained to-be-determined recognition model does not meet the second preset convergence condition, updating the parameters of the to-be-determined recognition model according to the second loss value to obtain the trained to-be-determined recognition model, and taking the trained to-be-determined recognition model as a new to-be-determined recognition model.
Optionally, the updating the parameters of the to-be-determined recognition model according to the second loss value includes: updating the parameters of the sequence model and the parameters of the decoding model in the to-be-determined recognition model according to the second loss value.
In a second aspect, the present disclosure provides a text recognition apparatus, the apparatus comprising: an image acquisition module, configured to acquire a text image to be recognized; and a text acquisition module, configured to use the text image as the input of a pre-trained text recognition model to obtain a target recognition text corresponding to the text image output by the text recognition model; wherein the text recognition model is obtained by training a preset training model through a first target loss function and a second target loss function; the first target loss function is obtained according to a target sample image corresponding to a sample image and a first sample recognition text corresponding to the target sample image; the second target loss function is obtained according to the sample image, a second sample recognition text corresponding to the sample image, and a weight coefficient of each character in the second sample recognition text; the target sample image is an image obtained by processing the sample image according to one or more preset processing modes, and different preset processing modes correspond to different target sample images.
Optionally, the text recognition model comprises an image preprocessing model, a feature extraction model, a sequence model, and a decoding model, wherein the output end of the image preprocessing model is coupled with the input end of the feature extraction model, the output end of the feature extraction model is coupled with the input end of the sequence model, and the output end of the sequence model is coupled with the input end of the decoding model; the step of using the text image as an input of a pre-trained text recognition model to obtain a target recognition text corresponding to the text image output by the text recognition model includes: inputting the text image into the image preprocessing model, carrying out grayscale processing on the text image through the image preprocessing model to obtain a grayscale image with a fixed size corresponding to the text image, and converting the grayscale image into a grayscale image matrix to obtain the grayscale image matrix output by the image preprocessing model; inputting the grayscale image matrix into the feature extraction model, and performing down-sampling processing on the grayscale image matrix through the feature extraction model to obtain a vector sequence output by the feature extraction model; inputting the vector sequence into the sequence model to obtain sequence features output by the sequence model; and inputting the sequence features into the decoding model to obtain a target recognition text corresponding to the text image output by the decoding model.
Optionally, the text recognition model is trained by: acquiring a sample image and a second sample recognition text corresponding to the sample image; respectively processing the sample image according to one or more preset processing modes to obtain the target sample images; obtaining the first target loss function according to the target sample image and a first sample recognition text corresponding to the target sample image; training the preset training model through the first target loss function to obtain a to-be-determined recognition model; obtaining the second target loss function according to the sample image, the second sample recognition text corresponding to the sample image, and the weight coefficient of each character in the second sample recognition text; and training the to-be-determined recognition model through the second target loss function to obtain the text recognition model.
Optionally, the obtaining the first target loss function according to the target sample image and the first sample recognition text corresponding to the target sample image includes: determining, according to the first sample recognition text, a first probability value that each character in the first sample text is correctly recognized, where the first sample text is obtained after the target sample image is input into the preset training model; determining, according to the first probability value, a second probability value that the target sample image is correctly recognized; determining, according to the second probability value, a first target probability value that the preset training model recognizes correctly; and determining the first target loss function according to the first target probability value.
Optionally, the training the preset training model through the first target loss function to obtain the to-be-determined recognition model includes: circularly executing a first model training step until the trained preset training model is determined to meet a first preset convergence condition according to the first target loss function, and taking the trained preset training model as the to-be-determined recognition model; the first model training step comprises: inputting a plurality of target sample images into the preset training model to obtain the first sample text corresponding to each target sample image output by the preset training model; determining a first loss value of the first sample text and the first sample recognition text according to the first target loss function, wherein the first loss value is used to characterize the degree of difference between the first sample text and the first sample recognition text; and under the condition that it is determined according to the first loss value that the trained preset training model does not meet the first preset convergence condition, updating the parameters of the preset training model according to the first loss value to obtain the trained preset training model, and taking the trained preset training model as a new preset training model.
Optionally, the obtaining the second target loss function according to the sample image, the second sample recognition text corresponding to the sample image, and the weight coefficient of each character in the second sample recognition text includes: determining, according to the second sample recognition text, a third probability value that each character in the second sample text is correctly recognized, where the second sample text is obtained after the sample image is input into the to-be-determined recognition model; determining, according to the third probability value, a fourth probability value that the sample image is correctly recognized; acquiring the weight coefficient of each character in the second sample recognition text; determining, according to the fourth probability value and the weight coefficient, a second target probability value that the to-be-determined recognition model recognizes correctly; and determining the second target loss function according to the second target probability value.
Optionally, the obtaining a weight coefficient of each character in the second sample recognition text includes: acquiring the frequency value of each character in the second sample recognition text; obtaining the highest frequency value and the lowest frequency value among the frequency values; and determining the weight coefficient according to the frequency value, the highest frequency value, and the lowest frequency value.
Optionally, the training the to-be-determined recognition model through the second target loss function to obtain the text recognition model includes: circularly executing a second model training step until the trained to-be-determined recognition model is determined to meet a second preset convergence condition according to the second target loss function, and taking the trained to-be-determined recognition model as the text recognition model; the second model training step comprises: inputting a plurality of sample images into the to-be-determined recognition model to obtain the second sample text corresponding to each sample image output by the to-be-determined recognition model; determining a second loss value of the second sample text and the second sample recognition text according to the second target loss function, wherein the second loss value is used to characterize the degree of difference between the second sample text and the second sample recognition text; and under the condition that it is determined according to the second loss value that the trained to-be-determined recognition model does not meet the second preset convergence condition, updating the parameters of the to-be-determined recognition model according to the second loss value to obtain the trained to-be-determined recognition model, and taking the trained to-be-determined recognition model as a new to-be-determined recognition model.
Optionally, the updating the parameters of the to-be-determined recognition model according to the second loss value includes: updating the parameters of the sequence model and the parameters of the decoding model in the to-be-determined recognition model according to the second loss value.
In a third aspect, the present disclosure provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the method of the first aspect of the present disclosure.
In a fourth aspect, the present disclosure provides an electronic device comprising: a memory having a computer program stored thereon; a processor for executing the computer program in the memory to implement the steps of the method of the first aspect of the disclosure.
According to the technical solution, a text image to be recognized is first acquired; the text image is then used as the input of a pre-trained text recognition model to obtain the target recognition text, corresponding to the text image, output by the text recognition model. The text recognition model is obtained by training a preset training model through a first target loss function and a second target loss function; the first target loss function is obtained according to a target sample image corresponding to a sample image and a first sample recognition text corresponding to the target sample image; the second target loss function is obtained according to the sample image, a second sample recognition text corresponding to the sample image, and the weight coefficient of each character in the second sample recognition text; the target sample image is an image obtained by processing the sample image according to one or more preset processing modes, and different preset processing modes correspond to different target sample images. In this way, the sample image is processed in one or more preset processing modes to obtain target sample images, effectively expanding the sample set. The first target loss function is obtained according to the target sample image and the first sample recognition text, and the second target loss function is obtained according to the sample image, the second sample recognition text, and the weight coefficient of each character in the second sample recognition text. The preset training model is then trained with the two loss functions to obtain the text recognition model. Because the second target loss function incorporates the weight coefficient of each character, the text recognition model recognizes low-frequency characters more accurately, which ultimately improves its overall recognition accuracy.
Additional features and advantages of the disclosure will be set forth in the detailed description which follows.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure without limiting the disclosure. In the drawings:
FIG. 1 is a flow diagram illustrating a method of text recognition in accordance with an exemplary embodiment;
FIG. 2 is a flow diagram illustrating another method of text recognition in accordance with an exemplary embodiment;
FIG. 3 is a schematic diagram illustrating a network configuration table of a feature extraction model in accordance with an illustrative embodiment;
FIG. 4 is a schematic diagram illustrating a CTC algorithm process flow, according to an exemplary embodiment;
FIG. 5 is a diagram illustrating a text recognition result in accordance with an illustrative embodiment;
FIG. 6 is a flow diagram illustrating a method of training a text recognition model in accordance with an exemplary embodiment;
FIG. 7 is a schematic diagram illustrating an image process according to an exemplary embodiment;
FIG. 8 is a schematic diagram illustrating another image processing in accordance with an exemplary embodiment;
FIG. 9 is a flow diagram illustrating another method of training a text recognition model in accordance with an exemplary embodiment;
FIG. 10 is a flow diagram illustrating another method of training a text recognition model in accordance with an illustrative embodiment;
FIG. 11 is a block diagram illustrating an apparatus for text recognition in accordance with an exemplary embodiment;
FIG. 12 is a block diagram illustrating an electronic device in accordance with an example embodiment.
Detailed Description
The following detailed description of specific embodiments of the present disclosure is provided in connection with the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating the present disclosure, are given by way of illustration and explanation only, not limitation.
It should be noted that all actions of acquiring signals, information or data in the present disclosure are performed under the premise of complying with the corresponding data protection regulation policy of the country of the location and obtaining the authorization given by the owner of the corresponding device.
In the description that follows, the terms "first," "second," and the like are used for descriptive purposes only and should not be construed as indicating or implying relative importance or order.
Before describing the text recognition method, apparatus, storage medium, and electronic device provided by the present disclosure, an application scenario related to the embodiments of the present disclosure is first described. With the advent of the intelligent era, handwritten text recognition scenarios are increasingly common: composition correction, automatic test paper grading, and test paper entry in education, as well as handwritten bill recognition, card recognition, and signature recognition in business, all require recognizing handwritten text images. At present, common text recognition models can usually recognize printed text accurately, since printed characters have standard font specifications and a limited number of font types. Handwritten text, by contrast, exhibits ligatures, deformation, corrections, and other irregularities owing to individual writing habits. Therefore, common text recognition models often cannot accurately recognize handwritten text.
In addition, most current mainstream academic research on handwritten text recognition targets English characters. Compared with Chinese characters, English has few character types, individual characters have few strokes, and the character distribution is relatively uniform, so neural networks learn it more easily. Chinese has nearly two thousand commonly used characters, and their distribution is often severely unbalanced: high-frequency characters account for only a small part of all character categories yet appear hundreds or even thousands of times more often than low-frequency characters. Meanwhile, because writing habits differ from person to person, the same character can be written very differently, and large amounts of sample data are difficult to obtain. With limited samples for the neural network to learn from, the trained model tends toward recognizing high-frequency characters, and the accuracy of low-frequency character recognition is low.
To solve the above problems, the present disclosure provides a text recognition method and apparatus, a storage medium, and an electronic device. A sample image is processed in one or more preset processing modes to obtain target sample images, effectively expanding the sample set. A first target loss function is then obtained according to the target sample image and the first sample recognition text, and a second target loss function is obtained according to the sample image, the second sample recognition text, and the weight coefficient of each character in the second sample recognition text. A preset training model is trained with the first and second target loss functions to obtain the text recognition model. Because the second target loss function incorporates the weight coefficient of each character, the text recognition model recognizes low-frequency characters more accurately, which ultimately improves the overall recognition accuracy of the text recognition model.
Specific embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings.
FIG. 1 illustrates a method of text recognition, which, as shown in FIG. 1, may include the steps of:
s101, acquiring a text image to be recognized.
The text image to be recognized may be an image containing handwritten text, for example, an image obtained by photographing or scanning a test paper, homework, a ticket, or the like.
And S102, taking the text image as an input of a pre-trained text recognition model to obtain a target recognition text corresponding to the text image output by the text recognition model.
The text recognition model is obtained by training a preset training model through a first target loss function and a second target loss function, the first target loss function is obtained according to a target sample image corresponding to a sample image and a first sample recognition text corresponding to the target sample image, the second target loss function is obtained according to the sample image, a second sample recognition text corresponding to the sample image and a weight coefficient of each character in the second sample recognition text, the target sample image is an image obtained by processing the sample image according to one or more preset processing modes, and different preset processing modes correspond to different target sample images.
With this method, the sample image is processed in one or more preset processing modes to obtain target sample images, effectively expanding the sample set. The first target loss function is obtained according to the target sample image and the first sample recognition text, and the second target loss function is obtained according to the sample image, the second sample recognition text, and the weight coefficient of each character in the second sample recognition text. The preset training model is then trained with the first and second target loss functions to obtain the text recognition model. Because the second target loss function incorporates the weight coefficient of each character, the text recognition model recognizes low-frequency characters more accurately, which ultimately improves the overall recognition accuracy of the text recognition model.
In some embodiments, the text recognition model may include an image preprocessing model, a feature extraction model, a sequence model, and a decoding model, where the output of the image preprocessing model is coupled to the input of the feature extraction model, the output of the feature extraction model is coupled to the input of the sequence model, and the output of the sequence model is coupled to the input of the decoding model. As shown in fig. 2, step S102 of using the text image as the input of a pre-trained text recognition model to obtain the target recognition text corresponding to the text image output by the text recognition model may include the following steps:
and S1021, inputting the text image into the image preprocessing model, performing gray processing on the text image through the image preprocessing model to obtain a gray image with a fixed size corresponding to the text image, and converting the gray image into a gray image matrix to obtain a gray image rectangle output by the image preprocessing model.
In this step, the text image may be subjected to grayscale processing by the image preprocessing model to obtain a fixed-size grayscale image corresponding to the text image, where the fixed size refers to the pixel size of the image, for example 800 × 32. The grayscale image is further converted into a grayscale image matrix so that the subsequent feature extraction model can extract the image feature information of the text image.
In addition, if the image width after grayscale processing is smaller than the width in the fixed size, the grayscale image can be zero-padded on the right side until it reaches the fixed size.
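As an illustration of this preprocessing stage, the following is a minimal sketch; the 800 × 32 size follows the example above, while the use of OpenCV/NumPy and all names are assumptions for illustration, not the patent's implementation:

```python
import cv2
import numpy as np

def preprocess(image_bgr, target_w=800, target_h=32):
    # Grayscale, resize to the fixed height while keeping the aspect ratio,
    # zero-pad on the right up to the fixed width, and return the matrix.
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    h, w = gray.shape
    new_w = min(target_w, max(1, round(w * target_h / h)))
    gray = cv2.resize(gray, (new_w, target_h))
    if new_w < target_w:
        gray = np.pad(gray, ((0, 0), (0, target_w - new_w)))  # right-side zero padding
    return gray.astype(np.float32) / 255.0  # the grayscale image matrix
```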
And S1022, inputting the grayscale image matrix into the feature extraction model, and performing downsampling processing on the grayscale image matrix through the feature extraction model to obtain a vector sequence output by the feature extraction model.
The feature extraction model may be, for example and without limitation, a 46-layer neural network built based on a ResNet residual structure.
Illustratively, fig. 3 shows a network configuration table of a feature extraction model, which may contain 7 residual block structures and 4 maxpool (maximum pooling) layers. Each of blocks 1 through 4 is followed by a maxpool layer (pool1, pool2, pool3, and pool4 in fig. 3) that downsamples the grayscale image matrix. The multiplier after each block denotes the number of blocks; for example, block1 (×3) denotes three block1 structures connected in series. The structural parameters of each block (the second column of the table in fig. 3) denote the convolution kernel size, the number of convolution kernels, and the convolution stride, respectively. For example, the structural parameters of the block1 block (3 × 3 conv, 32, s = (1, 1)) indicate that the convolution kernels are 3 × 3 in size and 32 in number, and that the convolution stride is 1 in both the vertical and horizontal directions. The structural parameters of each maxpool layer denote the pooling filter size and the pooling stride. For example, the structural parameters of the pool1 layer (k = 2 × 2, s = (2, 2)) indicate that the pooling filter size is 2 × 2 and the pooling stride is 2 in both the vertical and horizontal directions. The pooling operation of each maxpool layer keeps the maximum feature value within each pooled region, thereby preserving the salient features of the image.
As can be seen from fig. 3, the width × height of the text image input to the feature extraction model is 800 × 32, and the width × height of the output feature is 100 × 1, meaning the height is downsampled by a factor of 32 and the width by a factor of 8. One pixel along the width of the downsampled feature therefore corresponds to 8 pixels of width in the original text image, which roughly matches the width range of a single handwritten Chinese character; the feature at each position thus covers the image information of 8 pixel widths at the corresponding position of the original image. Meanwhile, the feature extraction model converts the image features into sequence form (i.e., a vector sequence) so that the sequence model connected after it can extract features further. The foregoing examples are illustrative only, and the disclosure is not limited thereto.
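A simplified sketch of a feature extractor with the same downsampling behavior (height ÷ 32, width ÷ 8, then reshaping into a vector sequence) may help. This is a hedged illustration in PyTorch, not the 46-layer configuration of fig. 3; the layer and channel counts are assumptions:

```python
import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    # Downsamples height by 32 and width by 8, then flattens the height
    # dimension to produce a vector sequence, as described above.
    def __init__(self, channels=256):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, 3, stride=2, padding=1), nn.ReLU(),    # 16 x 400
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2, 2),                                     # 8 x 200
            nn.Conv2d(128, channels, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2, 2),                                     # 4 x 100
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1), (2, 1)),                           # 2 x 100
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1), (2, 1)),                           # 1 x 100
        )

    def forward(self, x):            # x: (batch, 1, 32, 800)
        f = self.cnn(x)              # (batch, C, 1, 100)
        f = f.squeeze(2)             # (batch, C, 100)
        return f.permute(2, 0, 1)    # vector sequence: (100, batch, C)
```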
And S1023, inputting the vector sequence into the sequence model to obtain the sequence characteristics output by the sequence model.
The sequence model may be, for example but not limited to, an RNN (Recurrent Neural Network) structure, or a two-layer bidirectional LSTM (Long Short-Term Memory) network structure, trained by model training methods in the prior art, which are not described here again. The sequence model captures the contextual dependence between characters; that is, the recognition result for a character depends not only on the characters before it but also on the characters after it, so characters in the text image can be predicted from context information.
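A hedged sketch of such a sequence model follows (a two-layer bidirectional LSTM; the hidden size and input shape are assumed example values):

```python
import torch
import torch.nn as nn

# Two-layer bidirectional LSTM over the vector sequence from the feature
# extractor; input shape (seq_len=100, batch, channels).
seq_model = nn.LSTM(input_size=256, hidden_size=256,
                    num_layers=2, bidirectional=True)
vector_sequence = torch.zeros(100, 4, 256)           # placeholder input
sequence_features, _ = seq_model(vector_sequence)    # (100, 4, 512)
```

Because the LSTM is bidirectional, the feature at each position depends on the characters both before and after it, matching the context dependence described above.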
And S1024, inputting the sequence characteristics into the decoding model to obtain a target recognition text corresponding to the text image output by the decoding model.
The decoding model can be implemented based on, for example but not limited to, the CTC (Connectionist Temporal Classification) algorithm. The decoding model merges identical characters through the character merging rules of the CTC algorithm to form the final recognition result. Suppose the real text label corresponding to the text image to be recognized is "hot alarm on street" and the sequence length of the decoding model is 12. Processing based on the CTC algorithm, as shown in fig. 4, may yield several possible outputs such as "street & up & hot alarm & alarm" or "street & up & hot & alarm", where the marker "&" separates different characters so that repeated characters can be merged later. Repeated characters in the possible recognition results are then merged to obtain "street & up & hot & alarm", and the marker "&" is removed to obtain the final recognition result, "street hot alarm".
For example, if the sequence feature width output by the sequence model is 100, then by the example above each position corresponds to the width of one character. In some embodiments, a fully connected layer may be connected afterwards to classify the character at each position, giving 100 character predictions in total. If the output length is not aligned with the input, the output characters can be aligned automatically through the alignment rules preset by the CTC algorithm, so that a reasonable correspondence exists between input and output; this addresses the alignment and labeling problem for images.
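The merging-and-blank-removal behavior can be illustrated with a small greedy-decoding sketch (pure Python, using "&" as the blank marker as in the example above; this is an illustration, not the patent's exact decoder):

```python
def ctc_greedy_decode(per_step_chars, blank="&"):
    # Merge adjacent repeated characters, then drop the blank marker.
    out, prev = [], None
    for c in per_step_chars:
        if c != prev and c != blank:
            out.append(c)
        prev = c
    return "".join(out)

# e.g. ["h", "h", "&", "e", "l", "&", "l", "o"] decodes to "hello":
print(ctc_greedy_decode(["h", "h", "&", "e", "l", "&", "l", "o"]))
```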
As shown in fig. 5, four different text images are input into the text recognition model, and the target recognition text output by the model is shown below each text image. It can be seen that the text recognition model accurately recognizes the text in text images containing handwritten text.
FIG. 6 is a flowchart illustrating a method for training a text recognition model, according to an example embodiment, which may be trained in the following manner, as shown in FIG. 6:
and S1, acquiring a sample image and a second sample identification text corresponding to the sample image.
The sample image is an image containing handwritten text, and the second sample recognition text is a real text label corresponding to the handwritten text in the sample image. Illustratively, the second sample recognition text corresponding to the sample image shown in fig. 7 is "martial arts and plains, as early as the nineties of the last century, our country proposed to cancel agriculture, non-agriculture".
For example, the sample image may be obtained from a predetermined database, such as, but not limited to, a CASIA offline handwritten Chinese dataset that includes a CASIA-HWDB2.0-2.2 text dataset and a CASIA-HWDB1.0-1.2 single-character dataset.
And S2, processing the sample images according to one or more preset processing modes to obtain the target sample images.
Considering that the number of sample images is limited, and that differing writing habits mean the same character cannot be covered by a sufficient number of sample images, in this embodiment the sample images can be processed according to one or more preset processing modes to obtain target sample images, thereby expanding the number of samples.
The preset processing mode may include the following two modes:
the first method is as follows: and (4) an image enhancement processing mode.
Illustratively, the image enhancement processing may include one or more of: image blurring, image noise, image deformation, and background addition. Fig. 7 shows, from top to bottom, the sample image and the images obtained by applying blurring, noise, deformation, and background addition to it.
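A hedged sketch of these enhancement modes (OpenCV/NumPy, assuming an 8-bit grayscale input; the parameter values are illustrative assumptions):

```python
import cv2
import numpy as np

def augment(gray):
    # Blurring, noise, slight deformation, and a textured background,
    # mirroring the four enhancement modes shown in fig. 7.
    blurred = cv2.GaussianBlur(gray, (5, 5), 0)
    noisy = np.clip(gray.astype(np.float32) + np.random.normal(0, 10, gray.shape),
                    0, 255).astype(gray.dtype)
    h, w = gray.shape
    m = cv2.getRotationMatrix2D((w / 2, h / 2), 3, 1.0)
    deformed = cv2.warpAffine(gray, m, (w, h), borderValue=255)
    texture = np.random.randint(200, 256, gray.shape, dtype=np.uint8)
    with_background = np.minimum(gray, texture)  # dark strokes are preserved
    return blurred, noisy, deformed, with_background
```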
The second mode: synthesizing sentences containing low-frequency, rarely used characters.
Illustratively, first, corpora in a preset corpus may be obtained. The preset corpus can be pre-established and include multiple databases of different corpora. To let the text recognition model learn more information about low-frequency characters, the corpora can be, for example, sentences containing low-frequency characters, which increases the number of low-frequency characters in the samples and improves their recognition accuracy.
Secondly, for each character in the corpus, a target single-character image can be obtained from the one or more single-character images corresponding to that character in a preset single-character database corresponding to the sample image. The preset single-character database may be, for example but not limited to, the CASIA-HWDB1.0-1.2 single-character data set. The one or more single-character images corresponding to each character are images of that character written in different ways, and each target single-character image may be randomly extracted from among them.
Finally, the target single-character images can be stitched together in the order in which the characters appear in the corpus to obtain the target sample image corresponding to the corpus, effectively expanding the number of samples. Illustratively, fig. 8 shows four target sample images obtained by stitching the acquired target single-character images according to four different corpora. Because each target single-character image is randomly acquired, a target sample image may contain characters written in many different ways, so the text recognition model can learn enough image feature information, improving recognition accuracy.
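A hedged sketch of this synthesis step (the `single_char_db` mapping, a dict from character to a list of equally tall grayscale arrays, is an assumption for illustration):

```python
import random
import numpy as np

def synthesize_sample(corpus_sentence, single_char_db):
    # For each character, randomly pick one of its single-character images
    # (each a different writing style) and stitch them in sentence order.
    glyphs = [random.choice(single_char_db[ch]) for ch in corpus_sentence]
    return np.hstack(glyphs)  # the target sample image for this corpus
```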
Obtaining target sample images by processing the sample image according to the one or more preset processing modes effectively expands the limited number of samples, which greatly helps improve the accuracy and robustness of the text recognition model.
And S3, obtaining the first target loss function according to the target sample image and the first sample recognition text corresponding to the target sample image.
The first sample recognition text is the real text label corresponding to the handwritten text in the target sample image. The first target loss function is obtained from the target sample image and the first sample recognition text. Since the target sample image is produced by processing the sample image in one or more preset processing modes, i.e., by expanding the sample image, it carries more image feature information than the sample image. Because the image feature information is mainly extracted by the feature extraction model, training the preset training model with the first target loss function mainly trains the feature extraction part, so that the feature extraction model can learn sufficient image feature information.
For example, the decoding model may produce multiple candidate recognition results, such as "& street & up & hot & alarm" or "street & up & hot & alarm", and the first sample text corresponding to the target sample image is obtained after these results are processed by the CTC algorithm. The higher the probability that the recognition result is correct, the closer the recognition result of the current target sample image (i.e., the first sample text) is to the first sample recognition text, and the higher the recognition accuracy of the text recognition model. Therefore, as shown in fig. 9, obtaining the first target loss function according to the target sample image and the first sample recognition text in step S3 may include the following steps:
s31, according to the first sample recognition text, a first probability value that each character in the first sample text is correct for recognition is determined.
And the first sample text is obtained after the target sample image is input into the preset training model.
And S32, determining a second probability value that the target sample image is correctly identified according to the first probability value.
For example, the first probability values that the individual characters are correctly recognized may be multiplied together to obtain the second probability value that the target sample image is correctly recognized.
And S33, determining, according to the second probability value, a first target probability value that the preset training model recognizes correctly.
In this step, the second probability values that the multiple target sample images are correctly recognized may be added to obtain the first target probability value that the preset training model recognizes correctly.
For example, the correct first target probability value identified by the preset training model may be obtained according to the second probability value by the following formula:
$$p(Y \mid X) = \sum_{A \in \mathcal{A}_{(X,Y)}} \prod_{t} p_t(a_t \mid X)$$

where Y represents the first sample recognition text, X represents the target sample image, p(Y|X) represents the first target probability value, a_t represents the character of the first sample text corresponding to the target sample image at time t (one character at each time), p_t(a_t|X) represents the first probability value, \mathcal{A}_{(X,Y)} represents the target sample image set (which comprises a plurality of target sample images), and A represents one target sample image in the set.
S34, determining the first target loss function according to the first target probability value.
For example, the first target loss function may be obtained from the first target probability value by the following formula:
$$\mathrm{loss1} = -\sum_{(X,Y)} \ln p(Y \mid X)$$

where loss1 represents the first target loss function, Y represents the first sample recognition text, X represents the target sample image, and p(Y|X) represents the first target probability value; the sum runs over the target sample image set \mathcal{A}_{(X,Y)} (which comprises a plurality of target sample images).
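This first-stage objective has the form of the standard CTC loss, which deep learning frameworks provide directly. A hedged PyTorch sketch follows; all shapes and sizes are assumed example values:

```python
import torch
import torch.nn as nn

# loss1 as a standard CTC loss over a batch of target sample images.
ctc = nn.CTCLoss(blank=0, zero_infinity=True)
log_probs = torch.randn(100, 4, 2000).log_softmax(2)    # (T, batch, classes)
targets = torch.randint(1, 2000, (40,))                 # concatenated labels
input_lengths = torch.full((4,), 100, dtype=torch.long)
target_lengths = torch.full((4,), 10, dtype=torch.long)
loss1 = ctc(log_probs, targets, input_lengths, target_lengths)
```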
And S4, training the preset training model through the first target loss function to obtain a to-be-determined recognition model.
Exemplarily, the first model training step may be executed in a loop until it is determined that the trained preset training model satisfies the first preset convergence condition according to the first target loss function, and the trained preset training model is used as the to-be-determined recognition model;
the first model training step includes:
and inputting a plurality of target sample images into the preset training model to obtain the first sample text corresponding to each target sample image output by the preset training model. The model structure of the preset training model may refer to the model structure illustrated in fig. 3.
A first loss value of the first sample text and the first sample identification text is determined according to the first target loss function. Wherein the first loss value is used for characterizing the difference degree between the first sample text and the first sample recognition text.
And under the condition that the trained preset training model does not meet the first preset convergence condition according to the first loss value, updating the parameters of the preset training model according to the first loss value to obtain the trained preset training model, and taking the trained preset training model as a new preset training model.
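Putting the first model training step together, a hedged sketch of the loop follows (all names are illustrative assumptions; the same loop shape also serves the second training stage below):

```python
import torch

def train_stage(model, loss_fn, batches, converged, lr=1e-4):
    # Loop: forward pass, loss, convergence check, parameter update.
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for images, targets, input_lens, target_lens in batches:
        log_probs = model(images)
        loss = loss_fn(log_probs, targets, input_lens, target_lens)
        if converged(loss.item()):  # the preset convergence condition
            break
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return model
```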
And S5, obtaining the second target loss function according to the sample image, the second sample identification text corresponding to the sample image and the weight coefficient of each character in the second sample identification text.
For example, as shown in fig. 10, the obtaining of the second target loss function according to the sample image, the second sample identification text corresponding to the sample image, and the weight coefficient of each character in the second sample identification text in the step S5 may include the following steps:
and S51, determining a third probability value that each character in the second sample text is correctly recognized according to the second sample recognition text.
The second sample text is obtained after the sample image is input into the to-be-determined recognition model. Since the target sample image is produced by the preset processing modes, it no longer preserves the original character distribution of the sample image. Therefore, to enable the sequence model and the decoding model to learn more information about low-frequency characters on the basis of the original character distribution, the second target loss function in this embodiment is obtained from the second sample text produced by inputting the sample image, rather than the target sample image, into the to-be-determined recognition model.
And S52, determining, according to the third probability value, a fourth probability value that the sample image is correctly recognized.
For example, the third probability values that the individual characters are correctly recognized may be multiplied together to obtain the fourth probability value that the sample image is correctly recognized.
And S53, acquiring the weight coefficient of each character in the second sample recognition text.
Specifically, the frequency values of different single characters in the second sample recognition text differ greatly: the frequency values of some high-frequency characters are close to 1, while those of some low-frequency characters are close to 0. Thus, in some embodiments, first, the frequency value of each character appearing in the second sample recognition text may be obtained. Second, the highest and lowest of these frequency values may be obtained, so that the frequency values can be normalized into (0, 1) according to the highest and lowest frequency values. Then, the weight coefficient is determined according to the frequency value, the highest frequency value, and the lowest frequency value.
For example, the weight coefficient of each character can be obtained according to the frequency value, the highest frequency value and the lowest frequency value by the following formula:
$$w_t = 0.7 - 0.5 \cdot \frac{r - r_{\min}}{r_{\max} - r_{\min}}$$

where w_t represents the weight coefficient, r represents the frequency value with which the character occurs, r_min represents the lowest frequency value, and r_max represents the highest frequency value.
With the above formula, the final weight of each character falls in the range (0.2, 0.7); that is, high-frequency characters receive lower weight coefficients and low-frequency characters receive higher ones. Characters with different frequency values can therefore be weighted differently, so that during training the to-be-determined recognition model pays more attention to low-frequency characters on the basis of the original character distribution, improving the accuracy of low-frequency character recognition.
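A hedged sketch of this weighting (the affine map into (0.2, 0.7) follows the formula above; function and variable names are assumptions):

```python
from collections import Counter
import torch

def char_weights(label_corpus, vocab):
    # Frequency value per character, min-max normalized and mapped linearly
    # into (0.2, 0.7): high-frequency characters get lower weights.
    counts = Counter(label_corpus)
    freq = torch.tensor([counts.get(c, 0) for c in vocab], dtype=torch.float)
    r = freq / freq.sum()
    r_min, r_max = r.min(), r.max()
    return 0.7 - 0.5 * (r - r_min) / (r_max - r_min)
```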
S54, determining, according to the fourth probability value and the weight coefficient, a second target probability value that the to-be-determined recognition model recognizes correctly, for example by the following formula:
p(M|N) = ∏_{t=1}^{T} p_t(b_t|N)^{w_t}
where M denotes the second sample recognition text, N denotes the sample image, p(M|N) denotes the second target probability value, w_t denotes the weight coefficient of the character at time t (one character per time step), b_t denotes the character of the second sample text at time t, p_t(b_t|N) denotes the third probability value, and (N, M) is one sample pair from the sample image set B (the sample image set comprises a plurality of sample images).
S55, determining the second target loss function according to the second target probability value, for example:
loss2 = −∑_{(N,M)∈B} log p(M|N)
where loss2 represents the second target loss function, M represents the second sample recognition text, N represents the sample image, B represents the sample image set (the sample image set comprises a plurality of sample images), and p(M|N) represents the second target probability value.
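A minimal sketch of steps S54 and S55 under the formulas above: per-character weights scale the log-probabilities, and the loss is the negative log-likelihood summed over the sample set. The helper names and data layout are assumptions:

```python
import math

def second_target_loss(batch, weights):
    """batch: (char_probs, label_text) pairs, one per sample image, where
    char_probs are the third probability values aligned with label_text;
    weights: the per-character weight coefficients from step S53."""
    loss2 = 0.0
    for char_probs, label in batch:
        # log p(M|N) = sum_t w_t * log p_t(b_t|N)
        log_p = sum(weights[ch] * math.log(p)
                    for ch, p in zip(label, char_probs))
        loss2 -= log_p
    return loss2
```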
S6, training the to-be-determined recognition model through the second target loss function to obtain the text recognition model.
Exemplarily, the second model training step below may be executed in a loop until it is determined, according to the second target loss function, that the trained to-be-determined recognition model satisfies a second preset convergence condition; the trained to-be-determined recognition model is then used as the text recognition model (a schematic loop is sketched after the following steps);
the second model training step includes:
inputting a plurality of sample images into the to-be-determined recognition model to obtain a second sample text corresponding to each sample image output by the to-be-determined recognition model;
determining a second loss value of the second sample text and the second sample identification text according to the second target loss function; wherein the second loss value is used for representing the difference degree between the second sample text and the second sample identification text;
and under the condition that the trained to-be-determined recognition model does not meet the second preset convergence condition according to the second loss value, updating the parameters of the to-be-determined recognition model according to the second loss value to obtain the trained to-be-determined recognition model, and taking the trained to-be-determined recognition model as a new to-be-determined recognition model.
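The loop can be sketched as follows, assuming a callable recognition model, a data loader yielding batches of sample images with their second sample recognition texts, and a gradient-based optimizer; testing the change in epoch loss is only one possible reading of the second preset convergence condition:

```python
def train_pending_model(model, loader, loss_fn, optimizer,
                        max_epochs=100, tol=1e-4):
    prev_loss = float("inf")
    for _ in range(max_epochs):
        epoch_loss = 0.0
        for images, label_texts in loader:
            outputs = model(images)               # second sample texts
            loss = loss_fn(outputs, label_texts)  # second loss value
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()                      # update model parameters
            epoch_loss += loss.item()
        if abs(prev_loss - epoch_loss) < tol:     # second preset convergence condition
            break
        prev_loss = epoch_loss
    return model  # used as the text recognition model
```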
In practical scenarios, the problem of uneven character distribution affects the sequence model and the decoding model of the to-be-determined recognition model the most, while its influence on the feature extraction model is small. Therefore, in some embodiments, only the parameters of the sequence model and of the decoding model may be updated according to the second loss value, while the parameters of the feature extraction model are kept unchanged, as in the sketch below.
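A sketch of this selective update in PyTorch, assuming the to-be-determined recognition model exposes its sub-models as attributes (the attribute names are hypothetical): the feature extraction model is frozen, and the optimizer only receives the parameters of the sequence model and the decoding model.

```python
import torch

def build_optimizer(model, lr=1e-4):
    # Freeze the feature extraction model so its parameters stay unchanged.
    for p in model.feature_extractor.parameters():
        p.requires_grad = False
    # Only the sequence model and the decoding model receive updates.
    return torch.optim.Adam(
        list(model.sequence_model.parameters()) +
        list(model.decoding_model.parameters()),
        lr=lr,
    )
```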
By adopting the above method, the sample image is processed in one or more preset processing manners to obtain target sample images, which effectively expands the sample set. A first target loss function is then obtained according to the target sample image and the first sample recognition text, and a second target loss function is obtained according to the sample image, the second sample recognition text, and the weight coefficient of each character in the second sample recognition text. The preset training model is then trained through the first target loss function and the second target loss function to obtain the text recognition model. Because the second target loss function incorporates the weight coefficient of each character, the text recognition model recognizes low-frequency characters more accurately, which ultimately improves the overall recognition accuracy of the text recognition model.
Fig. 11 illustrates an apparatus for text recognition according to an exemplary embodiment. As shown in fig. 11, the apparatus includes:
an image obtaining module 1101, configured to obtain a text image to be identified;
a text obtaining module 1102, configured to use the text image as an input of a pre-trained text recognition model, so as to obtain a target recognition text corresponding to the text image output by the text recognition model;
the text recognition model is obtained by training a preset training model through a first target loss function and a second target loss function, the first target loss function is obtained according to a target sample image corresponding to a sample image and a first sample recognition text corresponding to the target sample image, the second target loss function is obtained according to the sample image, a second sample recognition text corresponding to the sample image and a weight coefficient of each character in the second sample recognition text, the target sample image is an image obtained by processing the sample image according to one or more preset processing modes, and different preset processing modes correspond to different target sample images.
Optionally, the text recognition model comprises: the image decoding device comprises an image preprocessing model, a feature extraction model, a sequence model and a decoding model, wherein the output end of the image preprocessing model is coupled with the input end of the feature extraction model, and the output end of the feature extraction model is coupled with the input end of the sequence model; an output of the sequence model is coupled to an input of the decoding model;
the text obtaining module 1102 is configured to input the text image into the image preprocessing model, perform gray processing on the text image through the image preprocessing model to obtain a gray image with a fixed size corresponding to the text image, and convert the gray image into a gray image matrix to obtain a gray image rectangle output by the image preprocessing model; inputting the gray level image matrix into the feature extraction model, and performing down-sampling processing on the gray level image matrix through the feature extraction model to obtain a vector sequence output by the feature extraction model; inputting the vector sequence into the sequence model to obtain sequence characteristics output by the sequence model; and inputting the sequence characteristics into the decoding model to obtain a target recognition text corresponding to the text image output by the decoding model.
Optionally, the text recognition model is trained by:
acquiring a sample image and a second sample identification text corresponding to the sample image;
respectively processing the sample images according to one or more preset processing modes to obtain target sample images;
obtaining the first target loss function according to the target sample image and a first sample identification text corresponding to the target sample image;
training the preset training model through the first target loss function to obtain a to-be-determined recognition model;
obtaining a second target loss function according to the sample image, a second sample identification text corresponding to the sample image and the weight coefficient of each character in the second sample identification text;
and training the to-be-determined recognition model through the second target loss function to obtain the text recognition model.
Optionally, the obtaining the first target loss function according to the target sample image and the first sample recognition text corresponding to the target sample image includes:
determining a first probability value that each character in the first sample text is correctly recognized according to the first sample recognition text; the first sample text is obtained after the target sample image is input into the preset training model;
determining a second probability value for correct identification of the target sample image according to the first probability value;
determining a correct first target probability value identified by the preset training model according to the second probability value;
the first target loss function is determined based on the first target probability value.
Optionally, the training the preset training model through the first target loss function to obtain the to-be-determined recognition model includes:
circularly executing the first model training step until the trained preset training model is determined to meet a first preset convergence condition according to the first target loss function, and taking the trained preset training model as the to-be-determined recognition model;
the first model training step includes:
inputting a plurality of target sample images into the preset training model to obtain the first sample text corresponding to each target sample image output by the preset training model;
determining a first loss value of the first sample text and the first sample recognition text according to the first target loss function; wherein the first loss value is used for representing the difference degree between the first sample text and the first sample recognition text;
and under the condition that the trained preset training model does not meet the first preset convergence condition according to the first loss value, updating the parameters of the preset training model according to the first loss value to obtain the trained preset training model, and taking the trained preset training model as a new preset training model.
Optionally, the obtaining the second target loss function according to the sample image, the second sample identification text corresponding to the sample image, and the weight coefficient of each character in the second sample identification text includes:
determining a third probability value of correct recognition of each character in the second sample text according to the second sample recognition text; the second sample text is obtained after the sample image is input into the to-be-determined identification model;
determining a fourth probability value for the sample image to identify correctly according to the third probability value;
acquiring a weight coefficient of each character in the second sample identification text;
determining, according to the fourth probability value and the weight coefficient, a second target probability value that the to-be-determined recognition model recognizes correctly;
determining the second target loss function according to the second target probability value.
Optionally, the obtaining a weight coefficient of each character in the second sample recognition text includes:
acquiring the frequency value of each character in the second sample identification text;
obtaining the highest frequency value and the lowest frequency value in the frequency values;
and determining the weight coefficient according to the frequency value, the highest frequency value and the lowest frequency value.
Optionally, the training the to-be-determined recognition model through the second target loss function to obtain the text recognition model includes:
circularly executing the second model training step until the trained to-be-determined recognition model is determined to meet a second preset convergence condition according to the second target loss function, and taking the trained to-be-determined recognition model as the text recognition model;
the second model training step includes:
inputting a plurality of sample images into the to-be-determined recognition model to obtain a second sample text corresponding to each sample image output by the to-be-determined recognition model;
determining a second loss value of the second sample text and the second sample identification text according to the second target loss function; wherein the second loss value is used for representing the difference degree between the second sample text and the second sample identification text;
and under the condition that the trained to-be-determined recognition model does not meet the second preset convergence condition according to the second loss value, updating the parameters of the to-be-determined recognition model according to the second loss value to obtain the trained to-be-determined recognition model, and taking the trained to-be-determined recognition model as a new to-be-determined recognition model.
Optionally, the updating the parameters of the to-be-determined recognition model according to the second loss value includes:
updating the parameters of the sequence model and the parameters of the decoding model in the to-be-determined recognition model according to the second loss value.
By adopting the above apparatus, the sample image is processed in one or more preset processing manners to obtain target sample images, which effectively expands the sample set. A first target loss function is then obtained according to the target sample image and the first sample recognition text, and a second target loss function is obtained according to the sample image, the second sample recognition text, and the weight coefficient of each character in the second sample recognition text. The preset training model is then trained through the first target loss function and the second target loss function to obtain the text recognition model. Because the second target loss function incorporates the weight coefficient of each character, the text recognition model recognizes low-frequency characters more accurately, which ultimately improves the overall recognition accuracy of the text recognition model.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
Fig. 12 is a block diagram illustrating an electronic device 1200 in accordance with an example embodiment. As shown in fig. 12, the electronic device 1200 may include: a processor 1201 and a memory 1202. The electronic device 1200 may also include one or more of a multimedia component 1203, an input/output (I/O) interface 1204, and a communications component 1205.
The processor 1201 is configured to control the overall operation of the electronic device 1200 so as to complete all or part of the steps of the text recognition method. The memory 1202 is used to store various types of data to support operation on the electronic device 1200, such as instructions for any application or method operating on the electronic device 1200 and application-related data such as contact data, messages, pictures, audio, and video. The Memory 1202 may be implemented by any type of volatile or non-volatile memory device or a combination thereof, such as Static Random Access Memory (SRAM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Erasable Programmable Read-Only Memory (EPROM), Programmable Read-Only Memory (PROM), Read-Only Memory (ROM), magnetic memory, flash memory, magnetic disk, or optical disk. The multimedia component 1203 may include a screen and an audio component. The screen may be, for example, a touch screen, and the audio component is used for outputting and/or inputting audio signals. For example, the audio component may include a microphone for receiving external audio signals; a received audio signal may further be stored in the memory 1202 or transmitted via the communication component 1205. The audio component also includes at least one speaker for outputting audio signals. The I/O interface 1204 provides an interface between the processor 1201 and other interface modules, such as a keyboard, a mouse, or buttons, where the buttons may be virtual or physical. The communication component 1205 is used for wired or wireless communication between the electronic device 1200 and other devices. Wireless communication may be, for example, Wi-Fi, Bluetooth, Near Field Communication (NFC), 2G, 3G, 4G, 5G, NB-IoT, eMTC, or a combination of one or more of them, which is not limited herein. Accordingly, the communication component 1205 may include a Wi-Fi module, a Bluetooth module, an NFC module, and the like.
In an exemplary embodiment, the electronic Device 1200 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components for performing the above-described text recognition method.
In another exemplary embodiment, a computer-readable storage medium comprising program instructions which, when executed by a processor, implement the steps of the text recognition method described above is also provided. For example, the non-transitory computer readable storage medium may be the memory 1202 described above including program instructions executable by the processor 1201 of the electronic device 1200 to perform the method of text recognition described above.
In another exemplary embodiment, a computer program product is also provided, which comprises a computer program executable by a programmable apparatus, the computer program having code portions for performing the above-mentioned method of text recognition when executed by the programmable apparatus.
The preferred embodiments of the present disclosure are described in detail with reference to the accompanying drawings, however, the present disclosure is not limited to the specific details of the above embodiments, and various simple modifications may be made to the technical solution of the present disclosure within the technical idea of the present disclosure, and these simple modifications all belong to the protection scope of the present disclosure.
It should be noted that, in the foregoing embodiments, various features described in the above embodiments may be combined in any suitable manner, and in order to avoid unnecessary repetition, various combinations that are possible in the present disclosure are not described again.
In addition, any combination of various embodiments of the present disclosure may be made, and the same should be considered as the disclosure of the present disclosure, as long as it does not depart from the spirit of the present disclosure.

Claims (12)

1. A method of text recognition, the method comprising:
acquiring a text image to be identified;
the text image is used as the input of a pre-trained text recognition model to obtain a target recognition text corresponding to the text image output by the text recognition model;
the text recognition model is obtained by training a preset training model through a first target loss function and a second target loss function, the first target loss function is obtained according to a target sample image corresponding to a sample image and a first sample recognition text corresponding to the target sample image, the second target loss function is obtained according to the sample image, a second sample recognition text corresponding to the sample image and a weight coefficient of each character in the second sample recognition text, the target sample image is an image obtained by processing the sample image according to one or more preset processing modes, and different preset processing modes correspond to different target sample images.
2. The method of claim 1, wherein the text recognition model comprises: the image decoding device comprises an image preprocessing model, a feature extraction model, a sequence model and a decoding model, wherein the output end of the image preprocessing model is coupled with the input end of the feature extraction model, and the output end of the feature extraction model is coupled with the input end of the sequence model; an output of the sequence model is coupled to an input of the decoding model; the step of using the text image as an input of a pre-trained text recognition model to obtain a target recognition text corresponding to the text image output by the text recognition model includes:
inputting the text image into the image preprocessing model, carrying out gray scale processing on the text image through the image preprocessing model to obtain a gray scale image with a fixed size corresponding to the text image, and converting the gray scale image into the gray scale image matrix output by the image preprocessing model;
inputting the gray level image matrix into the feature extraction model, and performing down-sampling processing on the gray level image matrix through the feature extraction model to obtain a vector sequence output by the feature extraction model;
inputting the vector sequence into the sequence model to obtain sequence characteristics output by the sequence model;
and inputting the sequence characteristics into the decoding model to obtain a target recognition text corresponding to the text image output by the decoding model.
3. The method according to claim 1 or 2, wherein the text recognition model is trained by:
acquiring a sample image and a second sample identification text corresponding to the sample image;
respectively processing the sample images according to one or more preset processing modes to obtain the target sample images;
obtaining the first target loss function according to the target sample image and a first sample identification text corresponding to the target sample image;
training the preset training model through the first target loss function to obtain a to-be-determined recognition model;
obtaining a second target loss function according to the sample image, a second sample identification text corresponding to the sample image and a weight coefficient of each character in the second sample identification text;
and training the to-be-determined recognition model through the second target loss function to obtain the text recognition model.
4. The method of claim 3, wherein obtaining the first target loss function according to the target sample image and the first sample identification text corresponding to the target sample image comprises:
determining a first probability value that each character in the first sample text is correctly recognized according to the first sample recognition text; the first sample text is obtained after the target sample image is input into the preset training model;
determining a second probability value that the target sample image is correctly identified according to the first probability value;
determining a correct first target probability value identified by the preset training model according to the second probability value;
determining the first target loss function according to the first target probability value.
5. The method of claim 3, wherein the training the preset training model through the first target loss function to obtain a to-be-determined recognition model comprises:
circularly executing the first model training step until the trained preset training model is determined to meet a first preset convergence condition according to the first target loss function, and taking the trained preset training model as the to-be-determined recognition model;
the first model training step comprises:
inputting a plurality of target sample images into the preset training model to obtain the first sample text corresponding to each target sample image output by the preset training model;
determining a first loss value of the first sample text and the first sample recognition text according to the first target loss function; wherein the first loss value is used to characterize a degree of difference between the first sample text and the first sample identification text;
and under the condition that the trained preset training model does not meet the first preset convergence condition according to the first loss value, updating the parameters of the preset training model according to the first loss value to obtain the trained preset training model, and taking the trained preset training model as a new preset training model.
6. The method of claim 3, wherein the obtaining the second target loss function according to the sample image, a second sample identification text corresponding to the sample image, and a weight coefficient of each character in the second sample identification text comprises:
determining a third probability value of correct recognition of each character in the second sample text according to the second sample recognition text; the second sample text is obtained after the sample image is input into the to-be-determined recognition model;
determining a fourth probability value that the sample image is correctly identified according to the third probability value;
acquiring a weight coefficient of each character in the second sample recognition text;
determining, according to the fourth probability value and the weight coefficient, a second target probability value that the to-be-determined recognition model recognizes correctly;
and determining the second target loss function according to the second target probability value.
7. The method of claim 6, wherein obtaining a weight coefficient for each character in the second sample identification text comprises:
acquiring the frequency value of each character in the second sample identification text;
obtaining the highest frequency value and the lowest frequency value in the frequency values;
and determining the weight coefficient according to the frequency value, the highest frequency value and the lowest frequency value.
8. The method of claim 6, wherein the training the to-be-determined recognition model through the second target loss function to obtain the text recognition model comprises:
circularly executing the second model training step until the trained to-be-determined recognition model is determined to meet a second preset convergence condition according to the second target loss function, and taking the trained to-be-determined recognition model as the text recognition model;
the second model training step comprises:
inputting a plurality of sample images into the to-be-determined recognition model to obtain the second sample text corresponding to each sample image output by the to-be-determined recognition model;
determining a second loss value of the second sample text and the second sample identification text according to the second target loss function; wherein the second loss value is used to characterize a degree of difference between the second sample text and the second sample identification text;
and under the condition that the trained to-be-determined recognition model does not meet the second preset convergence condition according to the second loss value, updating the parameters of the to-be-determined recognition model according to the second loss value to obtain the trained to-be-determined recognition model, and taking the trained to-be-determined recognition model as a new to-be-determined recognition model.
9. The method of claim 8, wherein the updating the parameters of the to-be-determined recognition model according to the second loss value comprises:
updating the parameters of the sequence model and the parameters of the decoding model in the to-be-determined recognition model according to the second loss value.
10. An apparatus for text recognition, the apparatus comprising:
the image acquisition module is used for acquiring a text image to be identified;
the text acquisition module is used for taking the text image as the input of a pre-trained text recognition model so as to obtain a target recognition text corresponding to the text image output by the text recognition model;
the text recognition model is obtained by training a preset training model through a first target loss function and a second target loss function, the first target loss function is obtained according to a target sample image corresponding to a sample image and a first sample recognition text corresponding to the target sample image, the second target loss function is obtained according to the sample image, a second sample recognition text corresponding to the sample image and a weight coefficient of each character in the second sample recognition text, the target sample image is an image obtained by processing the sample image according to one or more preset processing modes, and different preset processing modes correspond to different target sample images.
11. A non-transitory computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 9.
12. An electronic device, comprising:
a memory having a computer program stored thereon;
a processor for executing the computer program in the memory to carry out the steps of the method of any one of claims 1 to 9.
CN202210476246.0A 2022-04-29 2022-04-29 Text recognition method and device, storage medium and electronic equipment Pending CN114821597A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210476246.0A CN114821597A (en) 2022-04-29 2022-04-29 Text recognition method and device, storage medium and electronic equipment


Publications (1)

Publication Number Publication Date
CN114821597A 2022-07-29

Family

ID=82511223



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination